Friday, November 28, 2014

Spark Streaming

In this post, we'll discuss another important topic of big data processing: real-time stream processing area.  This is an area where Hadoop falls short because of its high latency, and another open source framework Storm is developed to cover the need in real-time processing.  Unfortunately, Hadoop and Storm provides quite different programming model, resulting in high development and maintenance cost.

Continue from my previous post on Spark, which provides a highly efficient parallel processing framework.  Spark streaming is a natural extension of its core programming paradigm to provide large-scale, real-time data processing.  The biggest benefits of using Spark Streaming is that it is based on a similar programming paradigm of its core and there is no need to develop and maintain a completely different programming paradigm for batch and realtime processing.


Spark Core Programming Paradigm Recap

The core Spark programming paradigm consists of the following steps ...
  1. Taking input data from an external data source and create an RDD (a distributed data set across many servers)
  2. Transform the RDD to another RDD (these transformation defines a direct acyclic graph of dependencies between RDD)
  3. Output the final RDD to an external data source


Notice that the RDD is immutable, therefore the sequence of transformations is deterministic and therefore recovery from intermediate processing failure is simply by tracing back to the parent of the failure node (in the DAG) and redo the processing from there.


Spark Streaming

Spark Streaming introduce a data structure call DStream which is basically a sequence of RDD where each RDD contains data associated with a time interval.  DStream is created with a frequency parameters which defines the frequency RDD creation into the sequence.

Transformation of a DStream boils down to transformation of each RDD (within the sequence of RDD that the DStream contains).  Within the transformation, the RDD inside the DStream can "join" with another RDD (outside the DStream), hence provide a mix processing paradigm between DStream and other RDDs.  Also, since each transformation produces an output RDD, the result of transforming a DStream results in another sequence of RDDs that defines an output DStream.

Here is the basic transformation where each RDD in the output DStream has a one to one correspondence with each RDD in the input DStream. 


Instead of performing a 1 to 1 transformation of each RDD in the DStream.  Spark streaming enable a sliding window operation by defining a WINDOW which groups consecutive RDDs along the time dimension.  There are 2 parameters that the window is defined ...
  1. Window length: defines how many consecutive RDDs will be combined for performing the transformation.
  2. Slide interval: defines how many RDD will be skipped before the next transformation executes.


By providing a similar set of transformation operation for both RDD and DStream, Spark enable a unified programming paradigm across both batch and real-time processing, and hence reduce the corresponding development and maintenance cost.

No comments: