Stream processing is the continuous processing of unbounded streams of data. More on that in the Stream Processing notes.

Traditionally, distributed stream processing has been implemented with a record-at-a-time processing model, like the following:

[Figure: a record-at-a-time processing pipeline as a directed graph of nodes]

The processing pipeline is composed of a directed graph of nodes: each node continuously receives one record at a time, processes it, and forwards the generated record(s) to the next node in the graph. This model can achieve very low latencies, but it is not very efficient at recovering from node failures and slow (straggler) nodes.
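To make the model concrete, here is a minimal toy sketch in plain Python (not any real engine's API; all names are made up for illustration): each node handles one record at a time and emits its results to the next node in the graph.

```python
# Toy sketch of a record-at-a-time pipeline (hypothetical, framework-agnostic).
# Each node processes a single record and forwards results downstream.

def parse(record, emit):
    # Node 1: split a raw line into words.
    for word in record.split():
        emit(word)

def count(word, emit, state={}):
    # Node 2: keep a running count per word
    # (mutable default dict used deliberately as node-local state).
    state[word] = state.get(word, 0) + 1
    emit((word, state[word]))

def sink(pair, emit):
    # Node 3: terminal node, just prints the running counts.
    print(pair)

def run(stream, nodes):
    # Wire the nodes into a linear graph: parse -> count -> sink.
    def make_emit(i):
        if i == len(nodes):
            return lambda _: None
        return lambda rec: nodes[i](rec, make_emit(i + 1))
    entry = make_emit(0)
    for record in stream:  # records arrive and are processed one at a time
        entry(record)

run(["hello world", "hello stream"], [parse, count, sink])
```

Note that the per-node state (the counts dict) lives on whichever machine runs that node, which is exactly what makes failure recovery awkward in this model.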

The Advent of Micro-Batch Stream Processing

Spark Streaming (DStreams) introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch processing jobs on small chunks of the stream data.

[Figure: an input stream divided into micro-batches, each processed as a small batch job]

Spark Streaming divides the data from the input stream into small batches. Each batch is processed in the Spark cluster in a distributed manner, with small deterministic tasks that generate the output in micro-batches. Doing so has two advantages:

- Spark's agile task scheduling can recover from failures and straggler executors quickly by rescheduling copies of a task on other executors.
- The deterministic nature of the tasks ensures that the output is the same no matter how many times a task is re-executed, which makes end-to-end exactly-once processing guarantees possible.

Although this approach introduces extra latency compared to the record-at-a-time model (from milliseconds to a few seconds), most pipelines either do not need latencies below a few seconds or are already dominated by delays elsewhere in the pipeline.

The DStream API was built on Spark's batch RDD API, so DStreams have the same functional semantics and fault-tolerance model as RDDs.
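For illustration, here is a minimal DStream word count in PySpark (the socket source, host, and port are placeholder choices): every batch interval, the received records become an RDD that is processed with the familiar RDD-style operators.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchWordCount")
# Group incoming records into 1-second micro-batches.
ssc = StreamingContext(sc, batchDuration=1)

# Placeholder source: lines read from a socket (host/port are examples).
lines = ssc.socketTextStream("localhost", 9999)

# Same functional operators as the batch RDD API, applied per micro-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()          # print each micro-batch's counts
ssc.start()              # start receiving and processing data
ssc.awaitTermination()   # run until explicitly stopped
```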

Several Drawbacks of Spark Streaming

Philosophy of Structured Streaming

For developers, writing stream processing pipelines should be as easy as writing batch pipelines.
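As a sketch of that philosophy (PySpark Structured Streaming; the socket source and its options are placeholders), note how the streaming word count below reads almost exactly like its batch DataFrame counterpart, with read/write swapped for readStream/writeStream.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Streaming read: same DataFrame API as batch, readStream instead of read.
lines = (spark.readStream
              .format("socket")             # placeholder source
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Identical transformation logic to a batch word count.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Streaming write: writeStream instead of write; Spark runs it incrementally.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```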