Stream processing is defined as the continuous processing of endless streams of data.
Traditionally, distributed stream processing has been implemented with a record-at-a-time processing model, as follows.
The processing pipeline is composed of a directed graph of nodes: each node continuously receives one record at a time, processes it, and then forwards the generated record(s) to the next node in the graph. This model can achieve very low latencies, but it's not very efficient at recovering from node failures and slow nodes.
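To make the model concrete, here is a minimal, illustrative sketch of a record-at-a-time node graph. The `Node` class and its wiring are hypothetical, not any real framework's API:

```scala
// Toy record-at-a-time pipeline; the Node class is illustrative only,
// not any real framework's API.
class Node[I, O](transform: I => O, downstream: Seq[Node[O, _]] = Nil) {
  def receive(record: I): Unit = {
    val out = transform(record)
    if (downstream.isEmpty) println(out)      // leaf node acts as a sink
    else downstream.foreach(_.receive(out))   // forward to the next nodes
  }
}

object RecordAtATimeDemo extends App {
  // Records flow through the graph one at a time: very low latency, but if
  // a node fails, its in-flight records must be replayed to recover.
  val sink    = new Node[Int, Int](identity)
  val doubler = new Node[Int, Int](_ * 2, Seq(sink))
  Seq(1, 2, 3).foreach(doubler.receive)
}
```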
Spark Streaming (DStreams) introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch jobs on small chunks of the stream data.
Spark Streaming divides the data from the input stream into small batches. Each batch is processed in the Spark cluster in a distributed manner, with small deterministic tasks that generate the output in micro-batches. Doing so has two advantages:

- Fast recovery from failures and stragglers: because the tasks are small and deterministic, Spark can quickly recover from a failed or slow node by rescheduling copies of its tasks on any other executor.
- Exactly-once guarantees: deterministic tasks produce the same output no matter how many times they are re-executed, which makes end-to-end exactly-once processing guarantees possible.
Although doing so introduces extra latency compared to the traditional approach, raising it from milliseconds to a few seconds, most pipelines either don't need latencies below a few seconds, or are already hindered by larger delays elsewhere in the pipeline.
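As a concrete sketch of the micro-batch model, here is the canonical DStream word count, assuming a local socket source on port 9999 purely for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Every 1-second chunk of input becomes one micro-batch.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Lines received on a socket (hypothetical local source for this sketch).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary map/reduce-style logic, run as a small batch job per micro-batch.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()            // start scheduling micro-batch jobs
    ssc.awaitTermination() // run until explicitly stopped
  }
}
```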
The DStream API was built on Spark's batch RDD API, and therefore has the same functional semantics and fault-tolerance model as RDDs.
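Because each micro-batch is materialized as a plain RDD, the full batch RDD API is available inside a streaming job. A minimal sketch, again assuming a local socket source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRDDs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("DStreamAsRDDs"),
      Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the RDD behind each micro-batch, with the same
    // lineage-based fault tolerance as any other RDD computation.
    lines.foreachRDD { rdd =>
      println(s"Records in this 1-second batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```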
For developers, writing stream processing pipelines should be as easy as writing batch pipelines.