Stream processing is defined as the continuous processing of endless streams of data.
Traditionally, distributed stream processing has been implemented with a record-at-a-time processing model, as follows.
The processing pipeline is composed of a directed graph of nodes: each node continuously receives one record at a time, processes it, and then forwards the generated record(s) to the next node in the graph. This model can achieve very low latencies, but it's not very efficient at recovering from node failures and slow nodes.
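To make the model concrete, here is a minimal, illustrative sketch of a record-at-a-time node graph. The `Node` class and its wiring are hypothetical, not any real framework's API:

```scala
// Toy record-at-a-time pipeline; the Node class is illustrative only,
// not any real framework's API.
class Node[I, O](transform: I => O, downstream: Seq[Node[O, _]] = Nil) {
  def receive(record: I): Unit = {
    val out = transform(record)
    if (downstream.isEmpty) println(out)      // leaf node acts as a sink
    else downstream.foreach(_.receive(out))   // forward to the next nodes
  }
}

object RecordAtATimeDemo extends App {
  // Records flow through the graph one at a time: very low latency, but if
  // a node fails, its in-flight records must be replayed to recover.
  val sink    = new Node[Int, Int](identity)
  val doubler = new Node[Int, Int](_ * 2, Seq(sink))
  Seq(1, 2, 3).foreach(doubler.receive)
}
```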
Spark Streaming (DStreams) introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch jobs on small chunks of the stream data.
Spark Streaming divides the data from the input stream into small batches. Each batch is processed in the Spark cluster in a distributed manner, with small deterministic tasks that generate the output in micro-batches. Doing so has two advantages:

- Fast recovery from failures and stragglers: because the tasks are small and deterministic, Spark can quickly recover from a failed or slow node by rescheduling copies of its tasks on any other executor.
- Exactly-once guarantees: deterministic tasks produce the same output no matter how many times they are re-executed, which makes end-to-end exactly-once processing guarantees possible.
Although doing so introduces extra latency compared to the traditional approach, raising it from milliseconds to a few seconds, most pipelines either don't need latencies below a few seconds, or are already hindered by larger delays elsewhere in the pipeline.
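As a concrete sketch of the micro-batch model, here is the canonical DStream word count, assuming a local socket source on port 9999 purely for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Every 1-second chunk of input becomes one micro-batch.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Lines received on a socket (hypothetical local source for this sketch).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary map/reduce-style logic, run as a small batch job per micro-batch.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()            // start scheduling micro-batch jobs
    ssc.awaitTermination() // run until explicitly stopped
  }
}
```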
The DStream API was built on Spark's batch RDD API, and therefore has the same functional semantics and fault-tolerance model as RDDs.
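Because each micro-batch is materialized as a plain RDD, the full batch RDD API is available inside a streaming job. A minimal sketch, again assuming a local socket source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRDDs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("DStreamAsRDDs"),
      Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the RDD behind each micro-batch, with the same
    // lineage-based fault tolerance as any other RDD computation.
    lines.foreachRDD { rdd =>
      println(s"Records in this 1-second batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```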
For developers, writing stream processing pipelines should be as easy as writing batch pipelines.