Side note: I have a specific note for Apache Kafka, see

Kafka

One big assumption of batch processing is that the input is bounded (it has a known and finite size), but in reality a lot of data is unbounded, arriving gradually over time.

The idea behind Stream Processing is thus to process a second's worth of data at the end of every second, or even to run continuously and process every event as it happens.

Transmitting Event Streams

Idea: what is the equivalent of the filesystem for streaming events?

When the input is a file, the first parsing step typically breaks it down into a sequence of records. In a stream processing context, a record is known as an event — a small, self-contained, immutable object. An event usually contains a timestamp as well.
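As a rough illustration (the class and field names below are made up, not from any particular system), an event can be modeled as a small immutable record carrying a payload and a timestamp, grouped under a topic:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# A minimal sketch of an event: small, self-contained, and immutable
# (frozen=True prevents mutation after creation). Field names are
# illustrative only.
@dataclass(frozen=True)
class Event:
    topic: str        # related events are grouped into a topic/stream
    payload: dict     # self-contained data describing what happened
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

event = Event(topic="page_views", payload={"user_id": 42, "url": "/home"})
```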

In batch processing, a file is written once and then potentially read by multiple jobs. In streaming terminology, an event is generated once by a producer, then potentially processed by multiple consumers. In a streaming system, related events are usually grouped together into a topic or stream.

Messaging Systems

A common approach for notifying consumers about new events is to use a messaging system: a producer sends a message containing the event, which is then pushed to consumers.

Unlike a direct communication channel (a TCP connection or Unix pipe), a messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages from a topic. This is called the publish/subscribe model. Two questions help differentiate systems within this model:

  1. What happens if producers send messages faster than the consumers can process them?

    e.g. drop messages, buffer messages in a queue, or apply backpressure (flow control); a minimal sketch of these options follows this list

  2. What happens if nodes crash or temporarily go offline — are any messages lost?
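To make these options concrete, here is a minimal sketch (pure standard-library Python, not any real messaging product) of the three ways a system can react when producers outpace consumers:

```python
import queue

# Toy sketch of the three overload-handling strategies.
bounded = queue.Queue(maxsize=1000)   # fixed-size buffer between producer and consumer
unbounded = queue.Queue()             # maxsize=0: the queue can grow without bound

def publish_drop(message):
    """Option 1: drop messages once the buffer is full."""
    try:
        bounded.put_nowait(message)
    except queue.Full:
        pass                          # message is silently discarded

def publish_buffer(message):
    """Option 2: buffer messages in an (effectively) unbounded queue."""
    unbounded.put(message)            # never blocks, but memory use can grow forever

def publish_backpressure(message):
    """Option 3: backpressure / flow control -- block the producer until
    the consumer has made room in the bounded queue."""
    bounded.put(message, block=True)  # producer slows down to the consumer's pace
```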

Direct messaging from producers to consumers

This kind of messaging system uses direct network communication between producers and consumers, without going via intermediary nodes. One classic example is UDP multicast, with application-level protocols to recover lost packets.

Since the application code has to be aware of the possibility of message loss in this approach, fault tolerance is very limited: these systems generally assume that producers and consumers are constantly online.
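A bare-bones sketch of this direct style, using plain UDP unicast for simplicity (real deployments would use multicast plus an application-level recovery protocol; the address below is made up):

```python
import socket

# Direct producer-to-consumer messaging over UDP: there is no broker in
# between, so if the consumer process is not running when the datagram
# arrives, the message is simply lost.

CONSUMER_ADDR = ("127.0.0.1", 9999)     # illustrative address, not a real service

def producer_send(event: bytes) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(event, CONSUMER_ADDR)   # fire-and-forget; no delivery guarantee

def consumer_loop() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(CONSUMER_ADDR)
        while True:
            data, _ = sock.recvfrom(65535)  # blocks until the next datagram arrives
            print("received event:", data)
```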

Message Brokers

Also known as message queues, a message broker is essentially a kind of database that is optimized for handling message streams. It runs as a server, with producers and consumers connecting to it as clients. Producers write messages to the broker, and consumers receive them by reading them from the broker.

By centralizing the data in the broker, durability also depends entirely on the broker. Some message brokers only keep messages in memory, while others write them to disk. Depending on the configuration, some brokers allow unbounded queueing, while others drop messages or apply backpressure when the queue grows too large.

A consequence of queueing is also that consumers are generally asynchronous: delivery to a consumer happens at some undetermined later point in time.
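A toy in-process sketch (not a real broker) of this asynchronous behavior: the producer hands the message to the broker and returns immediately, while a consumer thread picks it up at some later time. Since messages live only in memory here, durability depends entirely on the broker process staying alive:

```python
import queue
import threading
import time

class TinyBroker:
    """Toy in-memory broker: messages are lost if this process dies."""

    def __init__(self):
        self.topics = {}   # topic name -> queue of pending messages

    def publish(self, topic: str, message: str) -> None:
        # Producer hands the message to the broker and returns immediately.
        self.topics.setdefault(topic, queue.Queue()).put(message)

    def subscribe(self, topic: str, handler) -> None:
        q = self.topics.setdefault(topic, queue.Queue())

        def consume():
            while True:
                handler(q.get())   # delivery happens whenever the consumer gets here

        threading.Thread(target=consume, daemon=True).start()

broker = TinyBroker()
broker.subscribe("page_views", lambda m: print("consumed:", m))
broker.publish("page_views", "user 42 viewed /home")   # producer does not wait
time.sleep(0.1)                                        # give the consumer thread a moment
```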

Message Brokers compared to databases