Before introducing Kafka, let’s first introduce the concept of the publish/subscribe model and why it is a critical component of data-driven applications. Publish/subscribe (pub/sub) messaging is a pattern in which the sender (publisher) of a piece of data (message) does not direct it to a specific receiver; instead, receivers (subscribers) subscribe to receive certain classes of messages. These systems often have a broker, a central point where messages are published, to facilitate this pattern.

Apache Kafka was developed as a pub/sub messaging system that’s described as a “distributed commit log” or, more recently, as a “distributed streaming platform”. A commit log is designed to provide a durable record of all transactions so that they can be replayed to consistently build the state of the system. Data within Kafka is stored durably, in order, and can be read deterministically. In addition, the data can be distributed within the system to provide additional protection against failures.

Messages and Batches

Kafka's unit of data is called a message. Each message can have an optional piece of metadata called a key. Keys are used to control how messages are written to partitions, such as ensuring that messages with the same key are written to the same partition.

For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition.
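The key-to-partition mapping can be sketched as a simple hash-and-modulo scheme. This is only an illustration (the real Kafka default partitioner uses murmur2 hashing, and the function name here is made up), but it shows why messages with the same key always land in the same partition:

```python
# Toy sketch of key-based partition selection (illustration only;
# Kafka's default partitioner actually uses murmur2 hashing).
import zlib
from typing import Optional


def choose_partition(key: Optional[bytes], num_partitions: int, counter: int) -> int:
    """Messages with the same key always map to the same partition;
    keyless messages are simply spread across partitions in turn."""
    if key is None:
        return counter % num_partitions
    # A stable hash keeps the key-to-partition mapping consistent across runs.
    return zlib.crc32(key) % num_partitions
```

Because the hash is deterministic, `choose_partition(b"user-42", 6, 0)` returns the same partition every time it is called with that key, which is exactly the guarantee described above.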

Schemas

While messages are just byte arrays to Kafka, it is recommended that additional structure, or schema, be imposed on the message content so that it can be easily understood. A common choice of schema for Kafka is Apache Avro.
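Avro schemas are themselves defined in JSON. As a hypothetical illustration (the record and field names here are invented), a schema for a customer-event message might look like:

```json
{
  "namespace": "example.events",
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "customer_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
```

Producers serialize messages against a schema like this, and consumers use the same schema to deserialize them, which is what makes the decoupling described below possible.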

A consistent data format is important in Kafka, as it allows writing and reading messages to be decoupled. By using well-defined schemas and storing them in a common repository, the messages in Kafka can be understood without coordination.

Topics and Partitions

Messages in Kafka are categorized into topics. A topic can be thought of as a database table or a folder in a filesystem. Each topic is further divided into a number of partitions, where each partition is a single log. Messages are written to each partition in an append-only fashion and are read in order from the beginning to the end. Partitions can be hosted on different servers and can be replicated to achieve fault tolerance.

The term stream is often used when discussing data within systems like Kafka. Most often, a stream is considered to be a single topic of data, regardless of the number of partitions. This represents a single stream of data moving from the producers to the consumers.
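The append-only, per-partition ordering described above can be sketched with a toy in-memory model. This is purely illustrative (real Kafka partitions are on-disk segment files replicated across brokers), but it captures the essential behavior: appends get increasing offsets, and reads proceed in order from a starting offset:

```python
# Toy in-memory model of a topic divided into partitions (illustration only).
class Partition:
    def __init__(self):
        self.log = []             # append-only list of messages

    def append(self, message) -> int:
        self.log.append(message)
        return len(self.log) - 1  # the offset assigned to the new message

    def read_from(self, offset: int):
        # Messages are read in order, from a starting offset to the end.
        return self.log[offset:]


class Topic:
    def __init__(self, num_partitions: int):
        self.partitions = [Partition() for _ in range(num_partitions)]
```

Note that ordering is guaranteed only within a single partition, not across the whole topic, which is why key-based partitioning matters when per-key ordering is needed.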

Producers and Consumers

Kafka clients are users of the system, and there are two basic types: producers and consumers. There are also advanced client APIs, but they still use producers and consumers as building blocks and provide higher-level functionality on top.

Producers create new messages. A message will be produced to a specific topic, and by default, the producer will balance messages over all partitions of a topic evenly. A producer can also direct messages to specific partitions based on particular needs.

Consumers read messages. The consumer subscribes to one or more topics and reads the messages in the order in which they were produced to each partition. The consumer keeps track of which messages it has already consumed by tracking the offset of messages. The offset, a continuously increasing integer, is another piece of metadata that Kafka adds to each message as it is produced, and each message in a given partition has a unique offset. By storing the next possible offset for each partition, a consumer can stop and restart without losing its place.
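The stop-and-resume behavior can be sketched as follows. This is a simplification (real consumers commit offsets back to Kafka rather than keeping them in a local dict), but it shows how storing the next offset per partition lets a consumer pick up exactly where it left off:

```python
# Toy sketch of offset tracking (illustration only; real Kafka
# consumers commit offsets to the cluster, not to local state).
class TinyConsumer:
    """Tracks the next offset to read per partition, so the consumer
    can stop and resume without re-reading or skipping messages."""

    def __init__(self):
        self.next_offset = {}   # partition id -> next offset to fetch

    def poll(self, partition_id: int, log: list) -> list:
        start = self.next_offset.get(partition_id, 0)
        batch = log[start:]     # everything not yet consumed
        self.next_offset[partition_id] = start + len(batch)
        return batch
```

After a restart, as long as the stored offsets survive, the consumer resumes from the first unread message in each partition.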

Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. The group ensures that each partition is only consumed by one member. The mapping of a consumer to a partition is often called ownership of the partition by the consumer.
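The "each partition has exactly one owner within the group" rule can be sketched as a simple round-robin assignment. In reality the assignment is negotiated through a group coordinator and rebalanced as members join and leave; this function is only a stand-in for that logic:

```python
# Toy sketch of consumer-group partition assignment (illustration only;
# real assignment is negotiated via Kafka's group coordinator).
def assign_partitions(consumers: list, num_partitions: int) -> dict:
    """Map each partition to exactly one consumer, round-robin.
    If there are more consumers than partitions, some sit idle."""
    return {p: consumers[p % len(consumers)] for p in range(num_partitions)}
```

Because every partition appears exactly once as a key, no two consumers in the group ever own the same partition, which is the invariant the group protocol enforces.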

Brokers and Clusters

A Kafka server is referred to as a broker. The broker receives messages from producers, assigns offsets to them, and writes the messages to disk storage. Additionally, it services consumers by responding to fetch requests for partitions and delivering the published messages.

Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller. The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. A replicated partition is assigned to additional brokers, called followers of the partition. When a failover occurs, one of the followers can take over leadership. All producers must connect to the leader in order to publish messages, but consumers may fetch from either the leader or one of the followers.
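The failover step can be sketched as: if the leader's broker is down, promote a surviving follower. This is a heavy simplification of what the controller actually does (which also involves in-sync replica tracking), but it illustrates the idea:

```python
# Toy sketch of leader failover (illustration only; the real controller
# also considers which replicas are in sync before promoting one).
def elect_leader(replicas: list, alive: set):
    """Return the current leader if its broker is alive, otherwise the
    first surviving follower. replicas[0] is the preferred leader."""
    for broker in replicas:
        if broker in alive:
            return broker
    return None  # no replica survives; the partition is offline
```

Walking the replica list in order means leadership falls back to the preferred leader once its broker rejoins the set of live brokers.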

A key feature of Kafka is retention, which is the durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for a period of time or until a size limit is reached. Once these limits are reached, messages are expired and deleted. Individual topics can also be configured with their own retention settings so that messages are stored for only as long as they are useful.
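The two retention limits can be sketched as: drop messages older than the time window, then drop the oldest messages until the log fits the byte budget. This hypothetical helper is not how Kafka implements it (Kafka deletes whole log segments, not individual messages), but it shows the effect of the two limits:

```python
# Toy sketch of time- and size-based retention (illustration only;
# Kafka actually expires whole on-disk log segments).
def apply_retention(log, now, max_age, max_bytes):
    """log is a list of (timestamp, payload_bytes) tuples, oldest first.
    Returns the messages that survive both retention limits."""
    # Time limit: expire anything older than the retention window.
    kept = [(ts, p) for ts, p in log if now - ts <= max_age]
    # Size limit: drop the oldest messages until under the byte budget.
    while sum(len(p) for _, p in kept) > max_bytes:
        kept.pop(0)
    return kept
```

In both cases the oldest data goes first, so consumers that keep up with the stream are unaffected by retention.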