Apache Kafka: Key concepts
A Snapshot of Key Concepts
Why Kafka?
Recently I have found myself dealing more and more with Apache Kafka when working on cloud-native architectures.
I put together this “short” introduction to Kafka terminology and basic functions to help others get their feet wet in the field of event-based software architectures.
Apache Kafka is widely used in modern systems for real-time data streaming and communication between different services.
It handles large amounts of data efficiently, making it a good choice for systems that need to process information quickly.
Key concepts and workflows
The unit of data in Kafka is known as a message.
A message is an array of bytes, meaning the data it contains does not have any specific format or significance to Kafka.
A message may include an optional piece of metadata called a key. Keys are used when messages need to be written to partitions in a more controlled way.
To optimize performance, messages are written to Kafka in batches. A batch consists of a collection of messages, all produced for the same topic and partition.
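To make this concrete, here is a minimal producer sketch in Java, assuming a local broker at localhost:9092 and a placeholder topic named orders (the class and key names are illustrative, not from any real system). It attaches a key to each message and tunes the two standard batching settings, batch.size and linger.ms:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedMessageExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Batching knobs: collect up to 16 KB per partition batch, or wait up
        // to 10 ms, before shipping a batch to the broker.
        props.put("batch.size", "16384");
        props.put("linger.ms", "10");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages sharing the key "customer-42" will land in the same partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
        }
    }
}
```

With the default partitioner, messages that share a key always hash to the same partition, which is exactly the “more controlled way” of writing mentioned above.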
A partition is a single, append-only log: messages are written to it and read sequentially from start to finish.
A topic generally contains multiple partitions, but there is no guarantee of message ordering across the entire topic, only within individual partitions.
Partitions are how Kafka provides scalability and redundancy.
Each partition can be hosted on a different server (called a broker), enabling a topic to be scaled horizontally across multiple servers.
Partitions can be replicated, allowing different servers to store copies of the same partition in case one server fails.
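As a quick illustration, here is a sketch using Kafka's AdminClient to create such a topic; the broker address, topic name, and counts are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local cluster

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions let the topic scale across brokers; a replication
            // factor of three keeps copies on three brokers for redundancy.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```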
A stream is considered a single topic of data, regardless of how many partitions it contains. This represents a continuous flow of data from producers to consumers.
Kafka Clients
Kafka clients are users of the system, divided into two main types: producers and consumers.
Producers generate new messages for a specific topic.
Consumers read messages from one or more topics in the order they were produced to each partition. Consumers track which messages they have already consumed by noting the message offsets.
The offset, a monotonically increasing integer, is another piece of metadata that Kafka assigns to each message as it is written to a partition.
By storing the next available offset for each partition, a consumer can stop and restart without losing its place in the stream.
Consumers operate within a consumer group, consisting of one or more consumers that collaborate to consume a topic.
The group ensures that each partition is only consumed by one member at a time.
The assignment of a consumer to a partition is often referred to as ownership of that partition by the consumer. If one consumer fails, the remaining members of the group rebalance the partitions to take over for the missing member.
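Below is a minimal consumer sketch, reusing the assumed local broker and placeholder orders topic from the earlier example. Every instance started with the same group.id joins the same consumer group, and Kafka divides the topic's partitions among them:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "order-processors");        // instances sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");         // we commit offsets manually below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset is this consumer's bookmark within the partition.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Committing stores the next offset to read, so a restarted
                // consumer picks up exactly where this one left off.
                consumer.commitSync();
            }
        }
    }
}
```

Start a second copy of this program and the group rebalances: each partition ends up owned by exactly one of the two instances.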
Kafka Infrastructure
A single Kafka server is known as a broker. The broker receives messages from producers, assigns offsets, and writes them to storage on disk.
The broker also serves consumers by responding to their requests for partition data with the published messages.
Kafka brokers are designed to operate within a cluster.
Each partition is owned by a single broker in the cluster, known as the partition’s leader.
Replicated partitions are assigned to additional brokers, which are referred to as followers of the partition.
Replication ensures message redundancy in a partition, so one of the followers can assume leadership if the leader broker fails.
Producers must connect to the leader to publish messages, while consumers may fetch messages from either the leader or one of the followers.
Retention in Kafka refers to the durable storage of messages for a limited period of time or up to a limited size.
Kafka brokers apply a default retention configuration to topics, keeping messages either for a set time period (seven days, by default) or until the partition reaches a defined size in bytes.
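For example, retention can be tuned per topic. This sketch (with placeholder names and limits) tells the brokers to keep messages on the hypothetical orders topic for roughly seven days, or until a partition holds about 1 GB, whichever comes first:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local cluster

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
        Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),     // ~7 days
                        AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), // ~1 GB per partition
                        AlterConfigOp.OpType.SET));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```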
We begin producing messages to Kafka by creating a message, which must include the topic we want to send the record to and a value. Optionally, we can also provide a key, a partition, a timestamp, and/or a set of headers.
Once we send the message, the producer first serializes the key and value objects into byte arrays so they can be transmitted over the network.
If we do not explicitly specify a partition, the data is sent to a partitioner. The partitioner selects a partition for us, typically based on the message key.
When the broker receives the messages, it sends back a response. If the messages were successfully written to Kafka, it returns a RecordMetadata object containing the topic, partition, and the offset of the message within the partition.
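Putting the send path together, here is a hedged sketch that sends one record and reads the broker's response via a callback; the broker address, topic, and key are the same assumptions as before:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class SendWithCallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No partition is specified, so the default partitioner derives one from the key.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order shipped");

            producer.send(record, (RecordMetadata metadata, Exception exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // the write failed
                } else {
                    // The broker's response: where the message ended up.
                    System.out.printf("topic=%s partition=%d offset=%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes the batch, so the callback fires before exit
    }
}
```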
What about Apache Avro?
Apache Avro is a language-neutral data serialization format, and many Kafka developers prefer it as their serialization framework.
A common architectural pattern pairs Avro with a Schema Registry. The Schema Registry is not a component of Apache Kafka itself; the idea is to store all the schemas used to write data to Kafka in a shared registry.
With well-defined schemas stored in that shared repository, producers and consumers can interpret Kafka messages without coordinating with one another directly.
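As a sketch of how this can look in practice, the example below uses Confluent's KafkaAvroSerializer and Schema Registry, which are third-party components rather than part of Apache Kafka itself; the registry URL, topic name, and schema are all assumptions for illustration:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AvroProducerExample {
    // A hypothetical Avro schema for an order event.
    private static final String ORDER_SCHEMA =
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        // The serializer registers and looks up schemas here, instead of
        // shipping the full schema inside every message.
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

        Schema schema = new Schema.Parser().parse(ORDER_SCHEMA);
        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-1001");
        order.put("amount", 42.0);

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders-avro", "order-1001", order));
        }
    }
}
```

Consumers then use the matching KafkaAvroDeserializer, fetching the schema by id from the registry to decode each message.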
Outro
Apache Kafka is great for handling real-time data, offering durable message storage, straightforward producer and consumer APIs, and horizontal scalability.
Paired with tools like a Schema Registry and formats such as Apache Avro, it also simplifies how services share data.
This makes Kafka a reliable choice for building fast and efficient systems.
Before you go 😉
I hope this “quick” summary was insightful! More articles about Apache Kafka will be shared soon.
Let’s connect and continue the conversation.
In the coming articles, we will dive deeper into Kafka waters and learn how Kafka can be managed within production-ready architectures to build real-world data pipelines.