Apache Kafka Architecture — Kafka Component Overview
Let’s start by answering the question “What is Kafka?”.
Kafka is a Distributed Streaming Platform or a Distributed Commit Log.
Kafka is a distributed streaming platform which can be used for building real-time data pipelines and streaming apps. It is a highly scalable, fault-tolerant distributed system.
Though Kafka started as a publish-subscribe messaging system (similar to message queues or enterprise messaging systems, but not exactly the same), over the years it has evolved into a complete streaming platform.
The main Kafka components are topics, producers, consumers, consumer groups, clusters, brokers, partitions, replicas, leaders, and followers.
The following diagram offers a simplified look at the interrelations between these components.
Use Cases:
- Messaging system
- Activity tracking
- Gathering metrics from many different locations
- Application log gathering
- Stream processing
- Decoupling of system dependencies
- Integration with big data technologies such as Hadoop and Spark
Examples:
Netflix: to power real-time recommendations of TV shows
LinkedIn: to prevent spam
Uber: to gather user, taxi, and trip data in real time
Kafka API Architecture
Producer API
The Kafka Producer API enables an application to publish a stream of records to one or more Kafka topics.
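As a rough sketch of this API (the broker address localhost:9092 and the topic name demo-topic are placeholder values), a minimal producer using the official Java client could look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer and flushes pending records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a single record to the (hypothetical) demo-topic topic
            producer.send(new ProducerRecord<>("demo-topic", "hello kafka"));
        }
    }
}
```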
Consumer API
The Kafka Consumer API enables an application to subscribe to one or more Kafka topics. It also makes it possible for the application to process streams of records that are produced to those topics.
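A minimal consumer sketch in the same spirit (broker address, group ID, and topic name are again placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "demo-group");              // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // poll blocks for up to one second waiting for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```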
Streams API
The Kafka Streams API allows an application to process data in Kafka using a streams processing paradigm. With this API, an application can consume input streams from one or more topics, process them with streams operations, and produce output streams and send them to one or more topics. In this way, the Streams API makes it possible to transform input streams into output streams.
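As an illustrative sketch (the application ID and topic names are made up), here is a Streams application that consumes an input topic, uppercases each record value, and produces the result to an output topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // placeholder ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume the input stream, transform each value, send to the output topic
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```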
Connect API
The Kafka Connect API connects applications or data systems to Kafka topics. It provides options for building and managing reusable producers and consumers that move data between Kafka and existing systems. For instance, a connector could capture every update to a database and make those changes available within a Kafka topic.
Kafka Cluster Architecture
Topics, Partitions and Offsets
Kafka Topics: a particular stream of data
- Similar to a table in a database
- You can have as many topics as you want
- A topic is identified by its name
- Topics are split into partitions
- Each partition is ordered
- Each message within a partition gets an incremental ID, called an offset
- Offsets only have meaning within a specific partition
Example: offset 3 in partition 0 does not represent the same data as offset 3 in partition 1
- Order is guaranteed only within a partition (not across partitions)
- Data is kept only for a limited period of time (the default retention is 1 week)
- Once data is written to a partition, it cannot be changed (immutability)
- Data is assigned randomly to a partition unless a key is provided
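Topics can also be created programmatically. Here is a minimal sketch using the Java AdminClient (the topic name, partition count, and replication factor are placeholder values):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic with 3 partitions and a replication factor of 2 (both placeholders)
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```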
Kafka Brokers: A Kafka broker manages the storage of messages in the topic(s). If Kafka has more than one broker, that is what we call a Kafka cluster.
- A Kafka cluster is made of multiple brokers (servers)
- Each broker is identified by its ID
- Each broker contains certain topic partitions
- After connecting to any broker (called bootstrap broker), you will be connected to the entire cluster
- A good number to get started with is 3 brokers, but some big clusters have more than 100 brokers
Example: in these examples we number brokers starting at 100 (an arbitrary choice)
Brokers and Topics
Topic Replication Factor
- Topics should have a replication factor > 1 (usually between 2 and 3)
- This way, if a broker is down, another broker can serve the data
Example: Topic-A with 2 partitions and a replication factor of 2
Example: we lose Broker 102
- Result: Brokers 101 and 103 can still serve the data
Concept of Leader for Partition
- At any time, only one broker can be the leader for a given partition
- Only that leader can receive and serve data for that partition
- The other brokers will synchronize the data
- Therefore, each partition has one leader and multiple ISRs (in-sync replicas)
Kafka Producers: A Kafka producer serves as a data source that optimizes, writes, and publishes messages to one or more Kafka topics. Kafka producers also serialize, compress, and load balance data among brokers through partitioning.
- Producers write data to topics (which are made of partitions)
- Producers automatically know which broker and partition to write to
- In case of broker failure, producers will automatically recover
- Producers can choose to receive acknowledgement of data writes; you can configure this behavior with the acks setting:
acks=0: the producer won't wait for acknowledgement (possible data loss)
acks=1: the producer will wait for the leader's acknowledgement (limited data loss)
acks=all: leader + replica acknowledgement (no data loss)
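acks is just a configuration property on the producer. A minimal sketch (placeholder broker and topic names) of the safest setting:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader waits for all in-sync replicas before acknowledging.
        // Alternatives: "1" (leader only) or "0" (no acknowledgement at all).
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "durable write"));
        }
    }
}
```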
Producer Message Keys
- Producers can choose to send a key with the message (string, number, etc.)
- If key=null, data is sent round-robin (broker 101, then 102, then 103, and so on)
- If a key is sent, then all the messages for that key will always go to the same partition
- A key is basically sent if you need message ordering for a specific field (e.g. truck_id), as in the sketch below
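As a sketch of keyed writes (the topic truck-positions and the key truck_42 are hypothetical), every record sent with the same key lands on the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // All records keyed "truck_42" hash to the same partition,
                // so this truck's position updates stay in order.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("truck-positions", "truck_42", "position-" + i);
                RecordMetadata meta = producer.send(record).get(); // block to read the metadata
                System.out.println("wrote to partition " + meta.partition());
            }
        }
    }
}
```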
Kafka Consumers: Consumers read data by reading messages from the topics to which they subscribe. Consumers belong to a consumer group. Each consumer within a particular consumer group is responsible for reading a subset of the partitions of each topic it is subscribed to.
- Consumers read data from a topic (identified by name)
- Consumers know which broker to read from
- In case of broker failure, consumers know how to recover
- Data is read in order (within each partition)
Consumer Groups: A Kafka consumer group includes related consumers with a common task. Kafka sends messages from the partitions of a topic to the consumers in the consumer group. Each partition is read by only a single consumer within the group at any given time. A consumer group has a unique group ID and can run multiple processes or instances at once. Multiple consumer groups can each have one consumer read from the same partition. If the number of consumers within a group is greater than the number of partitions, some consumers will be inactive.
- Consumers read data in consumer groups
- Each consumer within a group reads from exclusive partitions
- If you have more consumers than partitions, some consumers will be inactive
Consumer Offsets
- Kafka stores the offsets at which a consumer group has been reading
- The committed offsets live in a Kafka topic named __consumer_offsets
- When a consumer in a group has processed data retrieved from Kafka, it should commit the offsets (a sketch follows below)
- If a consumer dies, it will be able to read from where it left off, thanks to the committed consumer offsets
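A minimal sketch of committing offsets manually with the Java client (broker address, group ID, topic name, and the process step are placeholders); committing only after the records have been processed gives the at-least-once behavior described next:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // commit offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical processing step
                }
                // Commit only after processing: if the consumer crashes first,
                // it re-reads from the last committed offset instead of losing data.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```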
Delivery semantics for consumers
- There are 3 delivery semantics:
- At most once: offsets are committed as soon as the message is received; if the processing goes wrong, the message is lost and will not be read again
- At least once (usually preferred): offsets are committed after the message is processed; if the processing goes wrong, the message is read again, so duplicates are possible and processing should be idempotent
- Exactly once: can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API
Kafka Broker Discovery
- Every Kafka broker is also called a "bootstrap server"
- That means you only need to connect to one broker, and you will be connected to the entire cluster
- Each broker knows about all brokers, topics, and partitions (metadata)
Zookeeper: Kafka does not function without Zookeeper (at least for now; there are plans to deprecate Zookeeper in the near future). Zookeeper works as the central configuration and consensus management system for Kafka. It tracks brokers, topics, partition assignments, and leader elections: basically all the metadata about the cluster.
- Zookeeper manages brokers (it keeps a list of them)
- It helps in performing leader election for partitions
- It sends notifications to Kafka in case of changes (a broker dies, a broker comes up, a topic is created or deleted, etc.)
- By design, Zookeeper operates with an odd number of servers (3, 5, 7, ...)
- Zookeeper has a leader (which handles writes); the other servers are followers (which handle reads)
- Kafka cannot work without Zookeeper
Hope It Helps :)