Let’s face it—distributed streaming is an exciting technology that can be leveraged in many ways. Use cases include messaging, log aggregation, distributed tracing, and event sourcing, among others. Distributed streaming can result in significant benefits for companies that choose to use it, but, when not implemented correctly, it can initiate a frustrating technical debt cycle.
How do you know if you’re properly implementing Kafka in your environment? This article will examine a few of the common mistakes that people make when undertaking this process. Avoiding them will help you start off your Kafka implementation the right way.
Kafka has five primary components: producers, brokers, consumers, topics, and ZooKeepers. Kafka producers push data to brokers. These brokers receive the data and store them in separate topics (more on this later) so that they can be retrieved by consumers. Consumers fetch the data and act upon it in a variety of ways. Kafka ZooKeeper is used to keep track of these components and their activities.
All Kafka messages (whether pushed or pulled) are organized into topics.. Any message you want to write you have to push to a topic and any message you want to read must be pulled from a topic.
A producer publishes a message to a topic and a consumer pulls messages from the topic. Naturally, topic management becomes a key part of deploying and maintaining your Kafka infrastructure. For a more detailed look at Kafka architecture, check out Kafka’s documentation.
If you’re wondering if Kafka is right for you, take a look at your event sources. At present, you may be storing these events in some kind of data lake or data store, such as the Hadoop Distributed File System, or HDFS.
Here are five common mistakes to avoid when planning to implement Kafka:
1. Keeping Too Much Data
Depending on the requirements in your environment, the amount of data Kafka processes can be overwhelming. Without properly tuned data retention settings, your data could be rendered useless.
Data retention is particularly important in Kafka because messages remain in topics, taking up disk space on the brokers until their configurable size is reached or the retention period elapses. These messages remain even if they have already been consumed.
If the data retention period or size is set too low, the data may not be consumed before it is removed from the broker. In this sense, Kafka operates differently from traditional message brokers.
One family of settings—
log.retention.X— manages this in Kafka. Examples include
log.retention.bytes for delineating the max size of a Kafka partition before its deletion,
long.retention.ms that outlines how long to keep the data before it’s deleted.
2. Not Balancing Topics
Inevitably, you will find yourself needing to scale Kafka out to meet the demands of your data streams. At this point, managing your topics becomes a balancing act. It’s important to rebalance your topics to reduce resource bottlenecks and maintain storage efficiency.
There are two things to aspects of topic balancing that need to be addressed: 1) partition leadership balance and 2) partition spread across the Kafka cluster.
Kafka Partition Leadership Balance
On partition leadership, the data in every Kafka partition is replicated across one or more brokers. These are collectively called a replica set. Each replica set has one broker acting as the leader for that partition, and the other brokers in the set are the replicas (for that partition).
The leader is then responsible for communicating with the publishers, consumers and other replica brokers for data transfer.
For this reason the leader is normally more loaded than the replicas.
When a broker goes down, one of the replica brokers assumes leadership for that replica set. If this process is not properly managed, a real-world outcome of a broker loss could be the creation of a hotspot on one of the other brokers, if it assumes leadership on more than its fair share of Kafka partitions.
This can also happen when the downed broker comes back online.
You can configure the Kafka cluster to promote a replica partition using the
auto.leader.rebalance.enable setting should the primary partition or its broker go down. Be warned, though: in the past, this process is known to have caused rebalancing to fail entirely or to have required a rolling restart of all brokers! Thankfully, the provided script,
kafka-preferred-replica-election.sh, enables you to avoid these issues.
Kafka Partition Spread across the Cluster
When adding nodes to your cluster, the cluster will not assume any workload automatically for existing topics—only for new ones. Therefore, it’s important to rebalance your existing topics using the
kafka-reassign-partition.sh script. This script will generate new metadata that guides the Kafka brokers’ management of Kafka partition assignments. This is also true when scaling the cluster. Partitions need to be reassigned away from the broker that is to be terminated prior to removal of the broker from the cluster.
With proper monitoring and execution, these scripts can help rebalance the load more evenly across the cluster.
3. Not Accounting for Long-Term Storage
You may need to store data with Kafka for a long period of time. Kafka stores persistent, checksummed, and replicated data. As a result, it can keep information indefinitely and reprocess it when the proper configuration is established. Unfortunately, Kafka doesn’t allow storage to scale beyond the capacity of a single node. Determine in advance how you want to use the data, since reprocessing it will require a development effort.
Thankfully, long-term storage and data reprocessing aren’t uncommon use cases for Kafka. As mentioned earlier, you can pair Kafka with HDFS or blob storage for additional permanence.
Deploying a reusable reprocessing cluster can also mitigate some of these problems.
4. No Disaster Recovery (DR) Plan
A disaster recovery plan is critical for all of your services, and Kafka is no exception. Kafka has a featured called MirrorMaker that lets you consume data from one cluster and copy it to another, retaining data for disaster recovery purposes.”. Using MirrorMaker, you can manage data retention to reduce costs. In this situation, Kafka consumers and producers can be switched to using the mirrored cluster if the main Kafka cluster goes down. However, the topic offsets are not replicated between Kafka clusters. By creating unique keys in messages, you can avoid this issue. Unique keys will point consumers to the last “checkpoint” where they will resume processing.
5. No API Enforcement
No Kafka consumer or producer uses the exact same amount of system resources. In fact, certain applications will use significantly more of a Kafka cluster’s processing power than others.
Kafka has many ways to modify the balance of the system. One way of maintaining balance is by setting quotas for the Kafka producer and Kafka consumer APIs. Quotas ensure that one system doesn’t hog all of the available resources.
By configuring the quota enforcement per client, you can ensure that all of the messages piped through your Kafka cluster will be able to make it through.
There’s a lot of buzz around Kafka and what it can do for your organization’s data. When properly implemented, Kafka can be a powerful tool for sending and receiving data across your network.
Improper implementations, such as the ones examined in this article, will not only prevent you from processing data more efficiently, they may also set off a technical debt cycle. Whether you have Kafka in production now or you’re evaluating it for your next use case, keep these pitfalls in mind as you implement this potentially powerful tool.