Scalable Kafka through simple examples: Part 1

This blog series aims to provide insights into configuring Apache Kafka for a production-grade environment. The focus will be on exploring various configuration options and understanding how they interact and impact each other to achieve specific outcomes, often at the expense of other factors. We will develop a sample project, beginning with a POC to better define requirements and determine how to achieve them by fine-tuning Kafka. While the solution will be deployed locally using Docker Compose, minimal adjustments will be needed to adapt it to a lightweight K3s or AWS ECS deployment or even a more robust production setup.

## The Parcel Company

We will work with the CTO of a parcel company called Packetz, who will give us some insight into how the business operates; later we will see how to translate this into configuration. We had a meeting and a tour of the warehouse and, as CTOs usually do, sat down for lunch to discuss details. This resulted in a questionnaire-like list of requirements that looks something like this:

A: "We operate one main warehouse processing about 50,000 parcels daily, with peak hours hitting around 5,000 parcels per hour. We have 3 sorting lines, 2 packing stations, and 2 shipping lanes."
Explanation: This requirement defines our baseline throughput needs.
We need:
- Design for 1.5x peak capacity (≈7,500 parcels/hour)
- Ability to handle approximately 83 events per minute across the three sorting lines at peak (≈125/minute at the 1.5x design target)
- Parallel processing capabilities
- We can achieve this with 6 partitions per topic, which leaves room for scaling while preserving ordered processing within each sorting line (see the sketch below).
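
To make the partition count concrete, here is a minimal sketch using the Java AdminClient to create a six-partition topic. The topic name `parcels.sorting`, the bootstrap address, and the replication factor of 1 (fine for a single-broker Docker Compose setup) are assumptions for illustration; the real topic layout is worked out in later parts.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;

public class CreateSortingTopic {
    public static void main(String[] args) throws Exception {
        // Placeholder bootstrap address for the local Docker Compose broker.
        Map<String, Object> conf = Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(conf)) {
            // 6 partitions ~ 2 per sorting line: headroom for the 1.5x peak target,
            // while per-line ordering is preserved by keying on the line ID (see the ordering question below).
            NewTopic sorting = new NewTopic("parcels.sorting", 6, (short) 1); // RF 1 suits a single local broker
            admin.createTopics(List.of(sorting)).all().get();
        }
    }
}
```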

Q: What happens to a parcel from the moment it enters your warehouse?
A: "Each parcel goes through four main stages: First, registration at receiving dock where we scan and weigh it. Then it moves to sorting where it's assigned a bin based on destination. After sorting, it goes to packing where we prepare it for shipping. Finally, it reaches the shipping stage where it's labeled and loaded."
Explanation: This defines our event flow and topic structure.
We need:
- Four main topics representing each stage
- State tracking between stages
- Clear event boundaries
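
As a rough illustration of those event boundaries, here is a minimal Java sketch of a stage enum and event payload. The class, topic, and field names are placeholders, not the final warehouse-service model.

```java
import java.time.Instant;

/** Minimal event model sketch for the four warehouse stages (names are placeholders). */
public class ParcelEvents {

    /** One topic per stage, e.g. parcels.registration, parcels.sorting, parcels.packing, parcels.shipping. */
    public enum Stage { REGISTRATION, SORTING, PACKING, SHIPPING }

    /** Payload published to the stage topics; sortingLine is later used as the partitioning key. */
    public record ParcelEvent(String parcelId, Stage stage, String sortingLine, Instant occurredAt) {}
}
```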

Q: What happens if a system failure occurs during parcel processing?
A: "We cannot lose any parcel information. If a system fails, we need to know exactly where each parcel was in the process. Some parcels might need reprocessing, but we can never process the same parcel twice at the same stage."
We need:
- Exactly-once processing semantics
- Message deduplication
- State recovery capabilities
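
Below is a hedged sketch of the client settings typically involved in exactly-once, consume-transform-produce processing. The transactional ID, group ID, and bootstrap address are placeholders; the full transactional loop (and whether Kafka Streams handles it for us) is covered in a later part.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/** Sketch: client settings commonly used for exactly-once pipelines. */
public class ExactlyOnceConfig {

    static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder address
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");              // retries cannot create duplicates
        p.put(ProducerConfig.ACKS_CONFIG, "all");                             // required with idempotence
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "sorting-processor-1"); // placeholder, one id per processor instance
        return p;
    }

    static Properties consumerProps() {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "sorting-processor");           // placeholder group
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        p.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");       // only see committed transactions
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");             // offsets committed inside the transaction
        return p;
    }
}
```

Exactly-once covers what happens inside Kafka; consumer-side deduplication (for example by parcel ID plus stage) is still worth keeping as a safety net against the "never process the same parcel twice" requirement.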

Q: How quickly does each stage need to process parcels?
A: "Registration and sorting must keep up with incoming flow - about 80-90 parcels per minute per line during peak times. Packing and shipping can be slightly slower as we have buffer zones, but shouldn't fall more than 100 parcels behind."
We need:
- Minimum throughput of 90 events/minute/line
- Consumer lag monitoring
- Buffer management
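
Consumer lag will mostly be watched through kafka-exporter and Grafana, but as an illustration, here is a minimal AdminClient sketch that computes per-partition lag for a consumer group. The group ID `packing-processor` and the bootstrap address are assumptions.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: per-partition lag for one consumer group. */
public class LagCheck {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            // Offsets the group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("packing-processor")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag); // alert once lag approaches the 100-parcel buffer limit
            });
        }
    }
}
```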

Q: What types of errors occur during processing?
A: "Most common issues are unreadable barcodes at registration, sorting machine jams, incorrect weight measurements, and network connectivity issues. We need to handle these without losing track of parcels."
We need:
- Dead Letter Topics for each processing stage
- Error classification system
- Retry mechanisms
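
One possible shape for the Dead Letter Topic handling is sketched below: a failed record is republished to a `<topic>.dlt` topic with error details in headers, so it can later be classified and retried without losing track of the parcel. The `.dlt` naming convention and header names are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;

/** Sketch: route a failed record to a stage-specific dead-letter topic with error metadata in headers. */
public class DeadLetterPublisher {

    private final Producer<String, String> producer;

    public DeadLetterPublisher(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void sendToDlt(ConsumerRecord<String, String> failed, Exception error) {
        // Placeholder convention: "<stage topic>.dlt", e.g. parcels.sorting.dlt
        ProducerRecord<String, String> dlt =
                new ProducerRecord<>(failed.topic() + ".dlt", failed.key(), failed.value());
        dlt.headers()
           .add("error.class", error.getClass().getName().getBytes(StandardCharsets.UTF_8))
           .add("error.message", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8))
           .add("source.partition", String.valueOf(failed.partition()).getBytes(StandardCharsets.UTF_8))
           .add("source.offset", String.valueOf(failed.offset()).getBytes(StandardCharsets.UTF_8));
        producer.send(dlt);
    }
}
```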

Q: How long do you need to keep parcel processing data, and what data is most important?
A: "We need 7 days of full operational data for troubleshooting. Processing times and error rates need to be kept for 30 days for analysis. Each parcel generates about 1KB of data per stage, and failed parcels generate additional error data."
We need:
- Calculated storage needs: ~50,000 parcels × 4 stages × 1KB = 200MB daily
- Plans for peak periods (3x normal) = ~600MB daily
- Error events storage
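
To show how retention maps to configuration, here is a hedged AdminClient sketch that creates one operational topic with 7-day retention and one analytics topic with 30-day retention. The topic names are placeholders, and whether the 30-day data ultimately lives in Kafka or in Prometheus is still open.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;

/** Sketch: different retention periods for operational vs. analytics topics. */
public class RetentionSetup {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            Map<String, String> sevenDays = Map.of(
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000));
            Map<String, String> thirtyDays = Map.of(
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(30L * 24 * 60 * 60 * 1000));

            admin.createTopics(List.of(
                    new NewTopic("parcels.registration", 6, (short) 1).configs(sevenDays),  // operational data
                    new NewTopic("parcels.metrics", 6, (short) 1).configs(thirtyDays)       // processing times, error rates
            )).all().get();
        }
    }
}
```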

Q: How important is the order of parcel processing?
A: "Parcels must be processed in order within each sorting line to prevent conveyor jams. Different sorting lines can work independently. Express parcels should be prioritized but maintain their relative order within the line."
We need:
- Preservation of order within sorting lines
- Enabling parallel processing across lines
- Priority handling without breaking order (see the sketch below)
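
Ordering within a sorting line falls out of Kafka's partitioning by key: if every event from a line carries the same key, it lands on the same partition and is consumed in order, while different lines spread across partitions and process in parallel. A minimal producer sketch follows; the topic name, key format, and JSON payload are placeholder choices.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/** Sketch: key events by sorting line so each line maps to one partition and stays ordered. */
public class SortingLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence keeps retries from reordering a line's events.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String sortingLine = "line-2"; // placeholder key: all of line-2's parcels hash to the same partition
            producer.send(new ProducerRecord<>("parcels.sorting", sortingLine,
                    "{\"parcelId\":\"P-1001\",\"express\":true}"));
        }
    }
}
```

Express-parcel priority is deliberately not solved here; handling it without breaking per-line order is something we will revisit in a later part.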

Q: How do you plan to handle growth in parcel volume?
A: "We expect to add two more sorting lines within 6 months. During holiday seasons, volumes can spike to 3x normal. We might add a second warehouse next year."
We need:
- Horizontal scaling capability
- Multi-warehouse support
- Dynamic resource allocation
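
When the two extra sorting lines arrive, one option is simply to add partitions to the existing topics, as in the sketch below (growing from 6 to 10 keeps roughly two partitions per line). Note the caveat in the comment: adding partitions changes the key-to-partition mapping for new data, so it is a planned operation, not something to do mid-shift. The topic name and target count are assumptions.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;

/** Sketch: grow a topic from 6 to 10 partitions when the two extra sorting lines come online. */
public class ScaleOutPartitions {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            // Adding partitions remaps keys for new records, so per-line ordering needs
            // a quiet window or a planned cutover around this change.
            admin.createPartitions(Map.of("parcels.sorting", NewPartitions.increaseTo(10))).all().get();
        }
    }
}
```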

Q: What kind of operational visibility do you need for day-to-day operations?
A: "We need basic visibility into consumer groups, topics, and broker health. For now, we just need to track consumer lag, see which consumers are active, and monitor basic broker health."
We need:
- Basic operational visibility
- Consumer group tracking
- Topic monitoring
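
AKHQ (listed in the Requirements section below) gives us this visibility through a UI; for completeness, the same basics can also be pulled programmatically with the AdminClient, as in this small sketch (bootstrap address is a placeholder).

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

import java.util.Collection;
import java.util.Map;

/** Sketch: a quick programmatic health peek (AKHQ covers the same ground with a UI). */
public class ClusterOverview {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            System.out.println("Brokers: " + admin.describeCluster().nodes().get());

            Collection<ConsumerGroupListing> groups = admin.listConsumerGroups().all().get();
            groups.forEach(g -> System.out.println("Consumer group: " + g.groupId()));

            System.out.println("Topics: " + admin.listTopics().names().get());
        }
    }
}
```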

Q: What specific metrics do you need to track for each parcel processing stage?
A: "We need to know how long each stage takes to process a parcel, how many errors occur, and identify any bottlenecks. This helps us optimize our warehouse operations and identify issues quickly."
We need:
- Processing time tracking
- Error rate monitoring
- Analytics based on the Dead Letter Topics
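
As a starting point, stage latency can be derived from the record timestamp (when the event was produced) versus the time the stage finishes with it, and error rates from simple counters exported to Prometheus and charted in Grafana. The sketch below is a bare-bones illustration; the class and method names are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.time.Duration;
import java.time.Instant;

/** Sketch: derive per-stage processing time and error rate from consumed records. */
public class StageMetrics {

    private long processed;
    private long failed;

    /** Time from when the event was produced (record timestamp) to when this stage finished with it. */
    public Duration recordStageLatency(ConsumerRecord<String, String> record) {
        processed++;
        return Duration.between(Instant.ofEpochMilli(record.timestamp()), Instant.now());
    }

    public void recordFailure() {
        failed++;
    }

    /** Error rate to export (e.g. via a Prometheus gauge) and chart in Grafana. */
    public double errorRate() {
        return processed == 0 ? 0.0 : (double) failed / processed;
    }
}
```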

## Plan

The following parts will focus on implementing these requirements in Kafka, with step-by-step explanations of each configuration change, particularly the server.properties parameters. We will also cover container-related issues, Kafka and overall system performance metrics, and many smaller tasks that are still important for the overall project.

I plan to update this series as I come up with possible improvements, which will be added as dedicated parts prefixed with [FIXES] or something similar. That way, we get to see the differences and improvements, just like you would in a real-world system. I had an idea to include a Kanban-style board of tasks to make it even more detailed, but that adds unnecessary complexity for now. Any format changes to the series will be noted visibly.

## Requirements

If you want to follow along, you will need Docker && Docker Compose (use this handy script for the installation).

Images used:

- bitnami/kafka:3.7.1 (for Kafka and the kafka-init service, used for topic creation, etc.)
- danielqsj/kafka-exporter:v1.8.0
- prom/prometheus:v2.55.0
- grafana/grafana:11.3.0
- tchiotludo/akhq:0.25.1
- warehouse-service (our app)