Kafka on Kubernetes: Performance Lessons for Any Disk-Heavy Data Service
We recently started migrating Kafka clusters from EC2 to EKS using Strimzi. The goal was not to chase new features, but to reduce the operational overhead of running large stateful clusters by hand. On EC2, upgrades, configuration changes, instance-family replacements, and failure recovery all required too much manual coordination. We wanted a model that gave us:

- Declarative configuration.
- Simpler upgrades.
- Easier infrastructure changes.
- Better self-healing.
- Less day-to-day operational toil.

That part worked. What did not work was the performance profile after the migration. As soon as we moved the first cluster, we saw persistent disk reads across the brokers and higher latency than we expected on comparable hardware.

For Kafka, disk reads are not just a storage detail. In normal operation, when consumers stay near the head of the log and the brokers have enough memory, hot data should usually be served from the page cache instead of disk. Once reads start falling through to storage, the symptoms become hard to ignore:

- Latency increases.
- Throughput becomes less predictable.
- Storage does more work than it should.
- The cluster behaves differently from the mental model you rely on.

That is why this stood out immediately. We were not looking at a harmless metric difference. We were looking at a performance path that should not have been active so often.

This post explains how we investigated the issue and what we found. Although Kafka exposed the problem, the root cause was broader: an interaction between Kubernetes, cgroup v2, Linux reclaim behavior, and disk-backed workloads. By the end of this article, you should have a clearer picture of:

- Why some data services start reading from disk more than expected on Kubernetes.
- Which signals help distinguish a Kafka problem from a kernel or memory-management problem.
- What to inspect before changing application-level settings.
- Which kernel and reclaim-related knobs are worth understanding.
- How to think about tuning other stateful services that behave strangely on Kubernetes.

After migrating our first Kafka cluster to Kubernetes with Strimzi, we noticed something unusual right away: the brokers were doing consistent, non-trivial disk reads. For our workload, that was a red flag, because these clusters handle very high throughput, and small latency regressions show up quickly in broker performance. The read pattern mattered because it was not an isolated spike or a recovery event. It was steady read activity under normal operation, on hardware that was supposed to behave similarly to our EC2 deployment.

Kafka relies heavily on the operating system page cache rather than implementing its own buffering layer. In a healthy cluster, when consumers are reading near the head of the log and the node has enough available memory, most hot reads should come from memory instead of disk. That is why these reads got our attention immediately:

- Memory hits are much faster than disk reads.
- Falling through to storage adds latency.
- Sustained reads often mean the page cache is being reclaimed too aggressively.

The key point was simple: for this workload, disk reads were not just a storage metric. They were a latency signal.

Our first suspicion was consumer lag. That would have been the simplest explanation: if consumers were reading older offsets, the relevant data might no longer be in the page cache, forcing the kernel to fetch it from disk. We checked consumer lag using our Kafka lag exporters and monitoring dashboards and found no meaningful lag. Consumers were reading close to the head of the log, so lag alone could not explain the persistent reads.

Takeaway: the reads were real, but they were not caused by consumers falling behind.
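If you want to reproduce this kind of lag check without a dedicated exporter, the stock Kafka CLI is enough. The sketch below assumes a Strimzi-managed broker pod named my-cluster-kafka-0 in a kafka namespace and a consumer group called my-consumer-group; the pod name, namespace, group, listener port, and the path to the Kafka scripts inside the container are all placeholders for your own values.

```
# Describe a consumer group and compare CURRENT-OFFSET with LOG-END-OFFSET.
# The LAG column should stay near zero if consumers are reading at the head of the log.
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe \
    --group my-consumer-group
```

A group that is genuinely lagging shows large, growing values in the LAG column across partitions; in our case the lag stayed small everywhere, which ruled out that explanation.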
Once we ruled out lag, the next question was straightforward: what changed between the old and new environments? We compared the obvious candidates:

- Kafka configuration, including topic settings, compression, and broker configs.
- Linux sysctl tuning.
- Instance sizing, including CPU and memory.

Those were effectively unchanged. The meaningful differences were lower in the stack:

- We moved from Ubuntu 20.04 to Amazon Linux 2023.
- We moved from cgroup v1 to cgroup v2.

That narrowed the investigation to operating-system memory behavior rather than Kafka itself.

To see what the kernel was doing, we used writeback.bt, a bpftrace script that shows why pages are being written back. This was useful because it distinguishes between ordinary background writeback and reclaim-driven writeback caused by memory pressure.

On the new machine, many writeback events were tagged as vmscan:

```
bpftrace ./writeback.bt
Attaching 4 probes...
Tracing writeback... Hit Ctrl-C to end.
TIME      DEVICE   PAGES    REASON        ms
13:06:59  259:0    2385     vmscan        0.006
13:06:59  259:0    2385     vmscan        0.000
13:06:59  259:0    26476    periodic      0.000
13:06:59  259:0    38518    vmscan        0.002
13:06:59  259:0    2397     vmscan        0.000
13:06:59  259:0    2397     vmscan        0.000
```

On the old machine, writeback was dominated by background and periodic events instead:

```
bpftrace ./writeback.bt
Attaching 4 probes...
Tracing writeback... Hit Ctrl-C to end.
TIME      DEVICE   PAGES    REASON        ms
13:07:59  259:0    2945     periodic      0.006
13:07:59  259:0    25613    periodic      0.000
13:07:59  259:0    26476    background    0.000
13:07:59  259:0    38518    background    0.000
13:07:59  259:0    2107     background    0.000
13:07:59  259:0    2645     periodic      0.000
```

That difference was the first strong kernel-level signal. The new setup was doing much more reclaim-driven writeback, which meant the page cache was under pressure. vmscan is part of the kernel reclaim path; when it shows up in writeback traces, it usually means the kernel is actively reclaiming memory rather than performing routine background flushing. In practice, the system was paying a reclaim penalty we did not expect on equivalent hardware.

At that point, the question was no longer “why is Kafka reading from disk?” but “why is the kernel reclaiming page cache so aggressively in this environment?”

This led us to cgroup v2 memory behavior, and specifically to memory.high. Under cgroup v2, once a workload crosses the high memory threshold, the kernel can start applying reclaim pressure inside that cgroup even if the node still has free memory available. That is a poor fit for Kafka and similar disk-backed systems:

- They benefit from using available memory for page cache.
- High-throughput traffic can push them into reclaim pressure quickly.
- Once reclaim starts inside the cgroup, hot pages are evicted sooner.
- More reads then fall through to disk, increasing latency.

In other words, the issue was not that the node had too little RAM in absolute terms. The issue was that memory-reclaim behavior changed under Kubernetes with cgroup v2.
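To see whether this is happening on a given node, you can read the cgroup v2 files the kubelet created for the pod directly. The commands below are a sketch: the path under /sys/fs/cgroup depends on the cgroup driver, QoS class, pod UID, and container runtime, so the POD_CG value shown is only illustrative and must be resolved on your own node first (for example by searching under kubepods.slice).

```
# Illustrative pod cgroup path -- the real one depends on QoS class, pod UID,
# and container runtime. Resolve it on your node before running these.
POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<POD_UID>.slice

# What the kubelet actually configured ("max" means no ceiling on that control).
cat $POD_CG/memory.max      # hard limit, from the pod's memory limit
cat $POD_CG/memory.high     # throttle/reclaim threshold, if one was set
cat $POD_CG/memory.current  # current usage, including page cache charged to the cgroup

# Reclaim activity inside this cgroup: rising pgscan/pgsteal and
# workingset_refault counters mean the pod's own page cache is being
# evicted and then read back from disk.
grep -E 'pgscan|pgsteal|workingset_refault' $POD_CG/memory.stat

# Pressure Stall Information for the cgroup: non-zero avg values mean
# tasks are stalling on memory.
cat $POD_CG/memory.pressure
```

If memory.stat shows steady refaults while the node as a whole still has free memory, the reclaim pressure is coming from the cgroup boundary rather than from a host that is genuinely short of RAM.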
At first, we tried the obvious Kafka-style tuning knobs such as vm.dirty_ratio and vm.dirty_background_ratio. Those settings influence how much dirty data the kernel allows to accumulate before forcing writeback. They helped control writeback behavior, but they did not solve the real problem: reclaim-driven writeback remained, and disk reads still stayed above the old baseline.

Takeaway: dirty page tuning was not enough, because the main issue was reclaim pressure rather than ordinary flushing.

The first meaningful fix was to stop setting memory limits for the Kafka pods and rely on requests plus dedicated nodes instead. That avoided triggering pod-level reclaim pressure through cgroup memory controls while the host still had memory available. This worked for our setup because:

- Kafka ran on dedicated nodes.
- Only minimal system and DaemonSet workloads shared those nodes.
- Capacity was managed at the node level rather than through strict pod memory caps.

That change gave Kafka more freedom to benefit from the host page cache instead of being boxed into an artificial reclaim boundary.

Warning: do not treat this as a general Kubernetes best practice. Removing pod memory limits on shared nodes can push the node into memory pressure, which may trigger pod eviction and interrupt the workload. We only used this approach because Kafka was isolated on dedicated nodes and capacity was controlled at the node level.

The second fix was to tune vm.min_free_kbytes. This setting influences the kernel watermarks that determine when kswapd wakes up and starts reclaiming memory. On our nodes, the default value was:

```
sysctl -a | grep min_free_kbytes
vm.min_free_kbytes = 67584
```

We increased it gradually and observed:

- Earlier background reclaim by kswapd.
- Fewer direct reclaim events.
- Less vmscan-driven writeback under the same workload.

For our hardware, the best result came from:

```
vm.min_free_kbytes = 2548576
```

That number is workload-specific, so I would not present it as a universal recommendation. The important lesson is the mechanism: raising the watermarks helped the kernel reclaim earlier and more smoothly, instead of falling into harsher reclaim behavior later.

After applying both changes, removing the Kafka pod memory limits and tuning vm.min_free_kbytes on the nodes, the difference was obvious: under the same workload, disk read bytes dropped close to zero and normalized load average decreased by about 50%.

If you are troubleshooting similar behavior in Kafka or another disk-backed service on Kubernetes, I'd start here:

- Run the workload on dedicated nodes.
- Avoid pod memory limits if the service depends heavily on page cache.
- Compare cgroup behavior, not just Kafka configs.
- Use bpftrace or similar tools to distinguish normal writeback from reclaim-driven writeback.
- Inspect reclaim-related tuning such as vm.min_free_kbytes, not only dirty page ratios (see the sketch below).
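As a rough illustration of that last point, the commands below show how you might inspect reclaim activity on a node and trial a higher watermark. The value used here is a placeholder rather than a recommendation, and on managed node groups you would normally roll the setting out through node bootstrap or a sysctl mechanism rather than by hand.

```
# Current watermark setting and node-wide reclaim counters.
sysctl vm.min_free_kbytes
grep -E 'pgscan_(kswapd|direct)|pgsteal_(kswapd|direct)|workingset_refault' /proc/vmstat

# Trial a higher value (placeholder number) and watch whether direct reclaim
# and refaults stop growing under load.
sudo sysctl -w vm.min_free_kbytes=1048576

# Persist the value once you have settled on one for your hardware.
echo 'vm.min_free_kbytes = 1048576' | sudo tee /etc/sysctl.d/99-kafka-reclaim.conf
sudo sysctl --system
```

Rising pgscan_direct relative to pgscan_kswapd is the signal to watch: it means applications are being forced into synchronous reclaim instead of kswapd doing the work in the background.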
If you want to dig deeper into the ideas behind this investigation, these references are worth reading:

- Kafka design: Don't Fear the Filesystem
- Confluent: Kafka and the filesystem
- Linux page cache for SREs
- Understanding the Linux page frame reclamation algorithm
- Oracle Linux blog: Anticipating your memory needs
- cgroup v2 admin guide
- bpftrace

If you've run into similar page-cache, reclaim, or cgroup v2 behavior in Kafka or other stateful workloads on Kubernetes, I'd be interested to compare notes.