TIL: Tuning Linux Networking stack may help to fix Kafka's URP

While I had previously only applied common Linux kernel tuning recommendations for Apache Kafka [1]—such as adjusting the number of open files— a recent investigation into Under-Replicated Partitions led me to explore deeper OS-level optimizations. An increase on network packet loss along with an increase on TCP memory usage hinted us that this may be related with networking.

This challenge presented a great opportunity to collaborate with our SRE team at Aiven and investigate the Linux kernel networking stack [2], so we dived into it.

On the Linux network stack there are layers of software and hardware components that could be tuned to allow for more scalable network packet processing.

Starting at the Hardware-level, RSS (Receive-Side Scaling) feature allows for (modern) NICs to distribute packets across multiple CPUs based on NIC queues. For most systems, RSS is enough for performant packet processing; but for larger systems where the load is high and the number of CPUs may be higher than the number of NIC queues there is room for further tuning at the software level.

RPS (Receive Packet Steering) and RFS (Receive Flow Steering) are both kernel features that increase and tune the distribution and affinity of packet processing beyond RSS limits. At this level, a flow is identified by hash(src_ip, dst_ip, src_port, dst_port). RPS offers a deterministic distribution where the same flow goes to the same CPU based on the hash; and RFS goes a step further and distributes the flow dynamically by tracking which CPU is processing the application socket, selecting the CPU based on the application behavior.

Our analysis revealed that we could achieve lower latency by enabling CPU affinity with RPS, ensuring that connections between specific source and target hosts and ports consistently used the same CPU. This was enough to unlock the processing bottleneck and bring the cluster back to a healthy state: TCP memory consumption dropped to average values, packet loss was eliminated, and most significantly, Under-Replicated Partitions (URPs) went to zero even under high load.

For instance the following code enables RPS by evenly distributing the flows buffered across the available CPUs:

# Get CPUs to distribute RPS flows
NUM_CPUS=$(nproc)
# This generates a mask like "f" for 4 CPUs, "ff" for 8 CPUs, "ffff" for 16 CPUs, etc.
cpus=$(printf '%x' $((2**$NUM_CPUS - 1)))
# Distribute RPS flows evenly
BUFFER_SIZE=16384
echo $BUFFER_SIZE | sudo tee /proc/sys/net/core/rps_sock_flow_entries;
num_queues=$(ls /sys/class/net/eth0/queues/ | grep rx | wc -l);
flow_cnt=$(($BUFFER_SIZE / num_queues));
for i in $(seq 0 $((num_queues - 1))); do
  echo $flow_cnt | sudo tee /sys/class/net/eth0/queues/rx-$i/rps_flow_cnt;
  echo $cpus | sudo tee /sys/class/net/eth0/queues/rx-$i/rps_cpus;
done;

NUM_CPUS=$(nproc); cpus=$(printf '%x' $((2**$NUM_CPUS - 1)))
Determines the number of CPUs available for RPS
BUFFER_SIZE=16384; echo $BUFFER_SIZE | sudo tee /proc/sys/net/core/rps_sock_flow_entries:
Configures the system-wide RPS flow entry limit
Determines how many concurrent network flows can be tracked
May be available through net.core.rps_sock_flow_entries, haven’t confirmed
num_queues=$(ls /sys/class/net/eth0/queues/ | grep rx | wc -l):
Counts the number of receive (RX) queues on the network interface
Multiple queues allow parallel processing of network packets
flow_cnt=$(($BUFFER_SIZE / num_queues)):
Calculates flow entries per queue
Distributes the total buffer size evenly across all queues
The for loop:
for i in $(seq 0 $((num_queues - 1))); do
  echo $flow_cnt | sudo tee /sys/class/net/eth0/queues/rx-$i/rps_flow_cnt;
  echo $cpus | sudo tee /sys/class/net/eth0/queues/rx-$i/rps_cpus;
done
Configures each RX queue with its allocated flow count
Ensures even distribution of processing across CPU cores

While our specific issue was resolved through receive-side optimizations, it’s worth mentioning that XPS (Transmit Packet Steering) provides similar benefits for outgoing traffic. We plan to evaluate this optimization in the future.

It’s worth mentioning as well that these optimizations come with their own trade-offs:

Increased resource utilization: more memory used to keep the flow entries stored, overhead of additional flow hash calculation, and packets using all CPUs.
System-specific benefits: while these optimizations have helped on Kafka high-load clusters (e.g., >100MB/s in/out), their benefits will be dependent on the hardware, workload, and traffic patterns. For instance, at a certain point the flow hash calculation may be detrimental to the latency compared to simpler configuration on some hardware.

[]($section.id(‘refs’)

References

[1] Kafka documentation for OS tuning: https://kafka.apache.org/documentation/#os
[2] Kernel documentation on scaling networking stack: https://www.kernel.org/doc/html/latest/networking/scaling.html

Updates:

2025-03-15: Expanded the script to include rps_cpus configuration for each RX queue. This is key to ensure that the flows are distributed across the available CPUs.