Leader Balancing
Each Kahuna partition is an independent Raft group with one leader. The leader processes requests, replicates writes, and sends heartbeats for that partition.
Leaders are elected independently. Over time, one node can end up leading many more partitions, or several high-traffic partitions, while other nodes have spare capacity.
The leader balancer gradually redistributes leadership using two signals:
- Leader count: spreads partition leaders more evenly across nodes
- Partition load: separates hot leaders using operation throughput and queue depth
| Node | Before | After |
|---|---|---|
| node-a | 8 leaders | 4 leaders |
| node-b | 2 leaders | 4 leaders |
| node-c | 2 leaders | 4 leaders |
The balancer never moves partition data. It suggests normal Raft leadership transfers between existing replicas. The current leader validates its term and ownership before performing a handoff.
When to Enable It
Enable leader balancing when both conditions apply:
- The deployment has multiple Kahuna nodes and many partitions
- Leader counts or hot partitions are unevenly concentrated across nodes
Leave it disabled for standalone and embedded deployments. A single node has nowhere to transfer leadership. It can also remain disabled for a small cluster whose leaders are already distributed acceptably.
Leader balancing is a performance optimization, not a correctness requirement. It is disabled by default and adds no reporting or planning overhead while disabled.
Enable the Balancer
Enable it on every cluster node and perform a rolling restart:
dotnet Kahuna.Server.dll \
--raft-enable-leader-balancer true \
<other-server-options>
Every node must participate because each node publishes local load reports and receives transfer suggestions. Planning runs only on the current leader of system partition 0, which can change at any time.
A partial rollout produces incomplete reports. The planner responds by skipping passes rather than making decisions from an incomplete cluster view.
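The skip rule can be pictured as a freshness check over the latest report from each active node. The Python sketch below is illustrative only: `has_complete_view`, the report dictionary shape, and the list of active nodes are assumptions made for this example, not Kahuna's internal API. Only the 20000 ms value comes from the `--raft-leader-balancer-report-ttl` default.

```python
REPORT_TTL_MS = 20_000  # default of --raft-leader-balancer-report-ttl


def has_complete_view(active_nodes, last_report_ms, now_ms, ttl_ms=REPORT_TTL_MS):
    """Return True only if every active node has a report no older than ttl_ms.

    active_nodes: iterable of node ids (hypothetical shape).
    last_report_ms: dict mapping node id -> timestamp of its latest report.
    """
    for node in active_nodes:
        reported_at = last_report_ms.get(node)
        if reported_at is None:
            return False  # node never reported, e.g. the balancer is not enabled there
        if now_ms - reported_at > ttl_ms:
            return False  # report is older than the TTL
    return True


nodes = ["node-a", "node-b", "node-c"]
reports = {"node-a": 99_000, "node-b": 98_000}  # node-c has not reported yet
print(has_complete_view(nodes, reports, now_ms=100_000))  # False: skip this pass

reports["node-c"] = 97_000
print(has_complete_view(nodes, reports, now_ms=100_000))  # True: planning can run
```

In a partial rollout, the nodes without the flag behave like `node-c` in the first call: they never publish a report, so the view stays incomplete and no plan is produced.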
How a Balancing Pass Works
- Every enabled node periodically publishes the partitions it leads and their load
- The partition `0` leader collects reports from active nodes
- The planner first corrects leader-count imbalance
- When counts are balanced, it can swap leaders to reduce load imbalance
- Busy nodes receive advisory transfer suggestions
- Each current partition leader validates and performs its own Raft handoff
- Later reports confirm the new leadership distribution
Suggestions and planner state are held in memory. If the partition 0 leader changes, the new leader rebuilds its view from subsequent reports.
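The count-balancing step can be approximated with a greedy loop. This is a minimal sketch under stated assumptions, not the planner's actual algorithm: `plan_count_moves` is a hypothetical name, the deadband is interpreted here as the allowed spread between the node with the most leaders and the node with the fewest, and the sketch ignores load scores, stability windows, cooldowns, and the concurrent-transfer limit.

```python
def plan_count_moves(leader_counts, deadband=1, max_moves=4):
    """Suggest (source, destination) leadership moves for one planning pass.

    leader_counts: dict of node id -> number of partitions it leads.
    Stops once the spread between the node with the most leaders and the
    node with the fewest is within the deadband, or once max_moves
    suggestions have been produced.
    """
    counts = dict(leader_counts)  # work on a copy, leave the input untouched
    moves = []
    if len(counts) < 2:
        return moves, counts  # a single node has nowhere to transfer leadership
    while len(moves) < max_moves:
        busiest = max(counts, key=counts.get)
        idlest = min(counts, key=counts.get)
        if counts[busiest] - counts[idlest] <= deadband:
            break  # already inside the deadband
        counts[busiest] -= 1
        counts[idlest] += 1
        moves.append((busiest, idlest))
    return moves, counts


moves, projected = plan_count_moves({"node-a": 8, "node-b": 2, "node-c": 2})
print(moves)
print(projected)  # {'node-a': 4, 'node-b': 4, 'node-c': 4}
```

The returned moves are only suggestions. In the flow above, each one still has to be validated and carried out by the partition's current leader, and the projected counts become real only once later reports confirm them.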
Configuration
Start with the defaults. They intentionally move leadership slowly to limit churn.
| Server option | Default | Description |
|---|---|---|
| `--raft-enable-leader-balancer` | false | Enables reports, planning passes, and transfer suggestions |
| `--raft-leader-balancer-interval` | 30000 ms | Interval between planning passes on the partition 0 leader |
| `--raft-leader-balancer-report-interval` | 5000 ms | Interval between load reports from each node |
| `--raft-leader-balancer-report-ttl` | 20000 ms | Maximum report age accepted by the planner. Must exceed the report interval |
| `--raft-count-deadband` | 1 | Allowed leader-count deviation before count balancing starts |
| `--raft-load-imbalance-threshold` | 0.25 | Fractional load skew required for load-based swaps after counts are balanced |
| `--raft-min-leader-stability-ms` | 5000 ms | Minimum leadership age before a partition can move |
| `--raft-move-cooldown` | 60000 ms | Minimum delay before the same partition can move again |
| `--raft-max-moves-per-pass` | 4 | Maximum suggestions created by one planning pass |
| `--raft-max-concurrent-transfers` | 2 | Maximum transfer suggestions tracked at the same time |
| `--raft-suggestion-timeout` | 15000 ms | Time allowed for a suggested transfer to appear in reports |
| `--raft-leader-balancer-ops-weight` | 1.0 | Weight assigned to operations per second in the load score |
| `--raft-leader-balancer-queue-weight` | 0.5 | Weight assigned to pending queue depth in the load score |
The partition load score is approximately:
load = ops weight * operations per second
+ queue weight * pending queue depth
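As a worked example of the formula, the Python sketch below evaluates the score with the default weights. The function name `load_score` and the sample throughput and queue numbers are made up for illustration, and the real score is only described as approximate.

```python
def load_score(ops_per_second, queue_depth, ops_weight=1.0, queue_weight=0.5):
    """Approximate partition load: weighted throughput plus weighted queue depth."""
    return ops_weight * ops_per_second + queue_weight * queue_depth


# A busy partition with a short queue.
print(load_score(ops_per_second=200, queue_depth=10))   # 205.0

# Fewer operations, but a long backlog: this one scores as hotter.
print(load_score(ops_per_second=120, queue_depth=300))  # 270.0

# With the queue weight set to 0, only throughput counts.
print(load_score(120, 300, queue_weight=0.0))           # 120.0
```

The second partition ranks higher even though it serves fewer operations, which is why the two weights should reflect whichever signal best tracks the real bottleneck in the workload.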
Tuning
Change settings only for an observed problem:
- Convergence is too slow: lower `--raft-leader-balancer-interval` or raise the move and concurrency limits
- Leadership moves too often: raise `--raft-count-deadband`, `--raft-load-imbalance-threshold`, or `--raft-move-cooldown`
- Hot partitions remain together: increase the operations or queue weight that best represents the workload bottleneck
- Successful transfers time out: increase `--raft-suggestion-timeout` so the new leader report has time to reach the planner
- Planning passes are skipped: verify every node has balancing enabled and keep the report TTL comfortably above the report interval
Faster convergence creates more leadership changes. Prefer gradual movement unless a measured bottleneck justifies more aggressive settings.
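Before changing timing options, it can help to check that they still relate to each other sensibly. The sketch below is one possible pre-flight check, not something Kahuna ships: `check_intervals` and the `ttl_margin` factor of 2 are assumptions for this example. The only hard rule taken from the option descriptions is that the report TTL must exceed the report interval.

```python
def check_intervals(report_interval_ms=5_000, report_ttl_ms=20_000,
                    suggestion_timeout_ms=15_000, ttl_margin=2.0):
    """Return a list of warnings about timing options that work against each other.

    ttl_margin is a hypothetical rule of thumb for "comfortably above".
    """
    problems = []
    if report_ttl_ms <= report_interval_ms:
        problems.append("report TTL must exceed the report interval")
    elif report_ttl_ms < ttl_margin * report_interval_ms:
        problems.append("report TTL leaves little room for a delayed report")
    if suggestion_timeout_ms <= report_interval_ms:
        # A completed transfer is only visible once the new leader reports it.
        problems.append("suggestion timeout is shorter than one report interval")
    return problems


print(check_intervals())                        # [] with the documented defaults
print(check_intervals(report_ttl_ms=6_000))     # TTL is valid but tight
print(check_intervals(suggestion_timeout_ms=3_000))
```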
Metrics
Kommander emits leader-balancer instruments through the Kommander meter:
| Metric | Type | Meaning |
|---|---|---|
| `raft.balancer.moves_total` | Counter | Transfer suggestions by outcome: `planned`, `succeeded`, or `timed_out` |
| `raft.balancer.skipped_passes_total` | Counter | Passes skipped because the report view was incomplete |
| `raft.balancer.count_imbalance` | Gauge | Distance between the most-loaded node and the ideal leader count |
| `raft.balancer.load_imbalance` | Gauge | Fractional load skew between nodes |
The two imbalance gauges are meaningful only on the current partition 0 leader. Other nodes report zero, and the node producing meaningful values changes when partition 0 leadership changes.
The metrics exist inside Kommander, but the deployment must export the Kommander meter through its configured telemetry pipeline before an operator can query them.
Reading the Metrics
- Healthy convergence: planned moves are followed by succeeded moves, then both imbalance gauges trend toward zero
- Many timed-out moves: suggestions are not reaching recipients or the suggestion timeout is shorter than report propagation
- Many skipped passes: at least one active node has missing or stale reports
- No moves: the cluster may already be inside the configured count deadband and load threshold
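The reading guide above can be turned into a small triage helper. The sketch below is a rough illustration only: `diagnose` is a hypothetical function, its inputs are assumed to be counter increases over one observation window, and the comparison rules are arbitrary choices for this example rather than recommended alert thresholds.

```python
def diagnose(planned, succeeded, timed_out, skipped_passes):
    """Map counter increases over one observation window to operator hints."""
    hints = []
    if skipped_passes > 0:
        hints.append("skipped passes: look for nodes with missing or stale reports")
    if timed_out > 0 and timed_out >= succeeded:
        hints.append("timed-out moves: check suggestion delivery and the suggestion timeout")
    if planned == 0 and skipped_passes == 0:
        hints.append("no moves: the cluster may already be within the deadband and threshold")
    if not hints:
        hints.append("no problem signal in this window")
    return hints


print(diagnose(planned=4, succeeded=4, timed_out=0, skipped_passes=0))
print(diagnose(planned=4, succeeded=1, timed_out=3, skipped_passes=2))
print(diagnose(planned=0, succeeded=0, timed_out=0, skipped_passes=0))
```

Because the imbalance gauges are meaningful only on the current partition 0 leader, a helper like this is safer when it relies on the counters and treats the gauges as a secondary signal.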
Safety and Failure Behavior
The partition leader, not the planner, has final authority over every transfer. A suggestion is ignored when its term, source leader, destination, or partition state is no longer valid.
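The leader-side check might look like the following Python sketch. Every name here (`Suggestion`, `PartitionState`, `should_transfer`, and their fields) is hypothetical and chosen for this example. The sketch only mirrors the four conditions named above and does not reflect Kahuna's or Kommander's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    partition_id: int
    term: int          # term the planner observed when it made the suggestion
    source: str        # node the planner believed was the leader
    destination: str   # proposed new leader


@dataclass(frozen=True)
class PartitionState:
    partition_id: int
    current_term: int
    leader: str
    replicas: frozenset


def should_transfer(local_node, state, suggestion):
    """Accept a suggestion only if it still matches the local view of the partition."""
    return (
        suggestion.partition_id == state.partition_id
        and suggestion.term == state.current_term      # no election since planning
        and suggestion.source == local_node            # addressed to this node
        and state.leader == local_node                 # this node still leads
        and suggestion.destination in state.replicas   # target is an existing replica
        and suggestion.destination != local_node       # a self-transfer is a no-op
    )


state = PartitionState(7, current_term=3, leader="node-a",
                       replicas=frozenset({"node-a", "node-b", "node-c"}))
print(should_transfer("node-a", state, Suggestion(7, 3, "node-a", "node-b")))  # True
print(should_transfer("node-a", state, Suggestion(7, 2, "node-a", "node-b")))  # False: stale term
```

A rejected suggestion costs nothing beyond a timeout on the planner side, which is what keeps the planner advisory.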
Other safety properties include:
- Only the partition `0` leader creates plans
- An incomplete report view causes the planner to wait
- Recently elected and recently moved partitions are temporarily ineligible
- Per-pass and concurrent-transfer limits bound the amount of movement
- The balancer does not trigger or accelerate node failure handling or elections
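The temporary ineligibility of recently elected and recently moved partitions can be sketched as two age checks. The function `is_eligible` and its timestamp arguments are assumptions for illustration. The two defaults are taken from `--raft-min-leader-stability-ms` and `--raft-move-cooldown`.

```python
MIN_LEADER_STABILITY_MS = 5_000  # default of --raft-min-leader-stability-ms
MOVE_COOLDOWN_MS = 60_000        # default of --raft-move-cooldown


def is_eligible(now_ms, leader_since_ms, last_moved_ms=None,
                min_stability_ms=MIN_LEADER_STABILITY_MS,
                cooldown_ms=MOVE_COOLDOWN_MS):
    """Return True if a partition may be considered for a leadership move."""
    if now_ms - leader_since_ms < min_stability_ms:
        return False  # leader was elected too recently
    if last_moved_ms is not None and now_ms - last_moved_ms < cooldown_ms:
        return False  # the balancer moved this partition too recently
    return True


print(is_eligible(now_ms=10_000, leader_since_ms=6_000))    # False: leader is 4 s old
print(is_eligible(now_ms=100_000, leader_since_ms=50_000,
                  last_moved_ms=50_000))                    # False: still cooling down
print(is_eligible(now_ms=110_000, leader_since_ms=50_000,
                  last_moved_ms=50_000))                    # True
```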
The expected failure modes are a skipped transfer or an unnecessary transfer. Normal Raft validation prevents the balancer from creating conflicting leaders or losing committed data.