Metrics And Diagnostics
Kommander exposes runtime diagnostics through two main channels:
- metrics from the Kommander .NET Meter
- structured logs from the Raft runtime.
Together, they help answer practical questions such as:
- why a proposal is slow
- whether a partition is overloaded
- whether scheduler fairness is holding under load
- whether leadership churn is causing stale completions or elections.
Meter Name
Kommander publishes metrics through the .NET Meter named:
Kommander
Any OpenTelemetry or MeterListener consumer should subscribe to that meter name.
What Is Tracked
The current code exports these core signals.
Partition Queue Depth
Observable gauge:
raft.executor.client_queue_depth
This reports the current client proposal queue depth per partition executor, tagged by:
partition_id
This is the main queue-depth metric currently exported directly by KommanderMetrics.
Operation Throughput And Latency
Counters:
- raft.executor.operations_total
- raft.executor.rejections_total
Histogram:
raft.executor.operation_duration_ms
These are tagged by:
- partition_id
- operation_class
operation_class corresponds to the executor work classes:
- Control
- Replication
- Client
- Maintenance
This is the most useful place to look when an operation feels slow. If Client latency climbs while Control and Replication stay healthy, the system is preserving Raft priority correctly. If control-plane latency also climbs, you likely have deeper scheduler or storage pressure.
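The reasoning above can be sketched as a small offline check over per-class latency figures. This is only an illustration, not part of Kommander: the function name `classify_latency`, the use of p99 values, and the budget numbers are all assumptions, and the inputs are expected to come from whatever backend stores raft.executor.operation_duration_ms.

```python
from typing import Mapping

# Work classes that carry Raft control-plane traffic (names from operation_class).
CONTROL_PLANE_CLASSES = ("Control", "Replication")


def classify_latency(p99_ms: Mapping[str, float], budget_ms: Mapping[str, float]) -> str:
    """Classify a latency snapshot keyed by operation_class.

    p99_ms and budget_ms are hypothetical inputs: observed p99 latency and
    the latency you consider acceptable, both in milliseconds.
    """
    control_plane_slow = any(
        p99_ms[cls] > budget_ms[cls] for cls in CONTROL_PLANE_CLASSES
    )
    if control_plane_slow:
        # Control-plane work is slow too: look at the scheduler or storage.
        return "scheduler or storage pressure"
    if p99_ms["Client"] > budget_ms["Client"]:
        # Only client work is slow: Raft priority is being preserved.
        return "client pressure, Raft priority preserved"
    return "healthy"


# Illustrative numbers only.
budget = {"Control": 5.0, "Replication": 5.0, "Client": 50.0}
observed = {"Control": 1.5, "Replication": 2.0, "Client": 120.0}
verdict = classify_latency(observed, budget)
print(verdict)
```

The Maintenance class is left out of the sketch because the text above only contrasts Client latency with Control and Replication latency.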
WAL Batching And Throughput
Counters:
- raft.wal.batches_total
- raft.wal.operations_total
Histogram:
raft.wal.batch_size
raft.wal.batches_total increments once per scheduler group write. A group write may span more than one partition.
raft.wal.batch_size records the number of WAL write operations drained for each partition inside that group write. It is a per-partition batch-size distribution, not the number of partitions included in the group.
This helps validate scheduler fairness and batching efficiency under load:
- very small batches can mean poor batching opportunities or low traffic
- consistently large batches can mean good amortization
- large batches paired with rising latency can mean the system is absorbing bursts but paying for them in per-flush delay.
- raft.wal.operations_total increasing faster than raft.wal.batches_total usually means batching is reducing storage calls.
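The ratio between the two counters can be computed from any two scrapes. The sketch below is a minimal offline calculation, not a Kommander API: `WalSnapshot` and `ops_per_group_write` are hypothetical names, and it assumes raft.wal.operations_total counts individual WAL write operations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WalSnapshot:
    """One scrape of the two WAL counters (hypothetical container)."""
    batches_total: int      # value of raft.wal.batches_total
    operations_total: int   # value of raft.wal.operations_total


def ops_per_group_write(earlier: WalSnapshot, later: WalSnapshot) -> Optional[float]:
    """Average WAL operations per scheduler group write between two scrapes.

    Returns None when no group write happened in the interval.
    """
    batches = later.batches_total - earlier.batches_total
    operations = later.operations_total - earlier.operations_total
    if batches <= 0:
        return None
    return operations / batches


# Illustrative numbers only: 1200 operations over 100 group writes.
before = WalSnapshot(batches_total=1_000, operations_total=4_000)
after = WalSnapshot(batches_total=1_100, operations_total=5_200)
ratio = ops_per_group_write(before, after)
print(ratio)  # 12.0
```

Because a group write may span more than one partition, this ratio describes operations per group write. For the per-partition view, read the raft.wal.batch_size histogram instead.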
Stale Completion Drops
Counter:
raft.stale_completions_total
This counts WAL completions that were discarded because they were stale, such as:
- wrong partition
- wrong term
- mismatched operation id.
A sustained rise here usually points to leadership churn, delayed completions, or retries arriving after the partition has already moved on.
Elections And Heartbeats
Counters:
- raft.elections_started_total
- raft.heartbeats_sent_total
Histogram:
raft.election_delay_ms
raft.election_delay_ms records how long it had been since the last received heartbeat when an election started.
This is the direct delay metric currently exported in the codebase. The current KommanderMetrics implementation does not expose a separate heartbeat delay histogram by name. Heartbeat behavior is instead observed through:
- raft.heartbeats_sent_total
- raft.election_delay_ms
- slow-dispatch logs
- proposal and append latency patterns.
What The Logs Add
Metrics tell you that something is slow. Logs help explain which request or partition was slow.
Two existing configuration thresholds remain especially useful:
- SlowRaftStateMachineLog
- SlowRaftWALMachineLog
After the actor runtime removal, the state machine still runs behind the serial partition executor. Slow dispatch logs are emitted from RaftPartitionExecutor, so the old “slow Raft state machine” idea is still useful even though the implementation is no longer actor-based.
Typical useful log patterns include:
- slow dispatch warnings from the partition executor
- election start warnings that include time since last heartbeat
- stale WAL completion warnings
- WAL restore and proposal completion timing
- WAL write timing logs in the storage path.
How To Reason About Slow Operations
When an operation is slow, check the signals in this order:
1. raft.executor.client_queue_depth
2. raft.executor.rejections_total
3. raft.executor.operation_duration_ms
4. raft.wal.batch_size
5. raft.stale_completions_total
6. raft.election_delay_ms
That usually lets you distinguish between:
- client admission pressure
- WAL batching or storage pressure
- election churn
- stale-completion cleanup after leadership changes.
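One way to make that ordering repeatable is to encode it as data and walk it against values you have already pulled from your metrics backend. The sketch below is an assumption-heavy illustration: the `triage` function, the pairing of each metric with a suspected cause, and every limit are hypothetical, and each observed value is a single number the caller derived (a current gauge value, a rate over a window, or a percentile).

```python
from typing import List, Mapping, Tuple

# Signals in the order they should be checked, each paired with the cause it
# most plausibly points to. The pairing is an assumption made for this sketch.
TRIAGE_ORDER: List[Tuple[str, str]] = [
    ("raft.executor.client_queue_depth", "client admission pressure"),
    ("raft.executor.rejections_total", "client admission pressure"),
    ("raft.executor.operation_duration_ms", "WAL batching or storage pressure"),
    ("raft.wal.batch_size", "WAL batching or storage pressure"),
    ("raft.stale_completions_total", "stale-completion cleanup after leadership changes"),
    ("raft.election_delay_ms", "election churn"),
]


def triage(observed: Mapping[str, float], limits: Mapping[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, suspected cause) pairs whose value exceeds its limit.

    Results keep the triage order. Metrics missing from either mapping are skipped.
    """
    findings = []
    for metric, cause in TRIAGE_ORDER:
        value = observed.get(metric)
        limit = limits.get(metric)
        if value is not None and limit is not None and value > limit:
            findings.append((metric, cause))
    return findings


# Illustrative numbers only.
limits = {
    "raft.executor.client_queue_depth": 100,
    "raft.executor.rejections_total": 0,
    "raft.executor.operation_duration_ms": 50,
    "raft.wal.batch_size": 128,
    "raft.stale_completions_total": 0,
    "raft.election_delay_ms": 1000,
}
observed = {
    "raft.executor.client_queue_depth": 5,
    "raft.executor.rejections_total": 0,
    "raft.executor.operation_duration_ms": 80,
    "raft.wal.batch_size": 64,
    "raft.stale_completions_total": 12,
    "raft.election_delay_ms": 300,
}
findings = triage(observed, limits)
for metric, cause in findings:
    print(f"{metric}: {cause}")
```

With these sample values the queue is shallow and nothing is rejected, so the flagged signals point away from client admission pressure and toward the storage path and recent leadership changes.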
Validating Scheduler Fairness
Scheduler fairness is not one single metric. You validate it by looking at the shape of several signals together:
- Control and Replication latency should remain bounded even when Client traffic spikes.
- client queue depth may grow, but heartbeats and elections should still progress.
- WAL batch sizes should increase under load without one hot partition causing total starvation elsewhere.
- ProposalQueueFull rejections are preferable to runaway memory growth or broken Raft responsiveness.
In load tests, the important question is not “did the queue grow?” but “did control-plane work stay healthy while load increased?”
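A load test can turn that question into an assertion by comparing a baseline window with a loaded window. The sketch below is a minimal example under stated assumptions: `LoadWindow`, `fairness_held`, the choice of p99, and the `max_slowdown` factor are all hypothetical, and the inputs are values you extract yourself from raft.executor.operation_duration_ms and raft.heartbeats_sent_total.

```python
from dataclasses import dataclass
from typing import Mapping

CONTROL_PLANE_CLASSES = ("Control", "Replication")


@dataclass(frozen=True)
class LoadWindow:
    """Summary of one measurement window (hypothetical container)."""
    p99_ms: Mapping[str, float]  # p99 latency per operation_class
    heartbeats_sent: int         # increase in raft.heartbeats_sent_total


def fairness_held(baseline: LoadWindow, loaded: LoadWindow, max_slowdown: float = 2.0) -> bool:
    """True if control-plane work stayed healthy while load increased.

    Client latency is deliberately ignored: it is allowed to grow under load.
    """
    latency_bounded = all(
        loaded.p99_ms[cls] <= baseline.p99_ms[cls] * max_slowdown
        for cls in CONTROL_PLANE_CLASSES
    )
    heartbeats_progressed = loaded.heartbeats_sent > 0
    return latency_bounded and heartbeats_progressed


# Illustrative numbers only: client latency grows 40x, control plane stays close to baseline.
baseline = LoadWindow({"Control": 2.0, "Replication": 3.0, "Client": 10.0}, heartbeats_sent=600)
loaded = LoadWindow({"Control": 3.0, "Replication": 4.5, "Client": 400.0}, heartbeats_sent=590)
held = fairness_held(baseline, loaded)
print(held)  # True
```

A stricter test might also compare heartbeat counts between the two windows; the check here only requires that heartbeats kept being sent.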
Important Limits
Two distinctions matter when reading the current diagnostics surface:
- Kommander currently exports direct metrics for partition executor queue depth, not a full family of per-scheduler queue-depth gauges.
- Kommander currently exports direct election-delay telemetry, but not a separate heartbeat-delay histogram.
That still gives enough visibility to diagnose most overload and fairness problems when you combine the metrics with slow-dispatch and election logs.