Metrics And Diagnostics

Kommander exposes runtime diagnostics through two main channels:

metrics from the Kommander .NET Meter
structured logs from the Raft runtime.

Together, they help answer practical questions such as:

why a proposal is slow
whether a partition is overloaded
whether scheduler fairness is holding under load
whether leadership churn is causing stale completions or elections.

Meter Name

Kommander publishes metrics through the .NET Meter named:

Kommander

Any OpenTelemetry or MeterListener consumer should subscribe to that meter name.

What Is Tracked

The current code exports these core signals.

Partition Queue Depth

Observable gauge:

raft.executor.client_queue_depth

This reports the current client proposal queue depth per partition executor, tagged by:

partition_id

This is the main queue-depth metric currently exported directly by KommanderMetrics.

Operation Throughput And Latency

Counters:

raft.executor.operations_total
raft.executor.rejections_total

Histogram:

raft.executor.operation_duration_ms

These are tagged by:

partition_id
operation_class

operation_class corresponds to the executor work classes:

Control
Replication
Client
Maintenance

This is the most useful place to look when an operation feels slow. If Client latency climbs while Control and Replication stay healthy, the system is preserving Raft priority correctly. If control-plane latency also climbs, you likely have deeper scheduler or storage pressure.

WAL Batching And Throughput

Counters:

raft.wal.batches_total
raft.wal.operations_total

Histogram:

raft.wal.batch_size

raft.wal.batches_total increments once per scheduler group write. A group write may span more than one partition.

raft.wal.batch_size records the number of WAL write operations drained for each partition inside that group write. It is a per-partition batch-size distribution, not the number of partitions included in the group.

This helps validate scheduler fairness and batching efficiency under load:

very small batches can mean poor batching opportunities or low traffic
consistently large batches can mean good amortization
large batches paired with rising latency can mean the system is absorbing bursts but paying for them in per-flush delay.
increasing raft.wal.operations_total faster than raft.wal.batches_total usually means batching is reducing storage calls.

Stale Completion Drops

Counter:

raft.stale_completions_total

This counts WAL completions that were discarded because they were stale, such as:

wrong partition
wrong term
mismatched operation id.

A sustained rise here usually points to leadership churn, delayed completions, or retries arriving after the partition has already moved on.

Elections And Heartbeats

Counters:

raft.elections_started_total
raft.heartbeats_sent_total

Histogram:

raft.election_delay_ms

raft.election_delay_ms records how long it had been since the last received heartbeat when an election started.

This is the direct delay metric currently exported in the codebase. The current KommanderMetrics implementation does not expose a separate heartbeat delay histogram by name. Heartbeat behavior is instead observed through:

raft.heartbeats_sent_total
raft.election_delay_ms
slow-dispatch logs
proposal and append latency patterns.

What The Logs Add

Metrics tell you that something is slow. Logs help explain which request or partition was slow.

Two existing configuration thresholds remain especially useful:

SlowRaftStateMachineLog
SlowRaftWALMachineLog

After the actor runtime removal, the state machine still runs behind the serial partition executor. Slow dispatch logs are emitted from RaftPartitionExecutor, so the old “slow Raft state machine” idea is still useful even though the implementation is no longer actor-based.

Typical useful log patterns include:

slow dispatch warnings from the partition executor
election start warnings that include time since last heartbeat
stale WAL completion warnings
WAL restore and proposal completion timing
WAL write timing logs in the storage path.

How To Reason About Slow Operations

When an operation is slow, check the signals in this order:

raft.executor.client_queue_depth
raft.executor.rejections_total
raft.executor.operation_duration_ms
raft.wal.batch_size
raft.stale_completions_total
raft.election_delay_ms

That usually lets you distinguish between:

client admission pressure
WAL batching or storage pressure
election churn
stale-completion cleanup after leadership changes.

Validating Scheduler Fairness

Scheduler fairness is not one single metric. You validate it by looking at the shape of several signals together:

Control and Replication latency should remain bounded even when Client traffic spikes.
client queue depth may grow, but heartbeats and elections should still progress.
WAL batch sizes should increase under load without one hot partition causing total starvation elsewhere.
ProposalQueueFull rejections are preferable to runaway memory growth or broken Raft responsiveness.

In load tests, the important question is not “did the queue grow?” but “did control-plane work stay healthy while load increased?”

Important Limits

Two distinctions matter when reading the current diagnostics surface:

Kommander currently exports direct metrics for partition executor queue depth, not a full family of per-scheduler queue-depth gauges.
Kommander currently exports direct election-delay telemetry, but not a separate heartbeat-delay histogram.

That still gives enough visibility to diagnose most overload and fairness problems when you combine the metrics with slow-dispatch and election logs.

Meter Name​

What Is Tracked​

Partition Queue Depth​

Operation Throughput And Latency​

WAL Batching And Throughput​

Stale Completion Drops​

Elections And Heartbeats​

What The Logs Add​

How To Reason About Slow Operations​

Validating Scheduler Fairness​

Important Limits​

Meter Name

What Is Tracked

Partition Queue Depth

Operation Throughput And Latency

WAL Batching And Throughput

Stale Completion Drops

Elections And Heartbeats

What The Logs Add

How To Reason About Slow Operations

Validating Scheduler Fairness

Important Limits