
Configuration

RaftConfiguration controls node identity, network behavior, election timing, fair WAL scheduler workers, queue limits, dynamic membership, quiescence, and log compaction.

| Property | Default | Description |
| --- | --- | --- |
| NodeName | machine name | Stable node name used when deriving a node id. |
| NodeId | 0 | Integer node id. 0 means derive from NodeName. |
| Host | null | Host advertised as part of the node endpoint. |
| Port | 0 | Port advertised as part of the node endpoint. |
| InitialPartitions | 1 | Number of initial user partitions. Partition 0 is reserved; application partitions start at 1. |
| HttpScheme | https:// | Scheme used by RestCommunication. |
| TransportSecurity | new options object | Transport security and node authentication settings for network transports. |
| HttpAuthBearerToken | empty | Legacy bearer token for REST requests. Prefer TransportSecurity.SharedSecret or other TransportSecurity settings instead. |
| HttpTimeout | 5 | REST request timeout in seconds. |
| HttpVersion | 2.0 | REST HTTP version. |
| HeartbeatInterval | 500 ms | Leader heartbeat interval. |
| RecentHeartbeat | 100 ms | Per-partition heartbeat throttle window for leader heartbeats sent to a follower. |
| VotingTimeout | 1500 ms | Candidate vote wait timeout. |
| CheckLeaderInterval | 250 ms | Leader election supervision interval. |
| TimerInitialDelay | 2500 ms | Initial delay before periodic Raft timers start firing. |
| UpdateNodesInterval | 5000 ms | Discovery refresh interval. |
| StartElectionTimeout | 2000 ms | Lower election timeout bound. |
| EndElectionTimeout | 4000 ms | Upper election timeout bound. |
| StartElectionTimeoutIncrement | 100 ms | Lower timeout backoff increment. |
| EndElectionTimeoutIncrement | 200 ms | Upper timeout backoff increment. |
| ElectionTimeoutSeed | null | Optional deterministic seed for partition election timeout randomization. Use in tests and simulations when you need reproducible leader-election timing. |
| SlowRaftStateMachineLog | 50 ms | Slow partition state-machine operation warning threshold. |
| SlowRaftWALMachineLog | 25 ms | Slow WAL warning threshold. |
| ReadIOThreads | 8 | Fair scheduler workers for synchronous WAL reads. |
| WriteIOThreads | 4 | Fair scheduler workers for synchronous WAL writes. |
| MaxQueuedClientProposalsPerPartition | 2048 | Per-partition client proposal queue limit. When full, new proposals are rejected with ProposalQueueFull. Set to 0 or lower to disable the limit. |
| MaxWalQueueDepthPerPartition | 4096 | Per-partition WAL scheduler pending-write depth limit. When exceeded, WAL backpressure is propagated instead of allowing unbounded growth. |
| MaxGlobalWalQueueDepth | 0 | Global WAL scheduler pending-write depth limit across all partitions. 0 disables the global cap and keeps only per-partition limits. |
| MaxWalBatchSize | 256 | Maximum WAL write operations grouped into one storage flush. Larger batches reduce call overhead but can increase individual write latency. |
| MaxWalGroupBatchPartitions | 64 | Maximum number of ready partitions coalesced into one cross-partition WAL write call. For RocksDB this can reduce many partition writes to one db.Write / fsync. For SQLite this allows the adapter to group writes by shard. |
| MaxDrainQuantumControl | 8 | Maximum control-plane operations drained per partition-executor wake cycle. |
| MaxDrainQuantumReplication | 4 | Maximum replication operations drained per partition-executor wake cycle. |
| MaxDrainQuantumClient | 2 | Maximum client operations drained per partition-executor wake cycle. |
| MaxDrainQuantumMaintenance | 1 | Maximum maintenance operations drained per partition-executor wake cycle. |
| EnableQuiescence | true | Allows idle partitions to suppress per-partition heartbeats and rely on SWIM node liveness until new work arrives. |
| QuiesceAfter | 1500 ms | Idle duration before a leader quiesces a partition. Requires no active proposals. |
| BackfillThreshold | 10 | Follower lag threshold that switches the leader from empty heartbeats to active committed-log backfill. |
| MaxBackfillEntriesPerRound | 128 | Maximum committed log entries shipped to one stale follower per backfill round. |
| LearnerPromotionLag | 10 | Maximum lag a learner may have on any partition and still be considered caught up enough for promotion. |
| LearnerPromotionStableWindow | 3 s | How long a learner must remain within LearnerPromotionLag before promotion to voter. |
| GossipInterval | 5 s | Interval between membership gossip rounds. |
| GossipFanout | 2 | Random peers contacted per gossip round. 0 disables gossip. |
| PingTimeout | 500 ms | SWIM direct/indirect probe timeout. |
| IndirectPingFanout | 2 | Number of relay peers used for indirect SWIM probes. |
| SuspicionTimeout | 5 s | How long a node stays Suspect before becoming Dead. |
| DeadMemberEvictionGrace | 30 s | How long a node remains Dead before the system-partition leader evicts it. |
| PingInterval | 1 s | SWIM ping round interval. Set to 0 or lower to disable the detector. Must be greater than 0 and lower than StartElectionTimeout when EnableQuiescence is true. |
| CompactEveryOperations | 10000 | Committed operations between automatic WAL compaction triggers per partition. Set to 0 or lower to disable automatic compaction. |
| CompactNumberEntries | 100 | Max entries the WAL adapter is asked to remove per CompactLogsOlderThan call. Values below 1 are treated as 1. |
| MaxEntriesPerCompaction | 5000 | Upper bound on entries removed during one triggered compaction pass before yielding. Values below CompactNumberEntries are treated as CompactNumberEntries. |
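
The two compaction clamps described for CompactNumberEntries and MaxEntriesPerCompaction can be modeled in a few lines. This is a minimal Python sketch of the documented rules only, not Kommander code: the function name is hypothetical, and it assumes the second clamp is applied against the already-clamped per-call value.

```python
def effective_compaction_limits(compact_number_entries, max_entries_per_compaction):
    """Return (entries_per_call, entries_per_pass) after the documented clamps."""
    # Values below 1 are treated as 1.
    per_call = max(1, compact_number_entries)
    # Values below CompactNumberEntries are treated as CompactNumberEntries.
    # ASSUMPTION: the comparison uses the clamped per-call value.
    per_pass = max(per_call, max_entries_per_compaction)
    return per_call, per_pass


# With the defaults, one pass may remove up to 5000 entries in calls of up to 100.
print(effective_compaction_limits(100, 5000))
# A misconfigured pair is raised to a consistent one.
print(effective_compaction_limits(0, 0))
```

With the default values, removing the full 5000-entry budget of one pass takes at least 50 CompactLogsOlderThan calls, since each call is asked for at most 100 entries.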

Transport Security

TransportSecurity is a nested RaftTransportSecurityOptions object used by network transports such as REST and gRPC.

| Property | Default | Description |
| --- | --- | --- |
| NodeAuthenticationMode | Disabled | Node-to-node authentication mode. Supported values are Disabled, SharedSecret, and MutualTls. |
| SharedSecret | null | Shared cluster secret used for signed node-to-node requests when NodeAuthenticationMode is SharedSecret. |
| HeaderName | X-Kommander-Cluster-Auth | HTTP header or transport metadata name that carries the request signature. |
| RequireTls | true | Reject non-TLS network transport requests when authentication requires secure transport. |
| AllowInsecureCertificateValidation | false | Development-only certificate validation bypass for client transports. Do not enable in production. |
| AllowedClockSkew | 60 s | Maximum clock skew allowed when validating signed requests. |
| TrustedServerCertificateThumbprints | empty | Optional allow-list of trusted server certificate thumbprints. |
| TrustedClientCertificateThumbprints | empty | Optional allow-list of trusted client certificate thumbprints. |

The configuration still supports HttpAuthBearerToken for legacy compatibility. Internally, GetEffectiveTransportSecurity() falls back to that bearer token when TransportSecurity.SharedSecret is not set.
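
The fallback order described above can be illustrated with a small Python model. It is only a sketch of the documented precedence: the function name is hypothetical, the real GetEffectiveTransportSecurity() is not shown in this page, and treating an empty string as "not set" is an assumption.

```python
from typing import Optional


def effective_shared_secret(shared_secret: Optional[str],
                            legacy_bearer_token: str) -> Optional[str]:
    """Pick the secret used for node authentication (illustrative model only)."""
    # TransportSecurity.SharedSecret wins whenever it is set.
    # ASSUMPTION: None and "" both count as "not set".
    if shared_secret:
        return shared_secret
    # Otherwise fall back to the legacy HttpAuthBearerToken, if any.
    return legacy_bearer_token or None


print(effective_shared_secret("cluster-secret", "legacy-token"))  # SharedSecret wins
print(effective_shared_secret(None, "legacy-token"))              # legacy fallback
```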

Queueing And Backpressure

Kommander uses explicit admission control so client traffic and WAL pressure cannot grow without bound.

  • MaxQueuedClientProposalsPerPartition limits pending client proposals inside a partition executor.
  • MaxWalQueueDepthPerPartition and MaxGlobalWalQueueDepth limit queued WAL writes before scheduler backpressure is raised.
  • MaxWalBatchSize controls how many WAL write operations may be combined into one flush.
  • MaxWalGroupBatchPartitions controls how many ready partitions may share one cross-partition WAL write call.

When the client proposal limit for a partition is hit, the runtime rejects new proposals with RaftOperationStatus.ProposalQueueFull instead of letting memory usage grow indefinitely.
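
The admission rule for client proposals can be sketched as a bounded queue. The Python below is a simplified model under stated assumptions, not the partition executor itself: the class and method names are hypothetical, and only the limit semantics (reject when full, 0 or lower disables the limit) come from the table above.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    ACCEPTED = "Accepted"                      # hypothetical name for the success case
    PROPOSAL_QUEUE_FULL = "ProposalQueueFull"  # status named in the documentation


class ProposalQueue:
    """Per-partition proposal queue with explicit admission control."""

    def __init__(self, max_queued=2048):
        self.max_queued = max_queued
        self._pending = deque()

    def try_enqueue(self, proposal):
        # A limit of 0 or lower disables the check entirely.
        if self.max_queued > 0 and len(self._pending) >= self.max_queued:
            return Status.PROPOSAL_QUEUE_FULL
        self._pending.append(proposal)
        return Status.ACCEPTED


queue = ProposalQueue(max_queued=2)
print([queue.try_enqueue(i).value for i in range(3)])
```

Rejecting at enqueue time keeps the cost of overload on the caller, who can retry or shed load, rather than on the node's memory.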

WAL Write Batching

FairWalScheduler can batch writes in two dimensions:

| Property | Default | Description |
| --- | --- | --- |
| MaxWalBatchSize | 256 | Maximum operations drained from one partition into a single WAL batch. |
| MaxWalGroupBatchPartitions | 64 | Maximum ready partitions coalesced into one IWAL.Write call. |
| WriteIOThreads | 4 | Number of scheduler workers. Each worker can process one cross-partition group batch at a time. |
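
The two batching dimensions can be pictured as two nested limits. The following Python sketch is a toy model, not FairWalScheduler: the function name and the dictionary-based queue layout are assumptions, and fairness ordering between partitions is deliberately left out.

```python
def build_group_batch(pending, max_batch_size=256, max_group_partitions=64):
    """Form one cross-partition group batch from pending writes.

    pending maps partition id -> list of queued write operations.
    """
    group = {}
    for partition_id, operations in pending.items():
        # Dimension 2: cap the number of partitions in one write call.
        if len(group) >= max_group_partitions:
            break
        if not operations:
            continue  # only ready partitions join the group
        # Dimension 1: cap the operations taken from each partition.
        group[partition_id] = operations[:max_batch_size]
    return group


# 100 ready partitions with 300 queued writes each.
pending = {p: list(range(300)) for p in range(1, 101)}
group = build_group_batch(pending)
print(len(group), sum(len(ops) for ops in group.values()))
```

With the defaults, a single group batch therefore carries at most 64 x 256 = 16384 operations; the rest wait for a later batch or another worker.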

For RocksDB, a group batch spanning many partitions is written through one WriteBatch, which can reduce fsync pressure significantly in many-partition deployments.

For SQLite, partitions are distributed across a fixed shard pool. The scheduler still submits one cross-partition IWAL.Write call, and SqliteWAL groups that call by shard before writing. A batch with P partitions across S SQLite shards costs S transactions and fsyncs, not P. When shardCount is 1, every partition shares one shard and the whole scheduler group can commit in one SQLite transaction.

That creates a practical tuning tradeoff:

  • fewer SQLite shards improve batching and reduce fsync pressure
  • more SQLite shards allow more independent read/write concurrency
  • the shard count is fixed for a WAL data directory after initialization, because changing it would remap partitions to different database files
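
The "S transactions, not P" cost for SQLite can be checked with a quick calculation. This Python sketch assumes partitions map to shards by partition id modulo shard count; that mapping is a placeholder for illustration, since the actual SqliteWAL assignment is not described here.

```python
def shards_touched(partition_ids, shard_count):
    """Return the set of shards a group batch writes to."""
    # ASSUMPTION: modulo placement; the real mapping may differ.
    return {partition_id % shard_count for partition_id in partition_ids}


batch = range(1, 65)  # one group batch covering 64 partitions

for shard_count in (1, 4, 16):
    transactions = len(shards_touched(batch, shard_count))
    print(f"{shard_count} shard(s): {transactions} transaction(s) for 64 partitions")
```

Under this model the same 64-partition batch costs 1, 4, or 16 transactions and fsyncs depending on the shard count, which is the batching side of the tradeoff listed above.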

Dynamic Membership

Kommander supports runtime cluster membership management with learners, promotion, gossip dissemination, and SWIM-based failure detection.

| Property | Default | Description |
| --- | --- | --- |
| BackfillThreshold | 10 | Follower lag threshold that switches the leader from empty heartbeats to active committed-log backfill. |
| MaxBackfillEntriesPerRound | 128 | Maximum committed log entries shipped to one stale follower per backfill round. |
| LearnerPromotionLag | 10 | Maximum lag a learner may have on any partition and still be considered caught up enough for promotion. |
| LearnerPromotionStableWindow | 3 s | How long a learner must remain within LearnerPromotionLag before promotion to voter. |
| GossipInterval | 5 s | Interval between membership gossip rounds. |
| GossipFanout | 2 | Random peers contacted per gossip round. 0 disables gossip. |
| PingInterval | 1 s | SWIM ping round interval. Set to 0 or lower to disable the detector. Must stay below StartElectionTimeout when quiescence is enabled. |
| PingTimeout | 500 ms | SWIM direct/indirect probe timeout. |
| IndirectPingFanout | 2 | Number of relay peers used for indirect SWIM probes. |
| SuspicionTimeout | 5 s | How long a node stays Suspect before becoming Dead. |
| DeadMemberEvictionGrace | 30 s | How long a node remains Dead before the system-partition leader evicts it. |

The built-in in-memory, gRPC, and REST transports all implement direct and indirect SWIM pings. If you disable SWIM by setting PingInterval to 0, also set EnableQuiescence = false, because quiesced followers depend on SWIM node liveness to detect a failed leader node.
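
The interaction between LearnerPromotionLag and LearnerPromotionStableWindow can be sketched as a small tracker. This is a Python illustration under assumptions, not the membership code: the class and method names are hypothetical, lag is compared inclusively, and any observation above the lag limit restarts the window.

```python
class LearnerPromotionTracker:
    """Decides when a learner has been caught up for long enough."""

    def __init__(self, promotion_lag=10, stable_window_s=3.0):
        self.promotion_lag = promotion_lag
        self.stable_window_s = stable_window_s
        self._within_since = None  # time the learner last entered the lag limit

    def observe(self, now_s, lag_by_partition):
        """Record one lag sample; return True when promotion is allowed."""
        # The limit applies to the worst partition, not the average.
        worst_lag = max(lag_by_partition.values(), default=0)
        if worst_lag > self.promotion_lag:
            self._within_since = None  # ASSUMPTION: falling behind resets the window
            return False
        if self._within_since is None:
            self._within_since = now_s
        return now_s - self._within_since >= self.stable_window_s


tracker = LearnerPromotionTracker()
print(tracker.observe(0.0, {1: 5, 2: 3}))   # within the limit, window just started
print(tracker.observe(3.0, {1: 4, 2: 0}))   # stayed within the limit for 3 s
```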

Partition Quiescence

Quiescence suppresses per-partition heartbeat traffic for idle partitions. A leader sends a final quiesce marker, then followers rely on SWIM node liveness until the partition wakes up again.

| Property | Default | Description |
| --- | --- | --- |
| EnableQuiescence | true | Enables quiescence for idle partitions. Set to false to keep sending per-partition heartbeats on every heartbeat interval. |
| QuiesceAfter | 1500 ms | How long a partition must be idle, with no active proposals, before it quiesces. |
| PingInterval | 1 s | SWIM probe cadence used by quiesced followers to detect leader-node failure. Must be greater than 0 and lower than StartElectionTimeout when quiescence is enabled. |
| SuspicionTimeout | 5 s | Time from Suspect to Dead. Quiesced failover starts on Suspect, not Dead. |
| StartElectionTimeout | 2000 ms | Lower election timeout bound. PingInterval must be below this while quiescence is enabled. |
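
The constraint between PingInterval, StartElectionTimeout, and EnableQuiescence is easy to get wrong when tuning, so a pre-flight check can help. The Python sketch below only restates the documented rule; the function name and error wording are hypothetical, and it does not reproduce whatever validation the library performs itself.

```python
def validate_quiescence(enable_quiescence, ping_interval_ms, start_election_timeout_ms):
    """Return a list of problems with the quiescence-related settings."""
    problems = []
    if not enable_quiescence:
        return problems  # the constraint only applies while quiescence is on
    if ping_interval_ms <= 0:
        problems.append("PingInterval must be greater than 0 when EnableQuiescence is true")
    elif ping_interval_ms >= start_election_timeout_ms:
        problems.append("PingInterval must be lower than StartElectionTimeout")
    return problems


print(validate_quiescence(True, 1000, 2000))  # defaults: no problems
print(validate_quiescence(True, 0, 2000))     # SWIM disabled but quiescence still on
```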

Executor Drain Quanta

The MaxDrainQuantum* settings tune how many operations each partition executor drains per wake cycle for each work class:

  • control
  • replication
  • client
  • maintenance.

Higher control and replication quanta help Raft protocol traffic stay ahead of client floods. In most deployments, the defaults are the right starting point.
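
One way to picture the quanta is a single wake cycle that visits each work class and drains at most its quantum. The Python below is a simplified model, not the partition executor: the function name is hypothetical, and visiting classes in the order control, replication, client, maintenance is an assumption.

```python
from collections import deque

# Defaults from the MaxDrainQuantum* settings.
DEFAULT_QUANTA = {"control": 8, "replication": 4, "client": 2, "maintenance": 1}


def drain_wake_cycle(queues, quanta=DEFAULT_QUANTA):
    """Drain up to one quantum of work per class and return it in order."""
    drained = []
    for work_class, quantum in quanta.items():
        queue = queues.get(work_class, deque())
        # Take at most `quantum` operations, fewer if the queue is shorter.
        for _ in range(min(quantum, len(queue))):
            drained.append(queue.popleft())
    return drained


# Every class is flooded with 10 operations.
queues = {name: deque(f"{name}-{i}" for i in range(10)) for name in DEFAULT_QUANTA}
cycle = drain_wake_cycle(queues)
print(len(cycle), len(queues["client"]))
```

In this model a flooded client queue gets only 2 of the 15 slots in a cycle, while control and replication work get 12, which is how protocol traffic stays ahead of client load.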

Timing Notes

Two timing behaviors matter for operators and test authors:

  • ElectionTimeoutSeed lets each partition derive its election timeout randomness from a deterministic seed combined with the partition id. That makes election behavior reproducible in tests without making every partition use the exact same sequence.
  • RecentHeartbeat throttles heartbeats per (node, partition) pair. That avoids one busy partition suppressing heartbeats for every other partition on the same follower.
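
Both behaviors can be modeled briefly. The Python sketch below is an illustration under assumptions rather than Kommander's implementation: the names are hypothetical, the way the seed is mixed with the partition id is invented for the example, and Python's random module stands in for whatever generator the library uses.

```python
import random


def election_timeouts_ms(seed, partition_id, start_ms=2000, end_ms=4000, draws=5):
    """Reproducible per-partition timeouts drawn from [start_ms, end_ms]."""
    # ASSUMPTION: seed and partition id are mixed arithmetically.
    rng = random.Random(seed * 1_000_003 + partition_id)
    return [rng.randint(start_ms, end_ms) for _ in range(draws)]


class HeartbeatThrottle:
    """Suppresses repeat heartbeats inside the RecentHeartbeat window."""

    def __init__(self, window_ms=100):
        self.window_ms = window_ms
        self._last_sent = {}  # (node, partition id) -> last send time in ms

    def should_send(self, node, partition_id, now_ms):
        key = (node, partition_id)  # keyed per pair, not per node
        last = self._last_sent.get(key)
        if last is not None and now_ms - last < self.window_ms:
            return False
        self._last_sent[key] = now_ms
        return True


# Same seed and partition: identical sequence on every run.
print(election_timeouts_ms(42, 1) == election_timeouts_ms(42, 1))

throttle = HeartbeatThrottle()
print(throttle.should_send("node-b", 1, now_ms=0))   # first heartbeat is sent
print(throttle.should_send("node-b", 1, now_ms=50))  # same pair, inside the window
print(throttle.should_send("node-b", 2, now_ms=50))  # other partition is unaffected
```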