WAL Internals
The write-ahead log, or WAL, is how Kommander remembers Raft log entries across restarts.
RaftWriteAhead is the internal facade between the partition state machine and the configured IWAL adapter.
Log Types
Kommander stores Raft entries with explicit lifecycle types:
| Type | Meaning |
|---|---|
| Proposed | The leader has proposed an application entry. |
| Committed | The proposed entry is committed and can be applied. |
| RolledBack | The proposed entry was explicitly rolled back. |
| ProposedCheckpoint | A proposed checkpoint marker. |
| CommittedCheckpoint | A committed checkpoint marker. |
| RolledBackCheckpoint | A checkpoint proposal that was rolled back. |
The application normally sees committed entries through OnReplicationReceived and restored committed entries through OnLogRestored.
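The six types in the table are three lifecycle stages crossed with a checkpoint flag. The following is a minimal sketch of that classification in Python, not Kommander's own code; the helper names `stage` and `is_checkpoint` are hypothetical and exist only for illustration.

```python
from enum import Enum


class LogType(Enum):
    # Names mirror the table above; the enum itself is illustrative.
    PROPOSED = "Proposed"
    COMMITTED = "Committed"
    ROLLED_BACK = "RolledBack"
    PROPOSED_CHECKPOINT = "ProposedCheckpoint"
    COMMITTED_CHECKPOINT = "CommittedCheckpoint"
    ROLLED_BACK_CHECKPOINT = "RolledBackCheckpoint"


_STAGES = {
    LogType.PROPOSED: "proposed",
    LogType.PROPOSED_CHECKPOINT: "proposed",
    LogType.COMMITTED: "committed",
    LogType.COMMITTED_CHECKPOINT: "committed",
    LogType.ROLLED_BACK: "rolled_back",
    LogType.ROLLED_BACK_CHECKPOINT: "rolled_back",
}

_CHECKPOINTS = {
    LogType.PROPOSED_CHECKPOINT,
    LogType.COMMITTED_CHECKPOINT,
    LogType.ROLLED_BACK_CHECKPOINT,
}


def stage(log_type: LogType) -> str:
    """Return the lifecycle stage shared by an entry type and its checkpoint twin."""
    return _STAGES[log_type]


def is_checkpoint(log_type: LogType) -> bool:
    """Return True for the three checkpoint marker types."""
    return log_type in _CHECKPOINTS
```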
Recovery
When a partition executor starts, it calls WAL recovery before normal operations are accepted.
Recovery:
- reads logs for the partition through ReadScheduler
- advances local propose and commit indexes
- ignores proposed and rolled-back entries for application restore
- invokes OnLogRestored for committed application logs
- invokes system restore callbacks for committed system logs
- marks restore complete for the partition.
If there are no logs, the commit index starts after the adapter's current max log id.
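The recovery steps can be sketched as a single pass over the stored logs. This Python model is an assumption-laden illustration rather than the actual implementation: the `Log` record, the `is_system` flag, the rule that the propose index tracks the highest log id seen, and the reading of "starts after" as max id plus one are all hypothetical. Checkpoint markers are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Log:
    log_id: int
    log_type: str            # "Proposed", "Committed" or "RolledBack"
    is_system: bool = False  # hypothetical flag separating system logs


@dataclass
class RestoreState:
    propose_index: int = 0
    commit_index: int = 0
    restored: bool = False


def recover(
    logs: Iterable[Log],
    adapter_max_log_id: int,
    on_log_restored: Callable[[Log], None],
    on_system_restored: Callable[[Log], None],
) -> RestoreState:
    state = RestoreState()
    seen_any = False
    for log in sorted(logs, key=lambda entry: entry.log_id):
        seen_any = True
        # Assumption: every stored entry advances the propose index.
        state.propose_index = max(state.propose_index, log.log_id)
        if log.log_type != "Committed":
            continue  # proposed and rolled-back entries are not restored
        state.commit_index = max(state.commit_index, log.log_id)
        if log.is_system:
            on_system_restored(log)
        else:
            on_log_restored(log)
    if not seen_any:
        # One reading of "starts after the adapter's current max log id".
        state.commit_index = adapter_max_log_id + 1
    state.restored = True
    return state
```

Only after `restored` is set would the partition accept normal operations, which matches the ordering described above.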
Leader Write Path
For a leader proposal:
- the state machine assigns log ids and the current term
- RaftWriteAhead enqueues a leader propose write
- FairWalScheduler writes the proposed entries
- completion returns to the partition executor
- the state machine creates a proposal quorum tracker
- append-log messages are sent to followers.
For auto-commit proposals, the leader commits after quorum completion. For manual proposals, the caller uses the proposal ticket with CommitLogs or RollbackLogs.
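To make the quorum step concrete, here is a small Python sketch of a proposal quorum tracker. It assumes a simple majority rule and that the leader's own durable write counts as the first acknowledgement; both are standard Raft conventions but are assumptions here, and the class and function names are hypothetical rather than Kommander's.

```python
from typing import Iterable, Set


class ProposalQuorum:
    """Tracks which nodes have acknowledged one proposal."""

    def __init__(self, nodes: Iterable[str], leader: str) -> None:
        self.nodes: Set[str] = set(nodes)
        # Assumption: the leader already wrote the entries to its own WAL.
        self.acked: Set[str] = {leader}

    def ack(self, node: str) -> None:
        # Acknowledgements from nodes outside the partition are ignored.
        if node in self.nodes:
            self.acked.add(node)

    def has_quorum(self) -> bool:
        # Strict majority of the partition's nodes.
        return len(self.acked) > len(self.nodes) // 2


def next_step(quorum: ProposalQuorum, auto_commit: bool) -> str:
    """Decide what happens to a proposal given its quorum state."""
    if not quorum.has_quorum():
        return "waiting"
    # Manual proposals wait for the caller to commit or roll back by ticket.
    return "commit" if auto_commit else "await_ticket"
```

With three nodes, the leader plus one follower acknowledgement is enough to leave the "waiting" state.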
Cross-Partition Group Commit
FairWalScheduler can batch writes from multiple ready partitions into one IWAL.Write call.
The scheduler batches in two layers:
- up to MaxWalBatchSize operations from one partition
- up to MaxWalGroupBatchPartitions partitions in one group write.
The important guarantee is unchanged: writes remain FIFO within each partition. A partition can appear in only one in-flight group batch at a time.
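The two batching layers and the per-partition FIFO guarantee can be modeled with a queue per partition and an in-flight set. This is a simplified Python sketch, not FairWalScheduler itself; the `FairBatcher` class and its methods are hypothetical, and only the two limits correspond to MaxWalBatchSize and MaxWalGroupBatchPartitions.

```python
from collections import deque
from typing import Deque, Dict, List, Set


class FairBatcher:
    def __init__(self, max_wal_batch_size: int, max_wal_group_batch_partitions: int) -> None:
        self.max_ops = max_wal_batch_size
        self.max_partitions = max_wal_group_batch_partitions
        self.queues: Dict[int, Deque[str]] = {}
        self.ready: Deque[int] = deque()   # partitions with work, not in flight
        self.in_flight: Set[int] = set()

    def enqueue(self, partition_id: int, op: str) -> None:
        self.queues.setdefault(partition_id, deque()).append(op)
        if partition_id not in self.in_flight and partition_id not in self.ready:
            self.ready.append(partition_id)

    def next_group(self) -> Dict[int, List[str]]:
        """Build one group write: few partitions, few ops per partition."""
        group: Dict[int, List[str]] = {}
        while self.ready and len(group) < self.max_partitions:
            pid = self.ready.popleft()
            queue = self.queues[pid]
            take = min(self.max_ops, len(queue))
            # Ops leave from the front of the queue, preserving FIFO order.
            group[pid] = [queue.popleft() for _ in range(take)]
            self.in_flight.add(pid)
        return group

    def complete(self, group: Dict[int, List[str]]) -> None:
        """Release partitions so their remaining ops can be scheduled."""
        for pid in group:
            self.in_flight.discard(pid)
            if self.queues[pid]:
                self.ready.append(pid)
```

Because a partition is removed from the ready list while it is in flight, its later operations cannot overtake earlier ones in a second concurrent group.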
For RocksDB, the grouped call is written as one WriteBatch, so many partition writes can share one db.Write and one fsync. The sync cost is amortized across the whole group, which helps most in clusters with many active partitions.
For SQLite, partitions are mapped into a fixed set of shard databases. The adapter groups the scheduler's cross-partition batch by shard, merges same-partition entries inside each shard, and commits one SQLite transaction per shard. A batch of P partitions spanning S shards therefore costs S SQLite transactions and fsyncs. This keeps the scheduler-level fairness model while improving SQLite write amortization.
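The per-shard grouping can be illustrated with a short Python sketch. It assumes the partition-to-shard mapping is partitionId mod shardCount, as this page describes for SqliteWAL; the function name and the shape of the batch (a list of partition id and entry list pairs) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_shard(
    batch: List[Tuple[int, List[str]]], shard_count: int
) -> Dict[int, Dict[int, List[str]]]:
    """Group a cross-partition batch by shard, merging same-partition entries.

    Each key of the result stands for one SQLite transaction.
    """
    shards: Dict[int, Dict[int, List[str]]] = defaultdict(dict)
    for partition_id, entries in batch:
        shard = partition_id % shard_count
        # Entries for the same partition are merged in arrival order.
        shards[shard].setdefault(partition_id, []).extend(entries)
    return dict(shards)
```

For example, a batch touching partitions 0, 1, 4 and 5 with four shards lands in shards 0 and 1 only, so it would cost two transactions rather than four.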
If one shard write fails, the scheduler reports the group status as errored to all operations in that group. Retries are safe because WAL writes are idempotent.
SQLite Shard Batching
SqliteWAL stores logs in files named like raft_shard{shardId}_{revision}.db. A partition maps to a shard with partitionId mod shardCount.
The shard count controls the main SQLite batching tradeoff:
| Choice | Effect |
|---|---|
| Lower shardCount | More partitions share a shard, so cross-partition scheduler batches collapse into fewer SQLite transactions and fsyncs. |
| Higher shardCount | More shard files can operate independently, which can improve concurrency when many partitions are active at the same time. |
For a new data directory, new SqliteWAL(path, revision, logger, syncWrites, shardCount) seeds the shard count. A shardCount of 0 uses Environment.ProcessorCount. After the directory is initialized, the resolved shard count is persisted in metadata and reused on later opens. Passing a different non-zero value for an existing directory fails fast because it would route existing partitions to different shard files.
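The shard count rules above reduce to a small decision function. The Python sketch below models them under stated assumptions: `persisted` stands for the value stored in the directory's metadata (None for a new directory), `os.cpu_count()` stands in for Environment.ProcessorCount, and the function name and error type are hypothetical.

```python
import os
from typing import Optional


def resolve_shard_count(
    requested: int, persisted: Optional[int], cpu_count: Optional[int] = None
) -> int:
    """Return the shard count to use when opening a data directory."""
    if persisted is not None:
        # Existing directory: the persisted value wins, and a conflicting
        # non-zero request is rejected to avoid re-routing partitions.
        if requested not in (0, persisted):
            raise ValueError(
                f"shardCount {requested} conflicts with persisted value {persisted}"
            )
        return persisted
    if requested == 0:
        # New directory with no explicit choice: fall back to the CPU count.
        return cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return requested
```

Failing on a mismatch matters because partitionId mod shardCount changes with the count, so existing logs would no longer be found in the shard file the new mapping points to.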
Follower Append Path
Followers receive append-log messages from the leader. The state machine validates leadership and term expectations, then asks RaftWriteAhead to propose, commit, or roll back entries as needed.
Committed follower entries are applied to the application callback after the WAL write succeeds.
WAL Completion Fencing
WAL completions are not trusted blindly. Completion messages are checked against pending operations and log ranges. Unknown, stale, superseded, or malformed completions are discarded.
This protects the partition from acting on a storage completion that belongs to an older term, an already-processed operation, or an invalid log range.
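A fencing check of this kind can be sketched as a lookup against the table of pending operations. The Python model below is illustrative only: the operation id, the `PendingWrite` and `Completion` records, and the specific comparison rules are assumptions about how such a check could work, not Kommander's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PendingWrite:
    term: int
    first_log_id: int
    last_log_id: int


@dataclass(frozen=True)
class Completion:
    operation_id: int
    term: int
    first_log_id: int
    last_log_id: int


class CompletionFence:
    def __init__(self, current_term: int) -> None:
        self.current_term = current_term
        self.pending: Dict[int, PendingWrite] = {}

    def track(self, operation_id: int, write: PendingWrite) -> None:
        self.pending[operation_id] = write

    def accept(self, completion: Completion) -> bool:
        """Return True only for a completion that matches a pending write."""
        if completion.first_log_id > completion.last_log_id:
            return False  # malformed log range
        expected = self.pending.get(completion.operation_id)
        if expected is None:
            return False  # unknown or already processed
        if completion.term != expected.term or completion.term < self.current_term:
            return False  # stale term
        if (completion.first_log_id, completion.last_log_id) != (
            expected.first_log_id,
            expected.last_log_id,
        ):
            return False  # range does not match what was submitted
        # Consume the pending entry so a duplicate completion is discarded.
        del self.pending[completion.operation_id]
        return True
```

Removing the pending entry on acceptance is what turns a repeated completion into an "unknown" one on its second arrival.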