Sharding Design
Sharding lets LoomCache run multiple independent Raft groups in one cluster. It is opt-in; the default deployment keeps the simpler single-group, full-replication model.
Production support status: unsupported/fail-closed. Do not enable sharding for production workloads until every group has independent WAL, Raft metadata, state-machine snapshot, install-snapshot, and restart recovery evidence. Public docs that describe routing, rebalance, SQL scatter, or cross-group transactions are design and certification notes, not a bank-production support claim.
Components
Section titled “Components”PartitionRoutermaps key hash to routing partition to Raft group.RaftGroupManagerstarts and tracks per-group Raft nodes.ShardedDataStructureRegistryisolates state-machine data by group.PartitionRebalancer,PartitionAssignment,PartitionMove, andRebalancePlancompute ownership changes for the local/in-memory certification path. They are not a production consensus-backed ownership commit.PartitionMigrationPipeline,TcpMigrationDataSender, and migration ACKs stream slot data during certification rebalances. The ACK confirms receipt/apply in the target path, not target Raft/WAL durability.CrossGroupQueryExecutorandCrossGroupTransactionExecutorhandle multi-group reads and writes.
Routing path
Section titled “Routing path”CacheNodeinspects the operation key or map name.PartitionRouterhashes the key to a routing partition and group.- The message is submitted to the target group’s Raft leader.
- That group’s applier mutates only its registry slice.
Rebalance path
Section titled “Rebalance path”- Membership or configuration changes produce a rebalance plan.
- The current certification implementation applies the plan locally; the production path must remain fail-closed until the plan and ownership cutover are committed through consensus.
- Source owners stream chunks to targets.
- Targets acknowledge chunks and deduplicate replay by
(sourceNodeId, slotId), but that ACK is not a durable target Raft/WAL commit proof. - Ownership moves after transfer completes only in the certification path.
Invariants
Section titled “Invariants”- A key maps to exactly one owning Raft group at a time.
- A group applier writes only to its own registry slice.
- Cross-group atomic operations must use 2PC.
- Target-side chunk replay must be idempotent.
Failure behavior
Section titled “Failure behavior”In development sharded mode, if a group loses majority, only keys owned by that group stop committing. Cross-group queries can return partial failures when one group is unavailable. Source-crash recovery during migration and durable per-group restart remain implementation gaps, so production must fail closed before accepting sharded traffic or migration ACKs.
Verification
Section titled “Verification”Multi-group cluster behavior, partition routing, cross-group SQL, cross-group transactions, migration chunk transfer, and rebalance tests cover this layer as certification evidence. They are not enough for production without fsync-enabled per-group recovery, snapshot, install-snapshot, and full-cluster restart tests. Operators watch per-group leader state, group commit latency, migration progress, chunk ACK failures, and group-specific error rates.