Sharding Design

Sharding lets LoomCache run multiple independent Raft groups in one cluster. It is opt-in; the default deployment keeps the simpler single-group, full-replication model.

Production support status: unsupported/fail-closed. Do not enable sharding for production workloads until every group has independent WAL, Raft metadata, state-machine snapshot, install-snapshot, and restart recovery evidence. Public docs that describe routing, rebalance, SQL scatter, or cross-group transactions are design and certification notes, not a bank-production support claim.
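
As a rough illustration of what "fail-closed" means here, the sketch below refuses sharding in production unless every group reports all five kinds of evidence listed above. The evidence names and the functions `missing_evidence` and `sharding_allowed` are hypothetical and are not LoomCache configuration keys or APIs.

```python
# Hypothetical evidence names taken from the list above; not LoomCache config keys.
REQUIRED_EVIDENCE = frozenset({
    "wal",
    "raft_metadata",
    "state_machine_snapshot",
    "install_snapshot",
    "restart_recovery",
})


def missing_evidence(groups: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per group, the evidence items that are still missing."""
    return {
        group_id: set(REQUIRED_EVIDENCE - evidence)
        for group_id, evidence in groups.items()
        if not REQUIRED_EVIDENCE <= evidence
    }


def sharding_allowed(production: bool, groups: dict[str, set[str]]) -> bool:
    """Fail closed: in production, allow sharding only if no group lacks evidence."""
    if not production:
        return True
    # An empty group map proves nothing, so it is rejected as well.
    return bool(groups) and not missing_evidence(groups)
```

The default answer in production is "no": a missing group map, or a single group without one evidence item, is enough to keep sharding disabled.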

Core components:

  • PartitionRouter maps a key's hash to a routing partition, and that partition to a Raft group.
  • RaftGroupManager starts and tracks per-group Raft nodes.
  • ShardedDataStructureRegistry isolates state-machine data by group.
  • PartitionRebalancer, PartitionAssignment, PartitionMove, and RebalancePlan compute ownership changes for the local/in-memory certification path. They are not a production consensus-backed ownership commit.
  • PartitionMigrationPipeline, TcpMigrationDataSender, and migration ACKs stream slot data during certification rebalances. The ACK confirms receipt/apply in the target path, not target Raft/WAL durability.
  • CrossGroupQueryExecutor and CrossGroupTransactionExecutor handle multi-group reads and writes.

Request routing:

  1. CacheNode inspects the operation key or map name.
  2. PartitionRouter hashes the key to a routing partition and group.
  3. The message is submitted to the target group’s Raft leader.
  4. That group’s applier mutates only its registry slice.
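
The routing steps above can be sketched as a two-level lookup: key hash to partition, then partition to owning group. This is a minimal sketch under stated assumptions: the class `PartitionTable`, the helper `round_robin_table`, the use of CRC32, and the round-robin layout are all illustrative and say nothing about the hash function or assignment LoomCache actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionTable:
    """Hypothetical routing table: partition index -> Raft group id."""
    owners: tuple[int, ...]

    def partition_of(self, key: str) -> int:
        # CRC32 is chosen only because it is stable across processes;
        # the real hash function is not specified in this document.
        return zlib.crc32(key.encode("utf-8")) % len(self.owners)

    def group_of(self, key: str) -> int:
        # Each partition has exactly one owner, so each key maps to one group.
        return self.owners[self.partition_of(key)]


def round_robin_table(partitions: int, groups: int) -> PartitionTable:
    """Assign partitions to groups round-robin (an illustrative layout)."""
    return PartitionTable(tuple(p % groups for p in range(partitions)))
```

For example, `round_robin_table(partitions=16, groups=3).group_of("user:42")` always returns the same group id for that key, which is the property the request path depends on.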

Rebalance flow:

  1. Membership or configuration changes produce a rebalance plan.
  2. The current certification implementation applies the plan locally; the production path must remain fail-closed until the plan and ownership cutover are committed through consensus.
  3. Source owners stream chunks to targets.
  4. Targets acknowledge chunks and deduplicate replay by (sourceNodeId, slotId), but that ACK is not a durable target Raft/WAL commit proof.
  5. In the certification path only, ownership moves after the transfer completes.
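
Target-side deduplication from step 4 might look like the sketch below. It assumes, for illustration only, that one chunk carries a whole slot, so that the pair (source node, slot) identifies a replay. `Chunk` and `MigrationTarget` are hypothetical names, and all state lives in memory, which is exactly why the ACK cannot stand in for a durable commit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    source_node_id: str
    slot_id: int
    entries: tuple[tuple[str, str], ...]


@dataclass
class MigrationTarget:
    """Hypothetical target-side receiver; state is held in memory only."""
    data: dict[str, str] = field(default_factory=dict)
    applied: set[tuple[str, int]] = field(default_factory=set)

    def receive(self, chunk: Chunk) -> bool:
        """Apply a chunk at most once and return an ACK.

        The ACK means "received and applied in this process". It says
        nothing about Raft commit or WAL durability on the target.
        """
        key = (chunk.source_node_id, chunk.slot_id)
        if key not in self.applied:
            self.data.update(chunk.entries)
            self.applied.add(key)
        # A replayed chunk is acknowledged again without being re-applied.
        return True
```

Because a replay is acknowledged but skipped, a source that retries after a lost ACK cannot overwrite newer writes on the target.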

Invariants:

  • A key maps to exactly one owning Raft group at a time.
  • A group applier writes only to its own registry slice.
  • Cross-group atomic operations must use 2PC.
  • Target-side chunk replay must be idempotent.
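
To make the 2PC invariant concrete, here is a minimal in-memory sketch of a coordinator that commits only if every participating group prepares successfully. `GroupParticipant` and `two_phase_commit` are hypothetical names; a real coordinator would also need a durable decision log, timeouts, and recovery, none of which is shown.

```python
from dataclasses import dataclass, field


@dataclass
class GroupParticipant:
    """Hypothetical stand-in for one Raft group taking part in 2PC."""
    name: str
    available: bool = True
    data: dict[str, str] = field(default_factory=dict)
    staged: dict[str, dict[str, str]] = field(default_factory=dict)

    def prepare(self, txn_id: str, writes: dict[str, str]) -> bool:
        if not self.available:
            return False  # vote "no"
        self.staged[txn_id] = dict(writes)
        return True  # vote "yes"

    def commit(self, txn_id: str) -> None:
        self.data.update(self.staged.pop(txn_id, {}))

    def abort(self, txn_id: str) -> None:
        self.staged.pop(txn_id, None)


def two_phase_commit(
    txn_id: str,
    writes_by_group: list[tuple[GroupParticipant, dict[str, str]]],
) -> bool:
    """Phase 1: prepare on every group. Phase 2: commit all, or abort all."""
    prepared: list[GroupParticipant] = []
    for participant, writes in writes_by_group:
        if participant.prepare(txn_id, writes):
            prepared.append(participant)
        else:
            # One "no" vote aborts the groups that had already prepared.
            for p in prepared:
                p.abort(txn_id)
            return False
    for p in prepared:
        p.commit(txn_id)
    return True
```

If any group is unavailable during prepare, no group's data changes, which is the all-or-nothing behavior the invariant asks for.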

In development sharded mode, if a group loses majority, only keys owned by that group stop committing. Cross-group queries can return partial failures when one group is unavailable. Source-crash recovery during migration and durable per-group restart remain implementation gaps, so production must fail closed before accepting sharded traffic or migration ACKs.
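
A partial failure in a cross-group query can be pictured with the scatter-gather sketch below: rows from healthy groups are returned, and unavailable groups are reported separately instead of failing the whole query. `GroupUnavailable`, `ScatterResult`, and `scatter_query` are hypothetical names, and each group is modeled as a plain callable rather than a real group client.

```python
from dataclasses import dataclass, field
from typing import Callable


class GroupUnavailable(Exception):
    """Raised by a hypothetical group client when the group has no majority."""


@dataclass
class ScatterResult:
    rows: list[str] = field(default_factory=list)
    failed_groups: dict[int, str] = field(default_factory=dict)

    @property
    def partial(self) -> bool:
        # True when at least one group could not answer.
        return bool(self.failed_groups)


def scatter_query(groups: dict[int, Callable[[], list[str]]]) -> ScatterResult:
    """Run one sub-query per group and collect rows plus per-group failures."""
    result = ScatterResult()
    for group_id, run in sorted(groups.items()):
        try:
            result.rows.extend(run())
        except GroupUnavailable as exc:
            result.failed_groups[group_id] = str(exc)
    return result
```

Callers should check `partial` before treating the rows as a complete answer.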

Tests for multi-group cluster behavior, partition routing, cross-group SQL, cross-group transactions, migration chunk transfer, and rebalance cover this layer as certification evidence. They are not sufficient for production without fsync-enabled per-group recovery, snapshot, install-snapshot, and full-cluster restart tests.

Operators should watch per-group leader state, group commit latency, migration progress, chunk ACK failures, and group-specific error rates.