Sharding Design

Sharding lets LoomCache run multiple independent Raft groups in one cluster. It is opt-in; the default deployment keeps the simpler single-group, full-replication model. This page covers routing, rebalance, migration, and cross-group operations.

Production support status: unsupported. Sharding fails closed for general production traffic. Do not enable sharding for production workloads until every group has independent WAL, Raft metadata, state-machine snapshot, install-snapshot, restart recovery evidence, durable migration chunk ACKs, and consensus-backed ownership cutover. Public docs that describe routing, rebalance, SQL scatter, or cross-group transactions are design and validation notes, not a public production support claim.

Components

Routing maps each key hash to a routing partition, and each routing partition to a Raft group.
Group management starts and tracks per-group Raft nodes.
Sharded state-machine storage isolates data by group.
Ownership planning computes changes for the local/in-memory validation path. It is not a production consensus-backed ownership commit.
Migration streaming and ACKs move slot data during validation rebalances. In durable-chunk mode, incoming chunks are committed through the target Raft/WAL path before ACK, and routing cutover remains raft-0 leader gated. General production sharding still requires the explicit sharding release gate and full per-group recovery evidence.
Cross-group execution paths handle multi-group reads and writes.
Cross-group atomic batches use a durable raft-0 two-phase coordinator with configurable timeouts and idempotency retention.
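
As a rough illustration of what ownership planning computes, the sketch below diffs two partition-to-group assignments and lists the partitions that would move when the group count changes. The modulo assignment, the partition count of 12, and the names `assign` and `plan_moves` are assumptions made for this example; they are not taken from LoomCache's planner.

```python
from typing import Dict, List, Tuple


def assign(partition_count: int, group_count: int) -> Dict[int, int]:
    """Hypothetical assignment: partition p is owned by group p % group_count."""
    if partition_count <= 0 or group_count <= 0:
        raise ValueError("partition_count and group_count must be positive")
    return {p: p % group_count for p in range(partition_count)}


def plan_moves(old: Dict[int, int], new: Dict[int, int]) -> List[Tuple[int, int, int]]:
    """Return (partition, source_group, target_group) for every ownership change."""
    if old.keys() != new.keys():
        raise ValueError("both assignments must cover the same partitions")
    return [(p, old[p], new[p]) for p in sorted(old) if old[p] != new[p]]


# Scaling from 2 groups to 3 groups over 12 routing partitions.
old = assign(12, 2)
new = assign(12, 3)
moves = plan_moves(old, new)
for partition, source, target in moves:
    print(f"partition {partition}: group {source} -> group {target}")
```

A plan like this is only a list of intended changes. In the validation path it is applied locally, which is why it does not count as a consensus-backed ownership commit.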

Routing path

The node inspects the operation key or map name.
Routing hashes the key to a routing partition and group.
The message is submitted to the target group’s Raft leader.
That group’s applier mutates only its registry slice.
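
The first two steps above can be sketched as a two-level lookup. This is a minimal illustration, assuming a SHA-256 based hash, 16 routing partitions, and a list-based partition table; the actual hash function, partition count, and table representation are not specified on this page, and the names `partition_for` and `route` are hypothetical.

```python
import hashlib
from typing import List, Tuple


def partition_for(key: str, partition_count: int) -> int:
    """Hash a key to a routing partition (assumed hash: first 8 bytes of SHA-256)."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(key: str, partition_to_group: List[int]) -> Tuple[int, int]:
    """Return (partition, group) for a key; the table index is the partition id."""
    partition = partition_for(key, len(partition_to_group))
    return partition, partition_to_group[partition]


# Hypothetical table: 16 routing partitions spread over 3 Raft groups.
table = [p % 3 for p in range(16)]
partition, group = route("user:42", table)
print(f"key 'user:42' -> partition {partition} -> group {group}")
```

Keeping the partition layer between keys and groups means a rebalance only rewrites entries in the partition table; the key-to-partition hash never changes.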

Rebalance path

Figure (animation): Partition Migration (Gated Sharding Path). Illustrative rebalance sequence for sharding validation; production dynamic scaling remains release-gated. Initial idle state: Group 1 holds Part 1 and Part 2, Group 2 holds Part 3 and Part 4, and Group 3 holds Part 5 and Part 6.

A group-count configuration change (operator-invoked group scaling) produces a rebalance plan. Node membership changes do not trigger a rebalance — sharded groups are fixed and co-located on every node.
The validation path applies the plan locally. Production sharded rebalances, which require an explicit release gate, keep routing cutover behind the raft-0 leader path and the separate production sharding gate.
Source owners stream chunks to targets.
In durable-chunk mode, targets acknowledge chunks only after committing incoming data through the target Raft/WAL path; replay is deduplicated by (sourceNodeId, slotId, migrationEpoch).
Ownership moves after transfer completes: locally in validation runs, and through raft-0 cutover only for an explicitly release-validated sharded production release.
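
The commit-before-ACK rule and the replay deduplication described above can be sketched as follows. This is a simplified in-memory model under stated assumptions: the `commit` callable stands in for the target Raft/WAL path, each (sourceNodeId, slotId, migrationEpoch) triple is treated as a single chunk, and the class name `MigrationTarget` is hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Dedup key taken from the text: (sourceNodeId, slotId, migrationEpoch).
ChunkKey = Tuple[str, int, int]


class MigrationTarget:
    def __init__(self, commit: Callable[[ChunkKey, Dict[str, str]], bool]) -> None:
        # `commit` is a stand-in for the target Raft/WAL path; it returns
        # True only when the chunk has been durably committed.
        self._commit = commit
        # A real target would need this set to be durable as well; it is
        # kept in memory here to keep the sketch short.
        self._applied: Set[ChunkKey] = set()

    def receive(self, source_node_id: str, slot_id: int, migration_epoch: int,
                entries: Dict[str, str]) -> bool:
        """Return True (ACK) only after the chunk is committed or already applied."""
        key = (source_node_id, slot_id, migration_epoch)
        if key in self._applied:
            return True  # replayed chunk: ACK again without re-applying
        if not self._commit(key, entries):
            return False  # commit failed: withhold the ACK so the source retries
        self._applied.add(key)
        return True


# Usage: an append-only list plays the role of the target's log.
log: List[ChunkKey] = []


def commit_to_log(key: ChunkKey, entries: Dict[str, str]) -> bool:
    log.append(key)
    return True


target = MigrationTarget(commit_to_log)
first_ack = target.receive("node-a", 7, 3, {"k": "v"})
replay_ack = target.receive("node-a", 7, 3, {"k": "v"})
print(first_ack, replay_ack, len(log))
```

Because a replay is acknowledged without a second commit, a source that retries after a lost ACK cannot apply the same chunk twice.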

Invariants

A key maps to exactly one owning Raft group at a time.
A group applier writes only to its own registry slice.
Cross-group atomic operations must use 2PC.
Target-side chunk replay must be idempotent.
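
To make the 2PC invariant concrete, the sketch below shows the basic two-phase shape: every participating group must vote yes in the prepare phase before any group commits. It is a deliberately minimal in-memory model; the durable raft-0 coordinator, timeouts, and idempotency retention mentioned under Components are omitted, and the names `GroupParticipant` and `run_atomic_batch` are hypothetical.

```python
from typing import Dict, List, Tuple


class GroupParticipant:
    """In-memory stand-in for one Raft group taking part in a batch."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.data: Dict[str, str] = {}
        self._staged: Dict[str, Dict[str, str]] = {}

    def prepare(self, batch_id: str, writes: Dict[str, str]) -> bool:
        if not self.healthy:
            return False  # vote no
        self._staged[batch_id] = dict(writes)
        return True  # vote yes; writes are staged but not yet visible

    def commit(self, batch_id: str) -> None:
        self.data.update(self._staged.pop(batch_id, {}))

    def abort(self, batch_id: str) -> None:
        self._staged.pop(batch_id, None)


def run_atomic_batch(batch_id: str,
                     plan: List[Tuple[GroupParticipant, Dict[str, str]]]) -> bool:
    """Phase 1: prepare on every group. Phase 2: commit all, or abort all."""
    prepared: List[GroupParticipant] = []
    for participant, writes in plan:
        if participant.prepare(batch_id, writes):
            prepared.append(participant)
        else:
            for p in prepared:
                p.abort(batch_id)
            return False
    for p in prepared:
        p.commit(batch_id)
    return True


g1, g2 = GroupParticipant("group-1"), GroupParticipant("group-2")
ok = run_atomic_batch("batch-1", [(g1, {"a": "1"}), (g2, {"b": "2"})])
print(ok, g1.data, g2.data)
```

If any group votes no, the staged writes on the other groups are discarded, so no group ever exposes half of a batch.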

Failure behavior

In development sharded mode, if a group loses majority, only keys owned by that group stop committing. Cross-group queries can return partial failures when one group is unavailable. Source-crash recovery during migration has validation coverage, but durable per-group restart and consensus ownership cutover remain production gates. Production must therefore fail closed before accepting sharded traffic unless a release explicitly enables that surface.
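
The partial-failure behavior for cross-group queries can be pictured as a scatter-gather that records which groups failed instead of failing the whole query. This is an illustrative sketch only; `ScatterResult`, `GroupUnavailable`, and `scatter_query` are hypothetical names, and each group is modeled as a plain callable rather than a Raft group.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class GroupUnavailable(Exception):
    """Raised by a group that cannot serve the query, e.g. after losing majority."""


@dataclass
class ScatterResult:
    rows: List[str] = field(default_factory=list)
    failed_groups: List[int] = field(default_factory=list)

    @property
    def partial(self) -> bool:
        return bool(self.failed_groups)


def scatter_query(groups: Dict[int, Callable[[], List[str]]]) -> ScatterResult:
    """Query every group; collect rows from healthy groups and note failures."""
    result = ScatterResult()
    for group_id in sorted(groups):
        try:
            result.rows.extend(groups[group_id]())
        except GroupUnavailable:
            result.failed_groups.append(group_id)
    return result


def lost_majority() -> List[str]:
    raise GroupUnavailable("group has no leader")


result = scatter_query({
    1: lambda: ["row-a"],
    2: lost_majority,
    3: lambda: ["row-b"],
})
print(result.rows, result.failed_groups, result.partial)
```

Reporting the failed groups explicitly lets a caller decide whether a partial answer is acceptable or whether the query should be retried.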

Verification

Multi-group cluster behavior, partition routing, cross-group SQL, cross-group transactions, migration chunk transfer, and rebalance tests cover this layer as validation evidence. They are not enough for production without fsync-enabled per-group recovery, snapshot, install-snapshot, and full-cluster restart tests. Operators watch per-group leader state, group commit latency, migration progress, chunk ACK failures, and group-specific error rates.

Related

System Architecture & Design: the single-group default model that sharding extends.
Transactions Design: cross-group two-phase commit in detail.

LoomCache is an independent open-source project. It is not affiliated with, endorsed by, or sponsored by Hazelcast, Inc. or by any other company whose products are named in this documentation. “Hazelcast” is a trademark of Hazelcast, Inc.; references to it are nominative and describe only migration and comparison. All other product and company names are trademarks of their respective owners and are used for identification purposes only.
