Persistence & Durability
LoomCache combines a write-ahead log with periodic state-machine snapshots for durability. Implementation
lives in loom-server/src/main/java/com/loomcache/server/persistence.
Persistence is enabled by default (dataDir = ./data/wal). Set dataDir = null on ClusterConfig (or disable
loomcache.server.persistence.enabled in Spring Boot) to run in-memory — only useful for tests.
Commit sequence
1. Client sends a mutating command (e.g. MAP_PUT).
2. Leader appends the message as a Raft LogEntry, sends AppendEntries to followers.
3. WalWriter fsyncs the entry with CRC32 and updates the sidecar .checksum file.
4. Once a majority of followers acknowledges, the leader advances commitIndex.
5. The state-machine applier decodes the command and applies it via DataOperationHandler.
6. Response is returned to the client.
The WAL fsync order ensures a crash after step 2 but before step 3 is indistinguishable from the command never arriving.
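The append-then-fsync step can be illustrated with a minimal sketch. It is not LoomCache's WalWriter: the class name, the record layout ([length][payload][crc32]), and the helper methods are assumptions made for illustration. It only shows the general technique of forcing a checksummed record to disk before the write is treated as durable.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

/** Hypothetical sketch of a durable append; the record layout is assumed, not LoomCache's. */
final class WalAppendSketch {

    /** Appends [length][payload][crc32] and forces it to disk before returning. */
    static void appendDurably(Path segment, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);

        ByteBuffer record = ByteBuffer.allocate(Integer.BYTES + payload.length + Long.BYTES);
        record.putInt(payload.length).put(payload).putLong(crc.getValue());
        record.flip();

        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            while (record.hasRemaining()) {
                ch.write(record);
            }
            // fsync: the entry counts as durable only after this call returns.
            ch.force(true);
        }
    }

    /** Reads the first record back and reports whether its stored CRC32 matches the payload. */
    static boolean firstRecordIsValid(Path segment) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(segment));
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == buf.getLong();
    }
}
```

A production writer would keep the channel open across appends and batch fsyncs; opening the file per call keeps the sketch self-contained.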
WAL files
- Per-node directory at dataDir.
- Rotating append-only segments (WalWriter) with a trailing CRC32.
- Each segment has a sidecar .checksum file validated on boot; corruption is surfaced as an IllegalStateException during node startup.
- WalReader replays segments in order.
- WalCompactor truncates segments once the last snapshot index has advanced.
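As a rough illustration of boot-time sidecar validation, the sketch below recomputes a segment's CRC32 and compares it with the value recorded next to it. The sidecar naming (segment name plus .checksum) and its content (the CRC32 as hex text) are assumptions; the real on-disk format may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

/** Hypothetical sketch; sidecar name and format are assumed for illustration. */
final class SidecarCheckSketch {

    /** Assumed naming: segment "x" has a sidecar "x.checksum" in the same directory. */
    static Path sidecarOf(Path segment) {
        return segment.resolveSibling(segment.getFileName() + ".checksum");
    }

    static long crcOf(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file));
        return crc.getValue();
    }

    /** Records the segment's current CRC32 as hex text in its sidecar. */
    static void writeSidecar(Path segment) throws IOException {
        String hex = Long.toHexString(crcOf(segment));
        Files.write(sidecarOf(segment), hex.getBytes(StandardCharsets.UTF_8));
    }

    /** Fails closed: a mismatch aborts startup instead of replaying suspect data. */
    static void validateOnBoot(Path segment) throws IOException {
        String recorded = new String(Files.readAllBytes(sidecarOf(segment)), StandardCharsets.UTF_8).trim();
        long expected = Long.parseLong(recorded, 16);
        long actual = crcOf(segment);
        if (expected != actual) {
            throw new IllegalStateException("WAL segment " + segment + " failed checksum validation");
        }
    }
}
```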
Snapshots
- SnapshotScheduler triggers snapshots based on loomcache.server.persistence.snapshot-threshold (default 10 000 committed entries).
- StateMachineSnapshotManager walks the DataStructureRegistry and streams a serialized state to SnapshotStore.
- DeltaSnapshot + SnapshotChain support incremental snapshots across time.
- Every snapshot carries a CRC32 checksum; corrupt snapshots are rejected on load.
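A threshold-based trigger can be sketched as a counter over committed indexes. This is an assumption about how such a scheduler might count, not a description of SnapshotScheduler; the class and method names are hypothetical.

```java
/** Hypothetical sketch of a threshold-based snapshot trigger. */
final class SnapshotThresholdSketch {
    private final long threshold;
    private long lastSnapshotIndex;

    SnapshotThresholdSketch(long threshold) {
        this.threshold = threshold;
    }

    /** Returns true when at least `threshold` entries were committed since the last snapshot. */
    boolean onCommit(long commitIndex) {
        if (commitIndex - lastSnapshotIndex >= threshold) {
            lastSnapshotIndex = commitIndex; // the next window starts at this snapshot
            return true;
        }
        return false;
    }
}
```

With a threshold of 10 000, a commit at index 9 999 does not trigger a snapshot and a commit at index 10 000 does. A lower threshold shortens WAL replay at the cost of more snapshot IO.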
Recovery
On CacheNode.start:
- Validate the configured WAL directory is writable.
- Load the most recent valid snapshot (SnapshotStore).
- Replay the WAL from snapshotIndex + 1.
- Reconstruct RaftMetadataStore term / vote / commit index; reject mismatches.
- Only after the above succeeds does the node begin to accept traffic.
Failure to replay — WAL corruption, snapshot mismatch, metadata inconsistency — aborts startup with a diagnostic.
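The snapshot-then-replay order can be sketched with an in-memory model. The Entry type, the map-based state, and the gap check are hypothetical and stand in for the real snapshot and WAL formats; the sketch only shows replay starting at snapshotIndex + 1 and failing closed on an inconsistent log.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of snapshot load followed by WAL replay. */
final class RecoverySketch {

    /** Assumed WAL entry shape: a put of key -> value at a given log index. */
    static final class Entry {
        final long index;
        final String key;
        final String value;

        Entry(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    static Map<String, String> recover(Map<String, String> snapshotState,
                                       long snapshotIndex,
                                       List<Entry> wal) {
        Map<String, String> state = new LinkedHashMap<>(snapshotState);
        long expected = snapshotIndex + 1;
        for (Entry e : wal) {
            if (e.index <= snapshotIndex) {
                continue; // already reflected in the snapshot
            }
            if (e.index != expected) {
                // Abort instead of serving traffic from a log with holes.
                throw new IllegalStateException(
                        "WAL gap: expected index " + expected + " but found " + e.index);
            }
            state.put(e.key, e.value);
            expected++;
        }
        return state;
    }
}
```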
Install snapshot
Followers that fall too far behind receive RAFT_INSTALL_SNAPSHOT from the leader. RaftNode applies the snapshot,
then resumes normal replication. The install path is itself Jepsen-tested via CrashDuringCompactionTest and
FullDiskTest.
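One common way a Raft leader decides that a follower has fallen "too far behind" is to check whether the entries the follower needs were already compacted away. The sketch below shows that rule; whether RaftNode uses exactly this criterion is an assumption, and all names are hypothetical.

```java
/** Hypothetical sketch of the leader-side choice between log replication and snapshot install. */
final class InstallSnapshotDecisionSketch {

    enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

    /**
     * @param followerNextIndex  next log index the leader has to send to the follower
     * @param firstRetainedIndex oldest index still present in the leader's WAL after compaction
     */
    static Action choose(long followerNextIndex, long firstRetainedIndex) {
        // If the needed entry was truncated, only a snapshot can bring the follower up to date.
        return followerNextIndex < firstRetainedIndex
                ? Action.INSTALL_SNAPSHOT
                : Action.APPEND_ENTRIES;
    }
}
```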
Tuning
| Knob | Default | Notes |
|---|---|---|
| dataDir (ClusterConfig) | ./data/wal | Put on SSD/NVMe. |
| loomcache.server.persistence.snapshot-threshold | 10_000 | Lower = more frequent snapshots, more IO. |
| loomcache.server.persistence.wal-directory | ./data/wal | Distinct per node. |
| loomcache.persistence.backup-dir | unset | Separate filesystem/object-store mount for Hot Backup. |
| loomcache.persistence.hot-backup-interval-seconds | 0 | 0 disables periodic Hot Backup. |
| idempotencyTtlMs | 60_000 | Raise for long-lived clients that retry older requests. |
Optional JPA write-through
loom-spring-boot can back DistributedMap with a JPA repository (see CacheEntry,
CacheEntryRepository, CacheNodeConfig). This is useful when cached data must survive a full cluster wipe, because the
external relational store keeps a copy. It is independent of the Raft WAL and runs in the Spring module only.
Backup and restore runbook
Use this procedure when a machine dies, a disk is being replaced, or an operator needs to rehearse disaster
recovery. It assumes persistence is enabled and every member has a stable nodeId.
Backup types
| Backup type | What it captures | Restore use |
|---|---|---|
| Filesystem copy of dataDir | WAL segments, sidecar checksum files, Raft metadata, and local snapshots. | Preferred for replacing a single failed machine with the same nodeId. |
| Hot Backup (persistence.backup-dir) | Operator-triggered group snapshots plus a manifest under a separate backup directory. | Point-in-time snapshot retention and off-host archival. Keep alongside full dataDir backups until a dedicated restore command is introduced. |
Store backups off the live disk. A local copy on the same volume does not protect against machine or volume loss.
Routine backup
- Configure each member with a persistent dataDir.
- Configure loomcache.persistence.backup-dir on a different disk or mounted object-store path.
- Set loomcache.persistence.hot-backup-interval-seconds for periodic Hot Backup, or trigger CacheNode.triggerHotBackup() from an operator hook.
- Keep regular filesystem backups of the full dataDir tree, including WAL segment sidecar files and snapshot files.
- After every backup job, verify that the copied tree contains the node’s WAL directory, snapshots/, and checksum sidecars. For Hot Backup, verify the newest hot-backup-<nodeId>-<timestamp>.properties manifest and every referenced group snapshot file.
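The post-backup verification in the last step could be automated along these lines. This is a sketch under assumptions: it expects a snapshots/ directory directly under the copied tree and assumes each sidecar is named after its segment with a .checksum suffix. Adjust both to the actual layout of your dataDir.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch of a structural check on a copied dataDir tree. */
final class BackupTreeCheckSketch {

    private static final String SIDECAR_SUFFIX = ".checksum";

    /** Returns human-readable problems; an empty list means the copy passed these checks. */
    static List<String> verify(Path backupRoot) throws IOException {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(backupRoot.resolve("snapshots"))) {
            problems.add("missing snapshots/ directory");
        }
        boolean sawSidecar = false;
        try (Stream<Path> files = Files.walk(backupRoot)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                if (!name.endsWith(SIDECAR_SUFFIX)) {
                    continue;
                }
                sawSidecar = true;
                // Assumed naming: "<segment>.checksum" sits next to "<segment>".
                String segmentName = name.substring(0, name.length() - SIDECAR_SUFFIX.length());
                if (!Files.isRegularFile(p.resolveSibling(segmentName))) {
                    problems.add("sidecar without segment: " + p);
                }
            }
        }
        if (!sawSidecar) {
            problems.add("no .checksum sidecars found");
        }
        return problems;
    }
}
```

The check is structural only. It does not recompute checksums, so it complements rather than replaces the validation the node performs on startup.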
Replace one dead machine when quorum survived
Use this path when the cluster still has a majority and is serving traffic.
- Confirm the failed member is absent from /api/cluster/status or the management dashboard.
- Provision a replacement host with the same LoomCache version, port configuration, TLS/security material, and nodeId as the failed member.
- Restore the failed member’s most recent dataDir backup to the same configured path on the replacement host.
- Preserve ownership and permissions for the process user.
- Start the replacement member.
- Watch startup logs for snapshot load, WAL replay, and Raft metadata validation. Startup must fail closed on corrupt WAL/snapshot data; do not delete files to make the node boot unless you intentionally choose partial recovery.
- Wait for the member to appear in cluster status and for partition migration/backup-promotion metrics to settle.
- Run read checks for critical maps and a low-risk write/read smoke test.
If no dataDir backup exists but the cluster still has quorum, start a clean replacement with the intended nodeId
only after confirming the remaining members own the latest committed data. The cluster will rebuild ownership through
normal membership and migration flows; the lost node-local WAL cannot be reconstructed on disk.
Restore after quorum loss
Use this path when too many machines died for the remaining cluster to make progress.
- Stop every surviving member to prevent mixed old/new recovery attempts.
- Pick a consistent backup point across enough members to form a majority. Prefer backups taken from the same maintenance window.
- Restore each selected member’s full dataDir backup to a replacement host with the same nodeId.
- Start only the restored majority first.
- Verify Raft startup, cluster state, and data smoke tests before admitting extra clean members.
- After the restored majority is healthy, add remaining members one at a time and let migration finish between adds.
Use cluster-data-recovery-policy=FULL when you require every persisted group to validate before startup. Use
PARTIAL_MOST_RECENT or PARTIAL_MOST_COMPLETE only during an explicit incident response where availability is more
important than rejecting an incomplete local recovery.
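The "enough members to form a majority" requirement in the steps above follows the usual Raft quorum arithmetic. The helper below is a hypothetical illustration for planning how many members to restore first, not part of LoomCache.

```java
/** Hypothetical helper for sizing the restored majority. */
final class QuorumSketch {

    /** Smallest number of members that forms a majority of the configured cluster. */
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    /** True when the restored members alone can elect a leader and commit entries. */
    static boolean canMakeProgress(int clusterSize, int restoredMembers) {
        return restoredMembers >= majority(clusterSize);
    }
}
```

For a five-member cluster the majority is three, so restoring two members is not enough to make progress.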
Validation checklist
- Latest backup timestamp is within the recovery point objective.
- nodeId in configuration matches the restored data directory.
- WAL .checksum sidecars and snapshot files are present.
- Startup logs show successful snapshot load and WAL replay, or a deliberate partial-recovery decision.
- Cluster state is ACTIVE.
- Partition migration has no active failures.
- Critical maps/lists/queues pass read checks.
- A low-risk write/read smoke test succeeds after the member rejoins.
Rollback
If validation fails, stop the replacement member, move the attempted dataDir aside, restore the previous known-good
copy, and start again. Keep the failed attempt for forensic analysis; it contains the WAL/snapshot evidence needed to
debug corruption or version-mismatch failures.