Persistence & Durability

LoomCache combines a write-ahead log with periodic state-machine snapshots for durability. Implementation lives in loom-server/src/main/java/com/loomcache/server/persistence.

Persistence is enabled by default (dataDir = ./data/wal). Set dataDir = null on ClusterConfig (or disable loomcache.server.persistence.enabled in Spring Boot) to run in-memory — only useful for tests.

  1. Client sends a mutating command (e.g. MAP_PUT).
  2. The leader appends the message as a Raft LogEntry and sends AppendEntries to followers.
  3. WalWriter fsyncs the entry with CRC32 and updates the sidecar .checksum file.
  4. Once a majority of the cluster (the leader plus enough followers) has acknowledged the entry, the leader advances commitIndex.
  5. The state-machine applier decodes the command and applies it via DataOperationHandler.
  6. Response is returned to the client.

The WAL fsync order ensures that a crash after step 2 but before step 3 is indistinguishable from the command never having arrived.
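Step 3 of the write path can be sketched as follows. This is a minimal illustration, not LoomCache's WalWriter: the class name WalAppendSketch and the record layout (length prefix, payload, trailing CRC32) are assumptions, and the sidecar .checksum update is left out. It shows only the general technique of framing a record with a CRC32 and forcing it to disk before acknowledging.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

/** Illustrative only: not LoomCache's WalWriter. The record layout is an assumption. */
final class WalAppendSketch {

    /** Frames a payload as [int length][payload][long CRC32 of payload]. */
    static ByteBuffer frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length + Long.BYTES);
        buf.putInt(payload.length).put(payload).putLong(crc.getValue());
        buf.flip();
        return buf;
    }

    /** Appends one framed record and forces it to stable storage before returning. */
    static void appendDurably(Path segment, byte[] payload) {
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = frame(payload);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // fsync: flush file data and metadata
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the first record of a segment and rejects it if the CRC32 does not match. */
    static byte[] readFirstRecord(Path segment) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(segment));
            byte[] payload = new byte[buf.getInt()];
            buf.get(payload);
            long stored = buf.getLong();
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            if (crc.getValue() != stored) {
                throw new IllegalStateException("CRC32 mismatch in " + segment);
            }
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would acknowledge the entry only after appendDurably returns, so an entry that was never forced to disk is never counted as persisted.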

  • Per-node directory at dataDir.
  • Rotating append-only segments (WalWriter) with a trailing CRC32.
  • Each segment has a sidecar .checksum file validated on boot; corruption is surfaced as an IllegalStateException during node startup.
  • WalReader replays segments in order.
  • WalCompactor truncates segments once the last snapshot index has advanced.
  • SnapshotScheduler triggers snapshots based on loomcache.server.persistence.snapshot-threshold (default 10 000 committed entries).
  • StateMachineSnapshotManager walks the DataStructureRegistry and streams a serialized state to SnapshotStore.
  • DeltaSnapshot + SnapshotChain support incremental snapshots across time.
  • Every snapshot carries a CRC32 checksum; corrupt snapshots are rejected on load.
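The interaction between the snapshot threshold and WAL compaction can be sketched as below. The class SnapshotThresholdSketch and its Segment type are hypothetical and do not mirror SnapshotScheduler or WalCompactor. The sketch assumes that a segment becomes removable only when every entry in it is covered by the latest snapshot.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: names and the segment model are assumptions, not LoomCache classes. */
final class SnapshotThresholdSketch {

    /** A WAL segment covering the inclusive Raft index range [firstIndex, lastIndex]. */
    static final class Segment {
        final String name;
        final long firstIndex;
        final long lastIndex;

        Segment(String name, long firstIndex, long lastIndex) {
            this.name = name;
            this.firstIndex = firstIndex;
            this.lastIndex = lastIndex;
        }
    }

    private final long threshold;
    private long lastSnapshotIndex;

    SnapshotThresholdSketch(long threshold) {
        this.threshold = threshold;
    }

    /** True once at least `threshold` entries have been committed since the last snapshot. */
    boolean shouldSnapshot(long commitIndex) {
        return commitIndex - lastSnapshotIndex >= threshold;
    }

    /** Records that a snapshot now covers every entry up to and including `index`. */
    void snapshotTakenAt(long index) {
        lastSnapshotIndex = index;
    }

    /** Returns the segments whose entries are all covered by the last snapshot. */
    List<Segment> compactable(List<Segment> segments) {
        List<Segment> out = new ArrayList<>();
        for (Segment s : segments) {
            if (s.lastIndex <= lastSnapshotIndex) {
                out.add(s);
            }
        }
        return out;
    }
}
```

With the default threshold of 10 000, a lower value makes shouldSnapshot fire more often, which shortens replay at the cost of more snapshot IO.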

On CacheNode.start:

  1. Validate the configured WAL directory is writable.
  2. Load the most recent valid snapshot (SnapshotStore).
  3. Replay the WAL from snapshotIndex + 1.
  4. Reconstruct RaftMetadataStore term / vote / commit index — reject mismatches.
  5. Only after the above succeeds does the node begin to accept traffic.

Failure to replay — WAL corruption, snapshot mismatch, metadata inconsistency — aborts startup with a diagnostic.
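The recovery sequence can be modelled with a small in-memory sketch. Everything here is hypothetical (RecoverySketch, Snapshot, Entry, and a state machine reduced to a string map), and the Raft metadata check is omitted. It illustrates only the order of operations: pick the newest snapshot that passes its checksum, replay the WAL from snapshotIndex + 1, and fail closed on a gap.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a toy model of snapshot-plus-WAL recovery, not LoomCache code. */
final class RecoverySketch {

    /** Hypothetical snapshot: state as of `index`, plus the outcome of its checksum check. */
    static final class Snapshot {
        final long index;
        final Map<String, String> state;
        final boolean checksumOk;

        Snapshot(long index, Map<String, String> state, boolean checksumOk) {
            this.index = index;
            this.state = state;
            this.checksumOk = checksumOk;
        }
    }

    /** Hypothetical WAL entry: a single put at a given Raft log index. */
    static final class Entry {
        final long index;
        final String key;
        final String value;

        Entry(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    /** Rebuilds state from the newest valid snapshot plus the WAL entries that follow it. */
    static Map<String, String> recover(List<Snapshot> snapshots, List<Entry> walInOrder) {
        Snapshot base = null;
        for (Snapshot s : snapshots) {
            if (s.checksumOk && (base == null || s.index > base.index)) {
                base = s; // newest snapshot that passed validation
            }
        }
        long applied = (base == null) ? 0 : base.index;
        Map<String, String> state = new TreeMap<>();
        if (base != null) {
            state.putAll(base.state);
        }
        for (Entry e : walInOrder) {
            if (e.index <= applied) {
                continue; // already covered by the snapshot
            }
            if (e.index != applied + 1) {
                // Fail closed: a gap means the log cannot be trusted.
                throw new IllegalStateException(
                        "WAL gap: expected index " + (applied + 1) + " but found " + e.index);
            }
            state.put(e.key, e.value);
            applied = e.index;
        }
        return state;
    }
}
```

The point of the design is that a corrupt newer snapshot is skipped in favour of an older valid one, while a hole in the log aborts recovery instead of silently producing partial state.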

Followers that fall too far behind receive RAFT_INSTALL_SNAPSHOT from the leader. RaftNode applies the snapshot, then resumes normal replication. The install path is itself Jepsen-tested via CrashDuringCompactionTest and FullDiskTest.
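The text does not state the exact rule for "too far behind". A common Raft rule, assumed here, is that a follower needs a snapshot when the next entry it requires has already been compacted away on the leader. The class CatchUpSketch and its method names are hypothetical.

```java
/** Illustrative only: assumes the usual Raft rule, which may differ from RaftNode's. */
final class CatchUpSketch {

    enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

    /**
     * Chooses how the leader should catch a follower up.
     *
     * @param followerNextIndex next log index the follower needs
     * @param snapshotIndex     last index covered by the leader's snapshot;
     *                          entries at or below it are assumed to be compacted
     */
    static Action choose(long followerNextIndex, long snapshotIndex) {
        return followerNextIndex <= snapshotIndex
                ? Action.INSTALL_SNAPSHOT
                : Action.APPEND_ENTRIES;
    }

    /** After installing a snapshot, normal replication resumes right after it. */
    static long nextIndexAfterInstall(long snapshotIndex) {
        return snapshotIndex + 1;
    }
}
```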

| Knob | Default | Notes |
| --- | --- | --- |
| dataDir (ClusterConfig) | ./data/wal | Put on SSD/NVMe. |
| loomcache.server.persistence.snapshot-threshold | 10_000 | Lower = more frequent snapshots, more IO. |
| loomcache.server.persistence.wal-directory | ./data/wal | Distinct per node. |
| loomcache.persistence.backup-dir | unset | Separate filesystem/object-store mount for Hot Backup. |
| loomcache.persistence.hot-backup-interval-seconds | 0 | 0 disables periodic Hot Backup. |
| idempotencyTtlMs | 60_000 | Raise for long-lived clients that retry older requests. |

loom-spring-boot can back DistributedMap with a JPA repository (see CacheEntry, CacheEntryRepository, CacheNodeConfig). This is useful when cached data must survive a full cluster wipe, because the entries are also held in an external relational store. It is independent of the Raft WAL and runs in the Spring module only.

Use this procedure when a machine dies, a disk is being replaced, or an operator needs to rehearse disaster recovery. It assumes persistence is enabled and every member has a stable nodeId.

| Backup type | What it captures | Restore use |
| --- | --- | --- |
| Filesystem copy of dataDir | WAL segments, sidecar checksum files, Raft metadata, and local snapshots. | Preferred for replacing a single failed machine with the same nodeId. |
| Hot Backup (persistence.backup-dir) | Operator-triggered group snapshots plus a manifest under a separate backup directory. | Point-in-time snapshot retention and off-host archival. Keep alongside full dataDir backups until a dedicated restore command is introduced. |

Store backups off the live disk. A local copy on the same volume does not protect against the loss of the machine or the volume.

  1. Configure each member with a persistent dataDir.
  2. Configure loomcache.persistence.backup-dir on a different disk or mounted object-store path.
  3. Set loomcache.persistence.hot-backup-interval-seconds for periodic Hot Backup, or trigger CacheNode.triggerHotBackup() from an operator hook.
  4. Keep regular filesystem backups of the full dataDir tree, including WAL segment sidecar files and snapshot files.
  5. After every backup job, verify that the copied tree contains the node’s WAL directory, snapshots/, and checksum sidecars. For Hot Backup, verify the newest hot-backup-<nodeId>-<timestamp>.properties manifest and every referenced group snapshot file.
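The verification in step 5 can be automated along these lines for a filesystem copy of dataDir. The sketch makes loud assumptions about the on-disk layout: WAL segments are taken to end in .wal, each sidecar is taken to be named <segment>.checksum, and snapshots/ is expected directly under the backup root. Check the real layout of your dataDir before relying on any of these. The class name BackupCheckSketch is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Illustrative only: file naming (.wal, <segment>.checksum) is an assumption. */
final class BackupCheckSketch {

    /** Returns a human-readable list of problems; an empty list means the tree looks complete. */
    static List<String> problems(Path backupRoot) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(backupRoot.resolve("snapshots"))) {
            problems.add("missing snapshots/ directory");
        }
        try (Stream<Path> files = Files.walk(backupRoot)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".wal"))
                 .forEach(segment -> {
                     // Assumed sidecar name: the segment file name plus ".checksum".
                     Path sidecar = segment.resolveSibling(segment.getFileName() + ".checksum");
                     if (!Files.isRegularFile(sidecar)) {
                         problems.add("no checksum sidecar for " + backupRoot.relativize(segment));
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return problems;
    }
}
```

This checks presence only. It does not validate the checksums themselves, which remains the job of node startup.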

Replace one dead machine when quorum survived

Use this path when the cluster still has a majority and is serving traffic.

  1. Confirm the failed member is absent from /api/cluster/status or the management dashboard.
  2. Provision a replacement host with the same LoomCache version, port configuration, TLS/security material, and nodeId as the failed member.
  3. Restore the failed member’s most recent dataDir backup to the same configured path on the replacement host.
  4. Preserve ownership and permissions for the process user.
  5. Start the replacement member.
  6. Watch startup logs for snapshot load, WAL replay, and Raft metadata validation. Startup must fail closed on corrupt WAL/snapshot data; do not delete files to make the node boot unless you intentionally choose partial recovery.
  7. Wait for the member to appear in cluster status and for partition migration/backup-promotion metrics to settle.
  8. Run read checks for critical maps and a low-risk write/read smoke test.

If no dataDir backup exists but the cluster still has quorum, start a clean replacement with the intended nodeId only after confirming the remaining members own the latest committed data. The cluster will rebuild ownership through normal membership and migration flows; the lost node-local WAL cannot be reconstructed on disk.

Use this path when too many machines died for the remaining cluster to make progress.

  1. Stop every surviving member to prevent mixed old/new recovery attempts.
  2. Pick a consistent backup point across enough members to form a majority. Prefer backups taken from the same maintenance window.
  3. Restore each selected member’s full dataDir backup to a replacement host with the same nodeId.
  4. Start only the restored majority first.
  5. Verify Raft startup, cluster state, and data smoke tests before admitting extra clean members.
  6. After the restored majority is healthy, add remaining members one at a time and let migration finish between adds.
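Before starting the restored members in step 4, it helps to confirm that the selected backups can actually form a majority. The helper below is a hypothetical pre-flight check, not part of LoomCache. It assumes the configured membership is known as a set of nodeId strings and counts only distinct restored nodeIds that belong to that set.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a pre-flight quorum check for a restore plan. */
final class QuorumSketch {

    /** Smallest number of members that forms a majority of the cluster. */
    static int majority(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("clusterSize must be >= 1");
        }
        return clusterSize / 2 + 1;
    }

    /** True if the distinct restored nodeIds that are configured members reach a majority. */
    static boolean canFormQuorum(Set<String> configuredNodeIds, Collection<String> restoredNodeIds) {
        Set<String> usable = new HashSet<>(restoredNodeIds);
        usable.retainAll(configuredNodeIds); // ignore unknown or mistyped nodeIds
        return usable.size() >= majority(configuredNodeIds.size());
    }
}
```

For a five-member cluster this requires three distinct restored members; restoring the same nodeId twice does not count twice.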

Use cluster-data-recovery-policy=FULL when you require every persisted group to validate before startup. Use PARTIAL_MOST_RECENT or PARTIAL_MOST_COMPLETE only during an explicit incident response where availability is more important than rejecting an incomplete local recovery.

  • Latest backup timestamp is within the recovery point objective.
  • nodeId in configuration matches the restored data directory.
  • WAL .checksum sidecars and snapshot files are present.
  • Startup logs show successful snapshot load and WAL replay, or a deliberate partial-recovery decision.
  • Cluster state is ACTIVE.
  • Partition migration has no active failures.
  • Critical maps/lists/queues pass read checks.
  • A low-risk write/read smoke test succeeds after the member rejoins.

If validation fails, stop the replacement member, move the attempted dataDir aside, restore the previous known-good copy, and start again. Keep the failed attempt for forensic analysis; it contains the WAL/snapshot evidence needed to debug corruption or version-mismatch failures.