Persistence & Durability

LoomCache combines a write-ahead log with periodic state-machine snapshots for durability. The durability path is implemented in the server module and is exercised by crash-recovery and snapshot-restore tests.

ClusterConfig.dataDir defaults to null (in-memory only — useful for tests). Of the shipped config files, config/loomcache-dev.properties enables persistence (loomcache.persistence.enabled=true, loomcache.persistence.wal-directory=./data/wal); config/loomcache.yml pre-populates the wal-directory path but ships with persistence.enabled: false. Spring Boot enables it via loomcache.server.persistence.enabled plus loomcache.server.persistence.wal-directory.
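
As a rough illustration of how the two standalone keys above interact, the sketch below reads them from a java.util.Properties object. PersistenceSettings and fromProperties are hypothetical names written for this example and are not LoomCache classes; only the property keys come from the shipped config files, and treating a missing enabled key as false is an assumption.

```java
import java.nio.file.Path;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of LoomCache.
record PersistenceSettings(boolean enabled, Optional<Path> walDirectory) {

    static PersistenceSettings fromProperties(Properties props) {
        // Assumption: an absent "enabled" key is treated as false here.
        boolean enabled = Boolean.parseBoolean(
                props.getProperty("loomcache.persistence.enabled", "false"));
        String dir = props.getProperty("loomcache.persistence.wal-directory");
        // A pre-populated directory is ignored while persistence is disabled,
        // which is the shape config/loomcache.yml ships with.
        Optional<Path> wal = enabled && dir != null
                ? Optional.of(Path.of(dir))
                : Optional.empty();
        return new PersistenceSettings(enabled, wal);
    }
}
```

With the dev properties values (enabled=true, wal-directory=./data/wal) the helper yields a WAL path; with a directory set but enabled=false it reports in-memory operation.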

The animation below traces the durability lifecycle — writes, snapshot compaction, crash, and recovery:

[Animation: Durability & Crash Recovery. Phases: write, snapshot, write, crash, recover. Write ops append entries to the append-only WAL in the node-local durable dataDir, and every commit is fsynced. Full snapshots move from staging to final, and the 2 most recent are retained. The state machine's in-memory state is volatile. On restart the node (A) loads the newest valid full snapshot, (B) replays the WAL tail, and (C) serves traffic again.]

Commit sequence

1. The client sends a mutating map command.
2. The leader appends the request as a Raft log record and writes the entry locally before follower replication.
3. The durability layer appends the entry to the per-node WAL with a trailing checksum. When syncOnCommit is true (the default), it immediately forces the file to durable storage. Setting -Dloomcache.wal.disableFsync=true disables fsync and drops the guarantee to WRITE (OS page cache only); a third level, NONE, offers no durability guarantee. Raft WALs require unbatched fsync durability, so deferred-fsync batching is rejected for committed Raft entries. Production startup requires FSYNC and rejects both WRITE (that is, disableFsync=true) and NONE.
4. After local durability is finalized, the leader sends AppendEntries to followers.
5. Once a majority of cluster members has acknowledged the entry, the leader advances commitIndex.
6. The state-machine applier decodes the command and applies it through the data-operation handler.
7. The response is returned to the client.

The WAL fsync order ensures that a crash after step 2 but before step 3 is indistinguishable from the command never arriving.
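
The append-then-fsync part of this sequence can be sketched as follows. This is a minimal illustration, not LoomCache's durability layer: the class name WalAppendSketch and the record layout [length][payload][crc32] are assumptions made for the example, and only the ideas of a trailing checksum and a syncOnCommit switch come from the text above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Illustrative only: the on-disk layout [length][payload][crc32] is an assumption.
final class WalAppendSketch implements AutoCloseable {
    private final FileChannel channel;
    private final boolean syncOnCommit;

    WalAppendSketch(Path file, boolean syncOnCommit) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        this.syncOnCommit = syncOnCommit;
    }

    /** Appends one record and returns the number of bytes written. */
    int append(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);

        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length + Integer.BYTES);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.putInt((int) crc.getValue()); // trailing checksum
        buf.flip();

        int written = 0;
        while (buf.hasRemaining()) {
            written += channel.write(buf);
        }
        if (syncOnCommit) {
            // Force file content and metadata to the storage device
            // before the caller is allowed to treat the entry as durable.
            channel.force(true);
        }
        return written;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

In this sketch, skipping force(true) corresponds to the WRITE level: the bytes reach the OS page cache but may not survive a power loss.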

WAL files

Each node owns a WAL root under the configured data directory. The default single Raft group stores its records in the node WAL root; sharded validation groups use separate child directories.
Each group WAL directory contains a single append-only log file where every record carries integrity metadata for replay validation.
Records are validated on replay. A CRC mismatch at the very end of the log (a torn tail) is the expected result of a crash mid-write, and under the default torn-tail recovery mode the damaged tail is silently truncated. Mid-log corruption aborts startup with an IOException unless -Dloomcache.wal.allow-truncate-on-mid-log-corruption=true is explicitly set.
WAL replay reads records in index order.
Automatic compaction creates a full snapshot and compacts WAL entries already covered by that snapshot, gated by the committed-entry threshold and snapshot compaction policy.
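
The difference between torn-tail and mid-log corruption can be illustrated with a small replay routine over an in-memory log image. This is a sketch under stated assumptions: the record layout [length][payload][crc32], the class WalReplaySketch, and the rule "a bad record that reaches end-of-file is a torn tail" are all invented for the example. A real implementation would need a stronger test, since a corrupted length field in the middle of the log also looks like an incomplete record here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative replay; layout and torn-tail heuristic are assumptions.
final class WalReplaySketch {

    /** Valid payloads plus the byte offset where the valid prefix ends. */
    record Replay(List<byte[]> records, int validLength) {}

    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(Integer.BYTES + payload.length + Integer.BYTES)
                .putInt(payload.length).put(payload).putInt((int) crc.getValue()).array();
    }

    static Replay replay(byte[] log, boolean allowTruncateOnMidLogCorruption) throws IOException {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(log);
        int validEnd = 0;
        while (buf.hasRemaining()) {
            int start = buf.position();
            byte[] payload = readValidRecord(buf);
            if (payload != null) {
                records.add(payload);
                validEnd = buf.position();
                continue;
            }
            // Bad record: torn tail if nothing follows it, mid-log otherwise.
            boolean tornTail = buf.position() == log.length;
            if (!tornTail && !allowTruncateOnMidLogCorruption) {
                throw new IOException("mid-log corruption at offset " + start);
            }
            break; // caller may truncate the file to validEnd
        }
        return new Replay(records, validEnd);
    }

    /** Returns the payload, or null if the record is incomplete or fails its CRC. */
    private static byte[] readValidRecord(ByteBuffer buf) {
        if (buf.remaining() < Integer.BYTES) {
            buf.position(buf.limit());
            return null;
        }
        int len = buf.getInt();
        if (len < 0 || buf.remaining() < (long) len + Integer.BYTES) {
            buf.position(buf.limit()); // incomplete record runs to end-of-file
            return null;
        }
        byte[] payload = new byte[len];
        buf.get(payload);
        int stored = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload);
        return (int) crc.getValue() == stored ? payload : null;
    }
}
```

The returned validLength is the offset a caller would truncate the file to when the tail is torn.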

Snapshots

Automatic compaction triggers a snapshot once committed entries since the last snapshot reach snapshotThreshold (standalone config key loomcache.persistence.snapshot-threshold, default 10,000 — wired through Raft group configuration). There is no separate scheduler class.
Registered data structures are captured and streamed into durable snapshot storage.
Snapshots are full point-in-time images in this release; incremental/delta snapshots are not shipped.
Every snapshot carries a trailing SHA-256 checksum; a snapshot missing or failing that checksum is rejected on load.
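
The threshold trigger and the trailing SHA-256 check can be sketched in a few lines. SnapshotChecksumSketch and its method names are hypothetical, and the layout (snapshot body followed directly by a 32-byte digest) is an assumption; the text above only states that the checksum is trailing and that a missing or failing checksum causes rejection.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative only: body-then-digest layout is an assumption.
final class SnapshotChecksumSketch {
    private static final int DIGEST_LEN = 32; // SHA-256 output size in bytes

    /** True once enough entries have committed since the last snapshot. */
    static boolean shouldSnapshot(long commitIndex, long lastSnapshotIndex, long snapshotThreshold) {
        return commitIndex - lastSnapshotIndex >= snapshotThreshold;
    }

    /** Returns body followed by its SHA-256 digest. */
    static byte[] seal(byte[] body) {
        byte[] digest = sha256(body);
        byte[] sealed = Arrays.copyOf(body, body.length + DIGEST_LEN);
        System.arraycopy(digest, 0, sealed, body.length, DIGEST_LEN);
        return sealed;
    }

    /** Verifies the trailing digest and returns the body, or throws. */
    static byte[] open(byte[] sealed) {
        if (sealed.length < DIGEST_LEN) {
            throw new IllegalArgumentException("snapshot is missing its checksum");
        }
        byte[] body = Arrays.copyOf(sealed, sealed.length - DIGEST_LEN);
        byte[] stored = Arrays.copyOfRange(sealed, body.length, sealed.length);
        if (!MessageDigest.isEqual(stored, sha256(body))) {
            throw new IllegalArgumentException("snapshot checksum mismatch");
        }
        return body;
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```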

Recovery

On node startup:

Create and validate the node WAL directory and its snapshot child directory.
Attach WAL and Raft metadata storage, then recover term / voted-for / commit-index metadata.
Validate and load the newest valid snapshot when one exists.
Replay WAL records newer than the snapshot index, or replay the WAL from the beginning when no snapshot exists yet.
Only after the above succeeds does the node begin to accept traffic.

Under the default FULL recovery policy, failure to replay — WAL corruption, snapshot mismatch, metadata inconsistency — aborts startup with a diagnostic. Partial recovery modes are explicit incident-response choices and are documented below.
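
The snapshot-then-replay ordering described above can be illustrated with a toy state machine. Everything in this sketch is hypothetical: RecoverySketch, its Snapshot and LogEntry records, and the choice to model every command as a string put. It also ignores the recovered commit index and Raft metadata, which a real recovery path must honor, and it treats any gap in the log as a fatal inconsistency in the spirit of the FULL policy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "load newest snapshot, then replay the WAL tail".
final class RecoverySketch {
    record Snapshot(long lastIncludedIndex, Map<String, String> state) {}
    record LogEntry(long index, String key, String value) {}

    /** Rebuilds state from an optional snapshot plus WAL entries in index order. */
    static Map<String, String> recover(Snapshot snapshotOrNull, List<LogEntry> wal) {
        Map<String, String> state = new HashMap<>();
        long appliedUpTo = 0; // assumption: log indexes start at 1
        if (snapshotOrNull != null) {
            state.putAll(snapshotOrNull.state());
            appliedUpTo = snapshotOrNull.lastIncludedIndex();
        }
        for (LogEntry entry : wal) {
            if (entry.index() <= appliedUpTo) {
                continue; // already covered by the snapshot
            }
            if (entry.index() != appliedUpTo + 1) {
                // Fail closed rather than serve traffic from a log with a hole.
                throw new IllegalStateException("gap in WAL at index " + entry.index());
            }
            state.put(entry.key(), entry.value());
            appliedUpTo = entry.index();
        }
        return state;
    }
}
```

Passing null for the snapshot replays the WAL from the beginning, which matches the no-snapshot-yet case.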

Install snapshot

Followers that fall too far behind receive an install-snapshot transfer from the leader. The follower applies the snapshot, then resumes normal replication. The install and crash-recovery paths are covered by release-blocking recovery tests.
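
The text does not say how LoomCache decides that a follower is "too far behind". In the usual Raft formulation, the leader falls back to a snapshot when the next entry the follower needs has already been compacted out of the leader's log. The sketch below shows that rule only; InstallSnapshotSketch and its parameter names are hypothetical, and the actual criterion in LoomCache may differ.

```java
// Generic Raft-style decision; not taken from LoomCache source.
final class InstallSnapshotSketch {
    enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

    /**
     * @param followerNextIndex     next log index the leader would send to this follower
     * @param firstRetainedLogIndex oldest index still present in the leader's log after compaction
     */
    static Action choose(long followerNextIndex, long firstRetainedLogIndex) {
        // If the needed entry was compacted away, only a snapshot can catch the follower up.
        return followerNextIndex < firstRetainedLogIndex
                ? Action.INSTALL_SNAPSHOT
                : Action.APPEND_ENTRIES;
    }
}
```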

Tuning

Knob	Default	Notes
`dataDir` (`ClusterConfig`)	`null`	`null` = in-memory; sample configs set `./data/wal`. Put on SSD/NVMe.
`loomcache.persistence.snapshot-threshold` (standalone)	`10_000`	Lower = more frequent snapshots, more IO.
`loomcache.server.persistence.snapshot-threshold` (Spring Boot)	`10_000`	Spring Boot equivalent of the above.
`loomcache.persistence.wal-directory` (standalone)	`persistence`	Distinct per node.
`loomcache.server.persistence.wal-directory` (Spring Boot)	`persistence`	Spring Boot equivalent of the above.
`loomcache.persistence.backup-dir` (standalone)	unset	Separate filesystem/object-store mount for standalone Hot Backup.
`loomcache.persistence.hot-backup-interval-seconds` (standalone)	`0`	`0` disables periodic standalone Hot Backup.
`idempotencyTtlMs`	`900_000`	Aligned with 2PC completed-decision retention; raise for long-lived clients that retry older requests.

External store extensions

Packaged external-store adapters are development and validation surfaces only. loomcache.profile=production rejects generic JDBC MapStore declarations. The optional Spring default-map JPA write-through bean still validates production datasource safety when present, but the default map does not install that JPA bridge in production. Custom MapStore implementations can be used in production only when installed through a clustered server extension, which provides leader-owned external writes, snapshot/graceful-drain recovery for write-behind queues, and the leader-owned read-through fill path. Bare maps without that wiring reject production MapStore use.

Write-behind queues are preserved by full snapshots and graceful drains. Unflushed leader-local write-behind entries outside that window can be lost on failover, so use write-through for data that must reach the external store before the write is acknowledged.

Clustered maps and queues support pluggable external store interfaces that bridge in-memory state with an external database when installed through the supported server-extension path.

MapStore

MapStore<K, V> is the primary extension interface. Implement load(K), store(K, V), and delete(K), plus optional batch variants loadAll, storeAll, and deleteAll. Attach it to a map through the clustered server-extension configuration for the member process.

On maps installed through that clustered path, the Java client can call LoomMap.loadAll(...) to trigger a leader-owned, Raft-replicated load from the store. Loaded entries are not written back to the store. On maps without a MapStore the call is a no-op, and partitioned/sharded maps reject the operation. LoomMap.putTransient(...) is the inverse escape hatch: it writes a replicated in-memory value while bypassing external write-through.
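
To show the shape of such an implementation, the sketch below defines a stand-in interface and an in-memory store behind it. SketchMapStore is not the real MapStore interface: the method names follow the text above, but the parameter and return types are guesses, and the default batch methods are an assumption about how optional variants might fall back to the single-key calls. InMemoryBackingStore is likewise hypothetical and stands in for a database.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the real interface; signatures here are assumptions.
interface SketchMapStore<K, V> {
    V load(K key);
    void store(K key, V value);
    void delete(K key);

    // Optional batch variants, defaulting to per-key calls in this sketch.
    default Map<K, V> loadAll(Collection<K> keys) {
        Map<K, V> found = new HashMap<>();
        for (K key : keys) {
            V value = load(key);
            if (value != null) {
                found.put(key, value); // keys absent from the store are skipped
            }
        }
        return found;
    }

    default void storeAll(Map<K, V> entries) {
        entries.forEach(this::store);
    }

    default void deleteAll(Collection<K> keys) {
        keys.forEach(this::delete);
    }
}

// Hypothetical backing store that keeps rows in memory instead of a database.
final class InMemoryBackingStore implements SketchMapStore<String, String> {
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    @Override public String load(String key) { return rows.get(key); }
    @Override public void store(String key, String value) { rows.put(key, value); }
    @Override public void delete(String key) { rows.remove(key); }
}
```

A database-backed implementation would typically override the batch methods to issue one round trip per batch rather than one per key.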

GenericMapStoreConfig (loom-common) carries the operational parameters:

Parameter	Default	Notes
`writeThrough`	—	Synchronous write to the external store on every put/delete. Mutually exclusive with `writeBehind`.
`writeBehind`	—	Async write queue; batched flushes at `writeDelay` intervals.
`writeBatchSize`	`100`	Maximum entries per flush batch.
`writeDelay`	`1s`	Flush interval for write-behind mode.
`loadMode`	`LAZY`	`LAZY` reads on first access; `EAGER` pre-loads on map registration.
`maxLoadKeys`	`100_000`	Maximum keys returned by `loadAllKeys()`.
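
How writeBatchSize bounds a write-behind flush can be sketched as below. WriteBehindSketch is a hypothetical class, not the LoomCache queue: it coalesces repeated writes to the same key (an assumption), expects some external timer to call flush() every writeDelay, and drops a batch if the flusher throws, which a production queue would need to handle.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative write-behind buffer; behavior details are assumptions.
final class WriteBehindSketch {
    private final int writeBatchSize;
    private final Consumer<Map<String, String>> flusher;
    private final LinkedHashMap<String, String> pending = new LinkedHashMap<>();

    WriteBehindSketch(int writeBatchSize, Consumer<Map<String, String>> flusher) {
        if (writeBatchSize <= 0) {
            throw new IllegalArgumentException("writeBatchSize must be positive");
        }
        this.writeBatchSize = writeBatchSize;
        this.flusher = flusher;
    }

    /** Mirrors the rule that the two write modes are mutually exclusive. */
    static void validateModes(boolean writeThrough, boolean writeBehind) {
        if (writeThrough && writeBehind) {
            throw new IllegalArgumentException("writeThrough and writeBehind are mutually exclusive");
        }
    }

    /** Buffers a write; a later write to the same key replaces the earlier one. */
    synchronized void enqueue(String key, String value) {
        pending.put(key, value);
    }

    /** Drains everything pending in batches and returns the number of batches sent. */
    synchronized int flush() {
        int batches = 0;
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map<String, String> batch = new LinkedHashMap<>();
            while (it.hasNext() && batch.size() < writeBatchSize) {
                Map.Entry<String, String> entry = it.next();
                batch.put(entry.getKey(), entry.getValue());
                it.remove();
            }
            flusher.accept(batch); // e.g. a storeAll call on the external store
            batches++;
        }
        return batches;
    }
}
```

Entries sitting in pending between flushes are exactly the ones at risk on failover, which is why write-through is the safer mode for data that must reach the external store before acknowledgement.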

JdbcGenericMapStore is the built-in JDBC implementation. It supports only allowlisted JDBC drivers: oracle.jdbc.OracleDriver, oracle.jdbc.driver.OracleDriver, and org.h2.Driver. SQL DML uses Oracle-compatible MERGE INTO syntax; H2 is supported in MODE=Oracle for tests.
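
A driver allowlist check and an Oracle-style upsert might look like the sketch below. The three driver class names come from the text above; everything else is hypothetical, including the class JdbcStoreSketch, the table cache_entries, and its columns. The statement is written in plain Oracle MERGE syntax and has not been checked against the SQL that JdbcGenericMapStore actually generates.

```java
import java.util.Set;

// Illustrative only; table and column names are invented for the example.
final class JdbcStoreSketch {
    static final Set<String> ALLOWED_DRIVERS = Set.of(
            "oracle.jdbc.OracleDriver",
            "oracle.jdbc.driver.OracleDriver",
            "org.h2.Driver");

    /** Rejects any driver class that is not on the allowlist. */
    static void requireAllowedDriver(String driverClassName) {
        if (driverClassName == null || !ALLOWED_DRIVERS.contains(driverClassName)) {
            throw new IllegalArgumentException("JDBC driver not allowlisted: " + driverClassName);
        }
    }

    // Upsert with two bind parameters: key, then value.
    static final String UPSERT_SQL = """
            MERGE INTO cache_entries t
            USING (SELECT ? AS k, ? AS v FROM dual) s
            ON (t.entry_key = s.k)
            WHEN MATCHED THEN UPDATE SET t.entry_value = s.v
            WHEN NOT MATCHED THEN INSERT (entry_key, entry_value) VALUES (s.k, s.v)
            """;
}
```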

QueueStore

QueueStore<T> provides an equivalent extension point for queues. A queue with an attached QueueStore cannot be safely snapshotted or restored through the standard Raft snapshot path; startup/runtime validation rejects that shape with an IllegalStateException.

Optional JPA write-through

loom-spring-boot can back the embedded REST-facing default map with a packaged JPA write-through path. Use it when you need that default map to survive a full cluster wipe, backed by an external relational store. This is independent of the Raft WAL and runs in the Spring module only. It is non-production only in this release; the production profile keeps the default map on the in-memory/Raft-backed path and does not install this packaged JPA write-through bridge.

This JPA store is not shared by Spring Cache abstraction namespaces. LoomCacheManager creates client-backed LoomSpringCache instances whose cache names map to LoomCache map names; they do not use WriteThroughCacheStore.

JPA write-through validation expects an externally managed Oracle schema. The packaged default is spring.jpa.hibernate.ddl-auto=validate; use none if validation is handled outside Hibernate. In production, unsafe or local datasource families are rejected if this optional bean is present; an external Oracle datasource with validate or none can pass preflight, but it is still not installed as production cache persistence. Test and local H2 profiles may opt into update or create-drop explicitly.

Backup and restore runbook

Use this procedure when a machine dies, a disk is being replaced, or an operator needs to rehearse disaster recovery. It assumes persistence is enabled and every member has a stable nodeId.

Backup types

Backup type	What it captures	Restore use
Filesystem copy of `dataDir`	The per-node WAL tree, durable Raft metadata, file-lock state, and any local snapshots, each carrying integrity metadata.	Preferred for replacing a single failed machine with the same `nodeId`.
Hot Backup (`persistence.backup-dir`)	Operator-triggered group snapshots plus a manifest under a separate backup directory. Production requires HMAC manifest signing.	Point-in-time snapshot retention and off-host archival. Keep alongside full `dataDir` backups until a dedicated restore command is introduced.

Store backups outside the live disk. A local copy on the same volume does not protect against machine or volume loss.

Routine backup

Configure each member with a persistent dataDir.
For standalone deployments, configure loomcache.persistence.backup-dir on a different disk or mounted object-store path.
For standalone deployments, set loomcache.persistence.hot-backup-interval-seconds for periodic Hot Backup, or trigger a hot backup from an operator hook. For Docker/Spring Boot deployments, keep full dataDir backups as the supported restore input until scheduled Hot Backup keys are exposed through LoomProperties.
Keep regular filesystem backups of the full dataDir tree, including each node’s WAL, metadata, lock, and snapshot files.
After every backup job, verify that the copied tree contains the node’s WAL directory and the WAL .log file. Snapshot files may be absent before the first compaction threshold is reached; when present, include them in the backup. For Hot Backup, verify the newest hot-backup-<nodeId>-<timestamp>.properties manifest, the matching .complete marker, every referenced group snapshot file, and the marker’s manifestHmacSha256 / snapshotSetHmacSha256 fields.
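
The filesystem part of that verification could be automated with a small check like the one below. It only confirms that at least one regular file ending in .log exists somewhere under the copied tree. BackupCheckSketch is a hypothetical helper, and because the exact WAL file name and directory layout are not specified here, the check deliberately matches on the extension alone. It does not validate Hot Backup manifests or markers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical post-backup sanity check; not a LoomCache tool.
final class BackupCheckSketch {

    /** True if any regular file under backupRoot has a .log extension. */
    static boolean containsWalLog(Path backupRoot) {
        try (Stream<Path> paths = Files.walk(backupRoot)) {
            return paths
                    .filter(Files::isRegularFile)
                    .anyMatch(p -> p.getFileName().toString().endsWith(".log"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```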

Production Hot Backup requires a Base64-encoded 32+ byte HMAC key in LOOMCACHE_BACKUP_MANIFEST_HMAC_KEY_BASE64; unsigned production hot backups fail closed. The -Dloomcache.backup.manifest.hmac.key.base64 form is for local tests and isolated dry runs only, because JVM arguments can be exposed through process listings.
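
For operators who want to reason about the key requirement, the sketch below decodes a Base64 key, enforces the 32-byte minimum, and computes an HMAC-SHA256 over some manifest bytes using the standard javax.crypto API. ManifestHmacSketch is hypothetical. Which bytes LoomCache signs, and how the manifestHmacSha256 and snapshotSetHmacSha256 values are encoded in the marker file, are not specified here, so the sketch works on raw byte arrays only.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative HMAC handling; input framing and output encoding are assumptions.
final class ManifestHmacSketch {

    /** Decodes the key and enforces the 32-byte minimum. */
    static byte[] decodeKey(String base64Key) {
        byte[] key = Base64.getDecoder().decode(base64Key);
        if (key.length < 32) {
            throw new IllegalArgumentException("HMAC key must be at least 32 bytes");
        }
        return key;
    }

    static byte[] sign(byte[] key, byte[] manifestBytes) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(manifestBytes);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(byte[] key, byte[] manifestBytes, byte[] expectedMac) {
        // MessageDigest.isEqual compares in constant time.
        return MessageDigest.isEqual(sign(key, manifestBytes), expectedMac);
    }
}
```

In a real check the key string would come from System.getenv("LOOMCACHE_BACKUP_MANIFEST_HMAC_KEY_BASE64") rather than a JVM argument, for the process-listing reason given above.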

Replace one dead machine when quorum survived

Use this path when the cluster still has a majority and is serving traffic.

Confirm the failed member is absent from /api/cluster/status.
Provision a replacement host with the same LoomCache version, port configuration, TLS/security material, and nodeId as the failed member.
Restore the failed member’s most recent dataDir backup to the same configured path on the replacement host.
Preserve ownership and permissions for the process user.
Start the replacement member.
Watch startup logs for snapshot load, WAL replay, and Raft metadata validation. Startup must fail closed on corrupt WAL/snapshot data; do not delete files to make the node boot unless you intentionally choose partial recovery.
Wait for the member to appear in cluster status and for partition migration/backup-promotion metrics to settle.
Run production client/Raft read checks for critical maps and a low-risk write/read smoke test, accepting the restored backup point’s RPO.

If no dataDir backup exists but the cluster still has quorum, start a clean replacement with the intended nodeId only after confirming the remaining members own the latest committed data. The cluster will rebuild ownership through normal membership and migration flows; the lost node-local WAL cannot be reconstructed on disk.

Restore after quorum loss

Use this path when too many machines died for the remaining cluster to make progress.

Stop every surviving member to prevent mixed old/new recovery attempts.
Pick a consistent backup point across enough members to form a majority. Prefer backups taken from the same maintenance window.
Restore each selected member’s full dataDir backup to a replacement host with the same nodeId.
Start only the restored majority first.
Verify Raft startup, cluster state, and data smoke tests before admitting extra clean members.
After the restored majority is healthy, add remaining members one at a time and let migration finish between adds.

ClusterDataRecoveryPolicy controls how strict the node is during recovery. The default is FULL, which requires every persisted group to validate before startup. Use PARTIAL_MOST_RECENT or PARTIAL_MOST_COMPLETE only during an explicit incident response where availability is more important than rejecting an incomplete local recovery.

Validation checklist

Latest backup timestamp is within the recovery point objective.
nodeId in configuration matches the restored data directory.
The restored node WAL and metadata files are present; snapshot files are present only after the node has compacted at least once.
Startup logs show successful metadata recovery plus snapshot/WAL recovery, or a deliberate partial-recovery decision.
Cluster state is ACTIVE.
Partition migration has no active failures.
Critical maps/lists/queues pass read checks.
A low-risk write/read smoke test succeeds after the member rejoins.

Rollback

If validation fails, stop the replacement member, move the attempted dataDir aside, restore the previous known-good copy, and start again. Keep the failed attempt for forensic analysis; it contains the WAL/snapshot evidence needed to debug corruption or version-mismatch failures.

LoomCache is an independent open-source project. It is not affiliated with, endorsed by, or sponsored by Hazelcast, Inc. or by any other company whose products are named in this documentation. “Hazelcast” is a trademark of Hazelcast, Inc.; references to it are nominative and describe only migration and comparison. All other product and company names are trademarks of their respective owners and are used for identification purposes only.