Resilience & Fault Tolerance

This page describes how LoomCache handles node failures, leader churn, and network partitions. Four mechanisms work together: bounded client retries for known-safe failures, Raft-level safety, phi-accrual failure detection, and client-side caching that keeps recently accessed entries available during brief outages.

Client retry with exponential backoff

LoomClient retries inside the smart-routing send loop, which handles two distinct failure classes:

Transport-level failures — the router’s own backoff loop:

Retries IOException and TimeoutException with exponential backoff starting at 100 ms (ClientConfig.DEFAULT_RETRY_BASE_DELAY), capped at MAX_BACKOFF_MS = 5 s, with ±25 % jitter (JITTER_PERCENTAGE = 0.25).
maxRetries defaults to 3 (ClientConfig.DEFAULT_MAX_RETRIES, configurable).
This layer does not interpret wire response codes; that is the protocol-level layer’s responsibility.
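
The delay schedule this implies can be sketched as follows. This is a minimal illustration rather than LoomClient's own code: the class and method names are hypothetical, only the three constant values come from the description above, and applying jitter after the cap (rather than before it) is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; only the constant values are taken from the text above. */
final class BackoffSketch {
    static final long BASE_DELAY_MS = 100;          // ClientConfig.DEFAULT_RETRY_BASE_DELAY
    static final long MAX_BACKOFF_MS = 5_000;       // cap on the un-jittered delay
    static final double JITTER_PERCENTAGE = 0.25;   // +/- 25 %

    /** Un-jittered delay before retry number {@code attempt} (0-based). */
    static long baseDelayMs(int attempt) {
        // Clamp the shift so the doubling cannot overflow a long.
        int shift = Math.min(Math.max(attempt, 0), 30);
        return Math.min(BASE_DELAY_MS << shift, MAX_BACKOFF_MS);
    }

    /** Same delay scaled by a uniform random factor in [0.75, 1.25). */
    static long jitteredDelayMs(int attempt) {
        double factor = 1.0 + ThreadLocalRandom.current()
                .nextDouble(-JITTER_PERCENTAGE, JITTER_PERCENTAGE);
        return Math.round(baseDelayMs(attempt) * factor);
    }
}
```

Under these assumptions the un-jittered delays for attempts 0 through 6 are 100, 200, 400, 800, 1600, 3200, and 5000 ms.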

Protocol-level responses — handled in the same loop:

Backpressure, too-many-connections, and graceful-drain responses mark the target node as one to avoid on the next attempt, then fall through to backoff. Each occurrence consumes one attempt from the retry budget.
Leader redirects do not consume a retry attempt (attempt-- is applied before continuing). A separate redirectCount counter, capped at maxRedirects = 20 (ClientConfig.DEFAULT_MAX_REDIRECTS), limits redirect chains independently of the retry budget. On each redirect the new leader address is stored in the client leader cache so subsequent requests route correctly.
Unknown operation outcomes are not retried automatically. The client surfaces UnknownOperationOutcomeException so callers can reconcile state or retry only with application-level idempotency.
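
The interaction between the retry budget and the redirect budget can be sketched as below. Everything here is hypothetical scaffolding (the Response enum, the Sender interface, the Outcome class); the sketch assumes maxRetries counts retries after the first send, and it omits the backoff sleep and the avoid-node bookkeeping to stay short.

```java
/** Hypothetical model of the send loop's two independent budgets. */
final class RetryLoopSketch {
    enum Response { OK, LEADER_REDIRECT, SERVER_BUSY, TRANSPORT_ERROR }

    /** Stand-in for one network send; sendNumber counts from 0. */
    interface Sender { Response send(int sendNumber); }

    static final class Outcome {
        final boolean success;
        final int retriesUsed;
        final int redirectsFollowed;
        Outcome(boolean success, int retriesUsed, int redirectsFollowed) {
            this.success = success;
            this.retriesUsed = retriesUsed;
            this.redirectsFollowed = redirectsFollowed;
        }
    }

    static Outcome run(Sender sender, int maxRetries, int maxRedirects) {
        int redirectCount = 0;
        int sends = 0;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Response response = sender.send(sends++);
            switch (response) {
                case OK:
                    return new Outcome(true, attempt, redirectCount);
                case LEADER_REDIRECT:
                    if (redirectCount >= maxRedirects) {
                        // Redirect chain too long: give up without touching the retry budget.
                        return new Outcome(false, attempt, redirectCount);
                    }
                    redirectCount++;
                    attempt--;   // cancels the loop's attempt++, so no retry is consumed
                    continue;
                default:
                    // SERVER_BUSY or TRANSPORT_ERROR: one retry is consumed.
                    // A real client would back off here before the next send.
                    break;
            }
        }
        return new Outcome(false, maxRetries, redirectCount);
    }
}
```

In this model, a sequence of two redirects, one busy response, and a success finishes with one retry used and two redirects followed.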

Configure the retry budget and base delay on the client builder:

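import java.time.Duration;

LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")                // seed address
    .maxRetries(3)                            // ClientConfig.DEFAULT_MAX_RETRIES
    .retryBaseDelay(Duration.ofMillis(100))   // ClientConfig.DEFAULT_RETRY_BASE_DELAY
    .build();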

Smart routing & failover

Keys are hashed to probable owners, so the client sends each request directly to the expected owner instead of through an intermediary.
For linearizable operations, the client routes to the current Raft leader via the client leader cache.
If the leader has changed, the server returns the current leader address; the client updates the leader cache and retries without consuming a retry attempt.
When no preferred owner is known, the client falls back to round-robin across available nodes.
The client keeps one persistent, multiplexed connection per cluster node (correlation-ID pipelined), re-established automatically on failure.
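
The routing order described above can be sketched as follows. The class, its methods, and the hash-modulo owner lookup are all hypothetical: the real key-to-owner mapping is not described on this page, and falling back to owner or round-robin routing when no leader is cached is an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical routing decision: leader cache, then probable owner, then round-robin. */
final class RoutingSketch {
    private final List<String> nodes;
    private final AtomicInteger next = new AtomicInteger();
    private volatile String cachedLeader;   // refreshed whenever a leader redirect arrives

    RoutingSketch(List<String> nodes) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("at least one node required");
        this.nodes = List.copyOf(nodes);
    }

    void onLeaderRedirect(String leaderAddress) {
        cachedLeader = leaderAddress;
    }

    /** Placeholder owner lookup (hash modulo node count); not LoomCache's actual mapping. */
    Optional<String> probableOwner(String key) {
        if (key == null) return Optional.empty();
        return Optional.of(nodes.get(Math.floorMod(key.hashCode(), nodes.size())));
    }

    String pick(String key, boolean linearizable) {
        if (linearizable && cachedLeader != null) return cachedLeader;
        return probableOwner(key).orElseGet(this::roundRobin);
    }

    private String roundRobin() {
        return nodes.get(Math.floorMod(next.getAndIncrement(), nodes.size()));
    }
}
```

Keeping the leader cache separate from the owner lookup means a redirect only changes where linearizable operations go; other requests keep their key-based routing.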

Phi-accrual failure detection

[Interactive figure: Phi-Accrual Failure Detector, dynamic probability (Φ) instead of static timeouts. Node A (observer) tracks the suspicion level (Φ) of Node B (target) against the threshold of 8.0.]

The phi-accrual failure detector is a server-side, inter-node membership mechanism used by the cluster membership layer (not a client-facing API). Instead of a fixed heartbeat timeout, it computes a suspicion level (phi) from observed heartbeat inter-arrival times. Higher phi means more suspicion; once phi exceeds the threshold the node is marked down. Because the estimate tracks observed network conditions, it produces fewer false positives under bursty latency than a fixed timeout would.

Key defaults for the server-side phi-accrual detector:

DEFAULT_PHI_THRESHOLD = 8.0
DEFAULT_SAMPLE_SIZE = 100 heartbeat intervals
Adaptive threshold range: 5.0–16.0 (adjusted based on observed variance)
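
To make the suspicion level concrete, the sketch below computes phi as -log10 of the probability that a heartbeat would still arrive after the time already elapsed. It is a deliberately simplified model, not LoomCache's estimator: it assumes exponentially distributed inter-arrival times (so phi reduces to elapsed / (mean × ln 10)), uses a fixed threshold, and leaves out the adaptive threshold range. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified phi-accrual sketch using an exponential inter-arrival model (an assumption). */
final class PhiSketch {
    static final double DEFAULT_PHI_THRESHOLD = 8.0;
    static final int DEFAULT_SAMPLE_SIZE = 100;

    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long sumMs;
    private long lastHeartbeatMs = -1;

    /** Records a heartbeat and keeps only the most recent DEFAULT_SAMPLE_SIZE intervals. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            sumMs += interval;
            if (intervalsMs.size() > DEFAULT_SAMPLE_SIZE) {
                sumMs -= intervalsMs.removeFirst();
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** phi = -log10(P(next heartbeat arrives later than the elapsed time)). */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) return 0.0;   // not enough samples to suspect anything
        double meanMs = (double) sumMs / intervalsMs.size();
        if (meanMs <= 0) return 0.0;
        double elapsedMs = Math.max(0, nowMs - lastHeartbeatMs);
        // Exponential model: P = exp(-elapsed / mean), so -log10(P) = elapsed / (mean * ln 10).
        return elapsedMs / (meanMs * Math.log(10.0));
    }

    boolean isSuspect(long nowMs) {
        return phi(nowMs) > DEFAULT_PHI_THRESHOLD;
    }
}
```

With steady 1-second heartbeats, this simplified model crosses the 8.0 threshold after roughly 18.4 seconds of silence; a model that also accounts for variance would react differently, which is why the exponential assumption should be read as illustrative only.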

Near cache as a soft buffer

Enable the near cache on the client builder:

import java.time.Duration;

LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .nearCacheEnabled(true)
    .nearCacheTtl(Duration.ofSeconds(30))   // default is Duration.ZERO (no expiry)
    .nearCacheMaxSize(10_000)               // maximum number of cached entries
    .build();

During brief outages, recently accessed keys stay available locally. Server-push invalidation keeps the cache fresh under normal conditions, and sequence tracking discards late or out-of-order invalidations.
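
One way sequence tracking can discard stale invalidations is sketched below. The class is hypothetical, and so is the granularity: this page does not say whether sequences are tracked per key, per partition, or per connection, so the sketch assumes one monotonically increasing sequence per key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical near cache that ignores invalidations older than the last one applied. */
final class NearCacheSketch {
    private final Map<String, String> values = new ConcurrentHashMap<>();
    private final Map<String, Long> lastSequence = new ConcurrentHashMap<>();

    void put(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    /** Applies the invalidation only if its sequence is newer; returns true when applied. */
    boolean onInvalidation(String key, long sequence) {
        boolean[] applied = {false};
        lastSequence.compute(key, (k, seen) -> {
            if (seen != null && sequence <= seen) {
                return seen;            // late, duplicate, or out-of-order: ignore
            }
            values.remove(k);           // drop the cached entry
            applied[0] = true;
            return sequence;            // remember the newest sequence applied
        });
        return applied[0];
    }
}
```

Without the sequence check, an invalidation delayed in the network could evict an entry that was re-fetched after a newer invalidation had already been processed.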

Server-side backpressure

Command-queue backpressure returns server-busy responses so clients can back off.
Spring Boot REST rate limiting can cap request rates when the HTTP surface is enabled.
Server-side connection limits reject clients that exceed the global or per-client-IP limit with a typed error.
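
Command-queue backpressure of this kind is commonly built on a bounded queue whose non-blocking offer fails when the queue is full. The sketch below illustrates that pattern with java.util.concurrent; the class, the Admission enum, and the capacity handling are hypothetical and are not LoomCache's server code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded command queue that rejects work instead of blocking. */
final class CommandQueueSketch {
    enum Admission { ACCEPTED, SERVER_BUSY }

    private final BlockingQueue<Runnable> queue;

    CommandQueueSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** offer() returns false immediately when the queue is full, which maps to server-busy. */
    Admission submit(Runnable command) {
        return queue.offer(command) ? Admission.ACCEPTED : Admission.SERVER_BUSY;
    }

    /** Runs one queued command if present; returns false when the queue is empty. */
    boolean runNext() {
        Runnable command = queue.poll();
        if (command == null) return false;
        command.run();
        return true;
    }
}
```

Rejecting at admission keeps memory bounded and gives the client an explicit signal to back off, instead of letting requests queue until they time out.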

Operator degradation matrix

Use this matrix during incidents. It describes the expected steady-state behavior; if observed behavior differs, preserve logs and WAL/snapshot files before restarting nodes.

| Condition | What clients see | Expected data behavior | Operator action | Metrics and evidence |
| --- | --- | --- | --- | --- |
| One follower lost in a 3-node cluster (`quorum-1`) | Writes and linearizable reads continue through the leader; some clients may reconnect or receive redirects. | Majority quorum remains, acknowledged writes stay durable, but there is no spare failure budget until the member returns. | Restore or replace the member from the backup/restore runbook before doing maintenance on another node. | Cluster status/member liveness, leader term, `loomcache.raft.follower_lag_ms`, connection errors. |
| Quorum lost | Mutating commands fail or time out with quorum/unavailable errors; clients back off and retry until retry budget is exhausted. | No new writes are committed. This is fail-closed to avoid split-brain. Already acknowledged writes remain in the committed log. | Restore enough members to regain majority. If machines are gone, follow the quorum-loss restore path in the persistence guide. | Missing leader or leader stepping down, quorum-unavailable logs, stagnant commit index. |
| Network partition, majority/minority split | Majority side continues; minority side cannot commit writes and may redirect, fail requests, or lose leader lease. | Raft pre-vote and majority commit prevent the minority from acknowledging conflicting writes. | Route clients to the majority side, repair networking, then let followers catch up before declaring recovery. | Membership view mismatch, phi suspicion, leader changes, follower lag, client redirect/error spikes. |
| Leader crash or controlled handoff | Brief write pause; clients retry after a leader redirect once a new leader is visible. | Committed entries survive; uncommitted in-flight operations may need client retry/idempotency. | Avoid manual restarts during election. Confirm a single leader and stable term before resuming maintenance. | `loomcache.raft.elections.total`, current term, commit index movement, client retry counters. |
| Snapshot install to a lagging follower or new member | Cluster stays online if quorum is healthy; the installing member may be slow or unavailable for local reads. | Leader continues committing with the remaining quorum. The follower applies the snapshot, then resumes log replication. | Do not remove additional voters while install is active. Watch duration and disk throughput; investigate if install stalls. | `loomcache.raft.snapshot_install_seconds`, snapshot load/save logs, follower lag and match index. |
| Sharded group rebalance from operator-invoked group scaling | In non-production sharded validation runs, normal reads/writes may continue while P99 rises. Node membership add/remove does not trigger this rebalance path. Explicitly release-gated production sharded rebalances require `-Dloomcache.migration.durableChunks=true` or config key `loomcache.migration.durable-chunks=true`, target Raft/WAL chunk ACK, and raft-0 ownership cutover. | Validation migrations stream data and deduplicate chunks; durable-chunk mode commits incoming chunks through the target Raft/WAL path before ACK. Production sharding otherwise stays fail-closed. | Keep production on the single-group path unless a release explicitly validates sharding, durable migration chunks, and per-group recovery evidence. | `loomcache.partition.migration.duration_seconds`, `.bytes`, `.keys`, `loomcache.partition.backup_promotion.count`, Grafana topology panels. |
| Cluster state `NO_MIGRATION` | Reads and writes continue; ownership movement is paused. | Existing owners continue serving. New topology changes do not rebalance data until migration resumes. | Non-production or emergency admin-plane only; production REST cluster-state mutations are disabled, so use edge routing and health/readiness controls for maintenance. | Operational state in management topology, migration counters idle while membership changed. |
| Application-edge write fence | Writes are stopped before they reach LoomCache; read-only checks may still be used for validation. | No new mutating commands should be accepted by the deployment edge. | Use for production rollback, controlled shutdown, or recovery validation. Production disables REST admin cluster-state mutations, so drive this from the application/router and ingress controls. | Router/ingress change evidence, write rejection or drain logs, topology and persistence health. |
| Sustained backpressure or rate limiting | Clients receive server-busy responses, retry with jitter, or see request timeouts after their retry budget. | Safety is unchanged; latency SLO no longer applies until queues drain. | Reduce client QPS, add capacity, inspect slow-operation and WAL fsync metrics, and avoid raising queues without heap headroom. | `loomcache.command.queue_wait_seconds`, server-busy responses, slow-operation detector records, WAL fsync logs. |
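
The quorum rows of the matrix follow from standard Raft majority arithmetic, which the short sketch below spells out. The class and method names are hypothetical; only the majority rule itself is assumed.

```java
/** Majority arithmetic behind the quorum rows; names are illustrative only. */
final class QuorumSketch {
    /** Smallest number of voters that forms a majority. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** True when enough voters are reachable to commit new writes. */
    static boolean canCommit(int voters, int live) {
        return live >= majority(voters);
    }

    /** How many more voters can fail before quorum is lost (negative once it is lost). */
    static int spareFailureBudget(int voters, int live) {
        return live - majority(voters);
    }
}
```

For a 3-node cluster the majority is 2, so losing one follower leaves the cluster writable with a spare failure budget of 0, and losing a second member means no new writes can be committed.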

Discovery resilience

Static seed-list discovery uses explicit direct-connect seed lists rendered by deployment tooling. Production discovery does not compose fallback discovery strategies.

Raft foundations

Pre-vote — prevents partitioned nodes from disrupting a stable leader with stale terms.
Leader lease — the leader can answer linearizable reads without a quorum round-trip as long as the lease holds.
Snapshot + WAL — crash recovery loads the newest valid snapshot and replays the WAL past its index.
Single-server membership changes — add or remove servers one at a time through the guarded membership-change path only after that path has been explicitly enabled and exercised in the target environment.
Graceful shutdown — shutdown coordination drains work before asking followers to trigger a clean leadership handoff.
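
The snapshot-plus-WAL recovery rule can be illustrated with a small sketch: start from the snapshot state, skip WAL entries the snapshot already covers, and apply the rest in index order. The types and method are hypothetical, the WAL is assumed to be sorted by index, and stopping at the first index gap is an assumption rather than documented LoomCache behavior.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical recovery: snapshot state plus WAL entries past the snapshot index. */
final class RecoverySketch {
    static final class WalEntry {
        final long index;
        final String key;
        final String value;
        WalEntry(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    static Map<String, String> recover(long snapshotIndex,
                                       Map<String, String> snapshotState,
                                       List<WalEntry> wal) {
        Map<String, String> state = new HashMap<>(snapshotState);
        long lastApplied = snapshotIndex;
        for (WalEntry entry : wal) {
            if (entry.index <= lastApplied) continue;    // already reflected in the snapshot
            if (entry.index != lastApplied + 1) break;   // gap in the log: stop replaying (assumption)
            state.put(entry.key, entry.value);
            lastApplied = entry.index;
        }
        return state;
    }
}
```

Replaying only entries past the snapshot index keeps recovery time proportional to the WAL tail rather than to the full history.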

Related

System architecture: how Raft replication, the WAL, and snapshots fit together.
Persistence & durability: WAL, snapshots, backups, and crash recovery in detail.
Chaos testing: how failure handling is exercised in tests.

LoomCache is an independent open-source project. It is not affiliated with, endorsed by, or sponsored by Hazelcast, Inc. or by any other company whose products are named in this documentation. “Hazelcast” is a trademark of Hazelcast, Inc.; references to it are nominative and describe only migration and comparison. All other product and company names are trademarks of their respective owners and are used for identification purposes only.
