Resilience & Fault Tolerance

Distributed systems fail. LoomCache’s resilience story is built on four pillars: transparent client retries, Raft-level safety, phi-accrual failure detection, and client-side caching that can ride out brief outages.

The LoomClient handles transient failures through RetryPolicy:

  • Exponential backoff starting at 100 ms, capped at 5 s, with ±25 % jitter.
  • maxRetries defaults to 3 (configurable).
  • RESPONSE_REDIRECT responses update LeaderTracker transparently — the retry goes to the new leader.
  • RESPONSE_SERVER_BUSY triggers backoff without counting against the retry budget.
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .maxRetries(3)
    .retryBaseDelay(Duration.ofMillis(100))
    .build();
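The backoff rules above can be sketched as plain arithmetic. This is an illustration only, not RetryPolicy's actual code: the class name BackoffSketch, the doubling factor, and the choice to apply jitter after the cap are all assumptions; only the 100 ms base, the 5 s cap, and the ±25 % jitter come from the text.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not part of the LoomCache API. */
final class BackoffSketch {
    static final long BASE_MS = 100;    // documented starting delay
    static final long CAP_MS = 5_000;   // documented cap
    static final double JITTER = 0.25;  // documented +/-25 % jitter

    /**
     * Delay for a zero-based attempt number.
     * jitterUnit is expected in [-1, 1]; 0 means "no jitter".
     * Assumption: the delay doubles per attempt and jitter is applied after the cap.
     */
    static long delayMillis(int attempt, double jitterUnit) {
        // Limit the shift so the multiplication cannot overflow a long.
        long exponential = BASE_MS << Math.min(Math.max(attempt, 0), 20);
        long capped = Math.min(exponential, CAP_MS);
        return Math.round(capped * (1.0 + JITTER * jitterUnit));
    }

    /** Same calculation with a random jitter factor drawn per call. */
    static Duration nextDelay(int attempt) {
        double unit = ThreadLocalRandom.current().nextDouble(-1.0, 1.0);
        return Duration.ofMillis(delayMillis(attempt, unit));
    }
}
```

Under these assumptions the un-jittered schedule is 100 ms, 200 ms, 400 ms, and so on until it flattens at 5 s. Jitter spreads retries from many clients so they do not hit a recovering node at the same instant.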
  • Keys are hashed to probable owners — the client sends directly to the owner rather than bouncing off a proxy.
  • For linearizable operations, the client routes to the current Raft leader via LeaderTracker.
  • If the leader changed, the server responds with RESPONSE_REDIRECT; the client updates the cache and retries.
  • Connection pools are per-node LIFO with idle eviction; ConnectionHealthMonitor closes stuck sockets.
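The redirect handling described above can be sketched as a small loop. Everything here is hypothetical scaffolding (RedirectRetrySketch, Reply, the transport function, MAX_BUSY_WAITS); the real LeaderTracker and wire protocol are not shown in this page. The sketch assumes a redirect consumes one retry while a busy response does not, and it omits the backoff sleep for brevity.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical sketch of redirect-aware retries; not the LoomClient implementation. */
final class RedirectRetrySketch {
    enum Status { OK, REDIRECT, SERVER_BUSY }

    static final class Reply {
        final Status status;
        final String leaderHint; // only meaningful when status is REDIRECT

        Reply(Status status, String leaderHint) {
            this.status = status;
            this.leaderHint = leaderHint;
        }
    }

    // Assumed safety valve so a permanently busy server cannot spin the loop forever.
    static final int MAX_BUSY_WAITS = 50;

    /**
     * Sends to the cached leader and follows redirects until success or budget exhaustion.
     * transport maps a node address to that node's reply.
     */
    static Reply call(AtomicReference<String> cachedLeader,
                      Function<String, Reply> transport,
                      int maxRetries) {
        int retriesUsed = 0;
        int busyWaits = 0;
        while (true) {
            Reply reply = transport.apply(cachedLeader.get());
            if (reply.status == Status.OK) {
                return reply;
            }
            if (reply.status == Status.REDIRECT) {
                // Remember the new leader so later calls go there directly.
                cachedLeader.set(reply.leaderHint);
                if (++retriesUsed > maxRetries) {
                    return reply;
                }
            } else {
                // SERVER_BUSY: does not touch the retry budget; a real client would back off here.
                if (++busyWaits > MAX_BUSY_WAITS) {
                    return reply;
                }
            }
        }
    }
}
```

Caching the leader in a shared reference means only the first request after an election pays the extra round trip.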

Phi-Accrual Failure Detector

Dynamic probability (Φ) instead of static timeouts

[Interactive diagram: Node A (observer) tracks heartbeats from Node B (target) and plots the suspicion level Φ against a threshold of 8.0.]

Instead of a fixed heartbeat timeout, LoomCache uses PhiAccrualFailureDetector to compute a suspicion level (Φ) from observed heartbeat inter-arrival times. The longer a heartbeat is overdue relative to that history, the higher Φ climbs; once it crosses the threshold, the node is marked down. Because the estimate adapts to changing network conditions, it produces fewer false positives under bursty latency than a static timeout.
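To make the suspicion level concrete, here is a minimal sketch of a phi calculation. It is not PhiAccrualFailureDetector's code, and the statistical model LoomCache uses is not stated in this page. The sketch assumes exponentially distributed inter-arrival times, for which Φ = elapsed / (mean × ln 10); PhiSketch and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical phi-accrual sketch using an exponential inter-arrival model (an assumption). */
final class PhiSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int windowSize;
    private long intervalSum;
    private long lastHeartbeatMs = -1;

    PhiSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat arrival and updates the sliding window of intervals. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervals.addLast(interval);
            intervalSum += interval;
            if (intervals.size() > windowSize) {
                intervalSum -= intervals.removeFirst();
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /**
     * Phi = -log10(P(heartbeat arrives later than now)).
     * With an exponential model P = exp(-elapsed / mean), so phi = elapsed / (mean * ln 10).
     */
    double phi(long nowMs) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough history to suspect anything
        }
        double mean = (double) intervalSum / intervals.size();
        if (mean <= 0.0) {
            return 0.0;
        }
        double elapsed = Math.max(0, nowMs - lastHeartbeatMs);
        return elapsed / (mean * Math.log(10.0));
    }

    boolean isSuspect(long nowMs, double threshold) {
        return phi(nowMs) >= threshold;
    }
}
```

With a steady one-second heartbeat, this model reaches a threshold of 8.0 after roughly 18.4 seconds of silence. If heartbeats normally arrive further apart, the same threshold tolerates a proportionally longer gap, which is the adaptive behaviour a static timeout lacks.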

LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .nearCacheEnabled(true)
    .nearCacheTtl(Duration.ofSeconds(30))
    .nearCacheMaxSize(10_000)
    .build();

During brief outages, recently accessed keys stay available locally. Server-push invalidation (NEAR_CACHE_INVALIDATE) keeps the cache fresh under normal conditions; InvalidationSequenceTracker discards late or out-of-order messages.
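One way to drop late or out-of-order invalidations is to remember the highest sequence number seen per source and reject anything that does not advance it. The sketch below illustrates that idea only; it is not InvalidationSequenceTracker's code, and the per-source keying, the class name SequenceTrackerSketch, and the accept method are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of sequence-based filtering for invalidation messages. */
final class SequenceTrackerSketch {
    // Highest sequence number accepted so far, keyed by the sending node (assumed keying).
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    /**
     * Returns true if the message is newer than anything seen from this source
     * and should be applied; false if it is a duplicate or arrived out of order.
     */
    boolean accept(String sourceId, long sequence) {
        boolean[] accepted = {false};
        // compute() runs atomically per key, so concurrent callers cannot both win.
        lastSeen.compute(sourceId, (id, previous) -> {
            if (previous == null || sequence > previous) {
                accepted[0] = true;
                return sequence;
            }
            return previous;
        });
        return accepted[0];
    }
}
```

For example, after accepting sequences 1 and 3 from one node, a delayed message carrying sequence 2 is rejected, while sequence 1 from a different node is still accepted.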

  • BackpressureController gates the command queue; when full, clients get RESPONSE_SERVER_BUSY and back off.
  • RateLimiter (loom-server/.../network) and PerClientRateLimiter (loom-server/.../ratelimit) cap global and per-client request rates.
  • ConnectionHealthMonitor sends TOO_MANY_CONNECTIONS to IPs exceeding their limit.
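The queue-gating behaviour in the first bullet can be illustrated with a bounded queue whose non-blocking offer either admits a command or reports that the server is busy. This is a sketch under assumptions, not BackpressureController itself: BackpressureSketch, Admission, and the single-queue design are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of admission control with a bounded command queue. */
final class BackpressureSketch {
    enum Admission { ACCEPTED, SERVER_BUSY }

    private final BlockingQueue<Runnable> queue;

    BackpressureSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks: a full queue is reported to the caller instead of stalling the I/O thread. */
    Admission submit(Runnable command) {
        return queue.offer(command) ? Admission.ACCEPTED : Admission.SERVER_BUSY;
    }

    /** Runs the oldest queued command, if any; returns false when the queue is empty. */
    boolean runNext() {
        Runnable next = queue.poll();
        if (next == null) {
            return false;
        }
        next.run();
        return true;
    }
}
```

Rejecting at the queue boundary keeps memory bounded and pushes the wait back to clients, where the retry policy already knows how to back off.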

Use this one-pager during incidents. It describes the expected steady-state behavior; if observed behavior differs, preserve logs and WAL/snapshot files before restarting nodes.

| Condition | What clients see | Expected data behavior | Operator action | Metrics and evidence |
| --- | --- | --- | --- | --- |
| One follower lost in a 3-node cluster (quorum-1) | Writes and linearizable reads continue through the leader; some clients may reconnect or receive redirects. | Majority quorum remains, acknowledged writes stay durable, but there is no spare failure budget until the member returns. | Restore or replace the member from the backup/restore runbook before doing maintenance on another node. | Cluster status/member liveness, leader term, loomcache.raft.follower_lag_ms, connection errors. |
| Quorum lost | Mutating commands fail or time out with quorum/unavailable errors; clients back off and retry until retry budget is exhausted. | No new writes are committed. This is fail-closed to avoid split-brain. Already acknowledged writes remain in the committed log. | Restore enough members to regain majority. If machines are gone, follow the quorum-loss restore path in the persistence guide. | Missing leader or leader stepping down, quorum-unavailable logs, stagnant commit index. |
| Network partition, majority/minority split | Majority side continues; minority side cannot commit writes and may redirect, fail requests, or lose leader lease. | Raft pre-vote and majority commit prevent the minority from acknowledging conflicting writes. | Route clients to the majority side, repair networking, then let followers catch up before declaring recovery. | Membership view mismatch, phi suspicion, leader changes, follower lag, client redirect/error spikes. |
| Leader crash or controlled handoff | Brief write pause; clients retry after RESPONSE_REDIRECT once a new leader is visible. | Committed entries survive; uncommitted in-flight operations may need client retry/idempotency. | Avoid manual restarts during election. Confirm a single leader and stable term before resuming maintenance. | loomcache.raft.elections.total, current term, commit index movement, client retry counters. |
| Snapshot install to a lagging follower or new member | Cluster stays online if quorum is healthy; the installing member may be slow or unavailable for local reads. | Leader continues committing with the remaining quorum. The follower applies the snapshot, then resumes log replication. | Do not remove additional voters while install is active. Watch duration and disk throughput; investigate if install stalls. | loomcache.raft.snapshot_install_seconds, snapshot load/save logs, follower lag and match index. |
| Development partition migration or member add/remove | In non-production sharded certification runs, normal reads/writes may continue while P99 rises. Production rejects migration opcodes and migration ACKs until target Raft/WAL durable ACK and consensus ownership cutover are implemented. | Certification migrations stream data and deduplicate chunks; the current ACK is not a target durability proof. Production stays fail-closed instead of moving ownership. | Keep production on the single-group path. Use NO_MIGRATION for maintenance, and do not enable sharding for production traffic. | loomcache.partition.migration.duration, .bytes, .keys, loomcache.partition.backup_promotion.count, dashboard topology. |
| Cluster state NO_MIGRATION | Reads and writes continue; ownership movement is paused. | Existing owners continue serving. New topology changes do not rebalance data until migration resumes. | Use for short maintenance windows only. Return to ACTIVE after maintenance so backup placement can heal. | Operational state in management topology, migration counters idle while membership changed. |
| Cluster state PASSIVE | Writes are rejected; read-only checks may still be used for validation. | No new mutating commands should be accepted. | Use for controlled shutdown or recovery validation. Return to ACTIVE only after confirming quorum and persistence health. | Rejection logs mentioning PASSIVE, topology operational state. |
| Sustained backpressure or rate limiting | Clients receive RESPONSE_SERVER_BUSY, retry with jitter, or see request timeouts after their retry budget. | Safety is unchanged; latency SLO no longer applies until queues drain. | Reduce client QPS, add capacity, inspect slow-operation and WAL fsync metrics, and avoid raising queues without heap headroom. | loomcache.command.queue_wait_ns, server-busy responses, slow-operation detector records, WAL fsync logs. |
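The quorum rows above follow from standard majority arithmetic, which the sketch below spells out. QuorumSketch and its method names are hypothetical helpers for illustration and are not part of LoomCache.

```java
/** Hypothetical helper showing majority-quorum arithmetic for a Raft group. */
final class QuorumSketch {
    /** Smallest number of voters that forms a majority. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** How many voters can be lost from a full group while a majority remains. */
    static int failuresTolerated(int voters) {
        return voters - majority(voters);
    }

    /** True when the reachable voters can still commit writes. */
    static boolean hasQuorum(int voters, int reachable) {
        return reachable >= majority(voters);
    }

    /** Additional failures the group can absorb right now; negative means quorum is already lost. */
    static int spareBudget(int voters, int reachable) {
        return reachable - majority(voters);
    }
}
```

For a 3-node cluster the majority is 2, so losing one follower leaves quorum intact with a spare budget of 0, which matches the first row: the cluster keeps committing, but a second failure would stop writes.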

CompositeDiscovery falls through multiple strategies — StaticDiscovery, DnsDiscovery, MulticastDiscovery, EnvironmentDiscovery, FileBasedDiscovery — with DiscoveryHealthChecker periodically probing cached addresses. Discovery caches survive transient DNS failures.
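The fall-through behaviour can be sketched as trying each strategy in order and keeping the last good answer for when every strategy fails. This is an illustration under assumptions, not CompositeDiscovery's code: the class CompositeDiscoverySketch, the use of Supplier as the strategy type, and the caching rule are all hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of ordered discovery with a last-known-good cache. */
final class CompositeDiscoverySketch {
    private final List<Supplier<List<String>>> strategies;
    private volatile List<String> lastKnownGood = List.of();

    CompositeDiscoverySketch(List<Supplier<List<String>>> strategies) {
        this.strategies = List.copyOf(strategies);
    }

    /** Returns addresses from the first strategy that yields any, else the cached result. */
    List<String> discover() {
        for (Supplier<List<String>> strategy : strategies) {
            try {
                List<String> found = strategy.get();
                if (found != null && !found.isEmpty()) {
                    lastKnownGood = List.copyOf(found);
                    return lastKnownGood;
                }
            } catch (RuntimeException e) {
                // A failing strategy (for example a DNS error) must not stop the chain.
            }
        }
        // Every strategy failed or returned nothing: serve the cached addresses.
        return lastKnownGood;
    }
}
```

Serving the cached list when all strategies fail is what lets a client keep reconnecting to known members through a short DNS outage.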

  • Pre-vote — prevents partitioned nodes from disrupting a stable leader with stale terms.
  • Leader lease — the leader can answer linearizable reads without a quorum round-trip as long as the lease holds.
  • Snapshot + WAL — crash recovery loads the latest snapshot and replays the WAL past its index.
  • Joint-consensus membership changes — add or remove servers through CONFIG_ADD_SERVER / CONFIG_REMOVE_SERVER without downtime.
  • Graceful shutdown: GracefulShutdownCoordinator drains work before handing off leadership via RAFT_TIMEOUT_NOW.
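The snapshot-plus-WAL recovery step in the list above can be sketched as: start from the snapshot's state, then apply only the log entries whose index is greater than the snapshot index. The types below (RecoverySketch, WalEntry, a string key-value state, null as a delete marker) are hypothetical and stand in for LoomCache's real on-disk formats, which this page does not describe.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of crash recovery from a snapshot plus a write-ahead log. */
final class RecoverySketch {
    static final class WalEntry {
        final long index;
        final String key;
        final String value; // null marks a delete (an assumption of this sketch)

        WalEntry(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    /** Rebuilds state by replaying WAL entries that are newer than the snapshot. */
    static Map<String, String> recover(Map<String, String> snapshotState,
                                       long snapshotIndex,
                                       List<WalEntry> wal) {
        Map<String, String> state = new HashMap<>(snapshotState);
        for (WalEntry entry : wal) {
            if (entry.index <= snapshotIndex) {
                continue; // already reflected in the snapshot
            }
            if (entry.value == null) {
                state.remove(entry.key);
            } else {
                state.put(entry.key, entry.value);
            }
        }
        return state;
    }
}
```

Skipping entries at or below the snapshot index keeps replay idempotent: the same WAL can be replayed against any snapshot taken along the way and arrive at the same state.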