Resilience & Fault Tolerance
Distributed systems fail. LoomCache’s resilience story is built on four pillars: transparent client retries, Raft-level safety, phi-accrual failure detection, and client-side caching that can ride out brief outages.
Client retry with exponential backoff
The LoomClient handles transient failures through RetryPolicy:
- Exponential backoff starting at 100 ms, capped at 5 s, with ±25 % jitter.
- maxRetries defaults to 3 (configurable).
- RESPONSE_REDIRECT responses update LeaderTracker transparently, so the retry goes to the new leader.
- RESPONSE_SERVER_BUSY triggers backoff without counting against the retry budget.
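The backoff rule above can be sketched in a few lines. This is a minimal illustration, not LoomCache's RetryPolicy: the class `BackoffSketch` and its method names are hypothetical, and it assumes jitter is applied after the cap, which the text does not specify.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper illustrating capped exponential backoff with jitter. */
final class BackoffSketch {
    static final long BASE_MS = 100;    // starting delay, from the text
    static final long CAP_MS = 5_000;   // upper bound, from the text
    static final double JITTER = 0.25;  // +/-25 % jitter, from the text

    /** Un-jittered delay before retry number `attempt` (0-based). */
    static long rawDelayMs(int attempt) {
        // Clamp the shift so the doubling cannot overflow a long.
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(CAP_MS, BASE_MS << shift);
    }

    /**
     * Applies jitter to the capped delay. `unit` is a random value in [0, 1).
     * Assumption: jitter is applied after the cap, so a delay may exceed 5 s by up to 25 %.
     */
    static long jitteredDelayMs(int attempt, double unit) {
        double factor = 1.0 - JITTER + 2 * JITTER * unit; // in [0.75, 1.25)
        return Math.round(rawDelayMs(attempt) * factor);
    }

    /** Convenience wrapper that draws the random value itself. */
    static Duration nextDelay(int attempt) {
        double unit = ThreadLocalRandom.current().nextDouble();
        return Duration.ofMillis(jitteredDelayMs(attempt, unit));
    }
}
```

Passing the random value in as a parameter keeps the arithmetic testable; a caller would normally use `BackoffSketch.nextDelay(attempt)` and sleep for the returned duration.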
```java
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .maxRetries(3)
    .retryBaseDelay(Duration.ofMillis(100))
    .build();
```

Smart routing & failover
- Keys are hashed to probable owners, so the client sends directly to the owner rather than bouncing off a proxy.
- For linearizable operations, the client routes to the current Raft leader via LeaderTracker.
- If the leader changed, the server responds with RESPONSE_REDIRECT; the client updates LeaderTracker and retries.
- Connection pools are per-node LIFO with idle eviction; ConnectionHealthMonitor closes stuck sockets.
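The redirect-and-retry flow described above might look roughly like the following. Everything here is illustrative: `RedirectSketch`, `Reply`, and the `send` function are hypothetical stand-ins, not LoomCache's LeaderTracker or wire protocol, and backoff between attempts is omitted for brevity.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical sketch of a cached-leader lookup that follows redirects. */
final class RedirectSketch {
    /** Hypothetical reply: either a value, or a redirect naming the new leader. */
    static final class Reply {
        final String value;       // set for a normal response
        final String redirectTo;  // set for a redirect-style response

        private Reply(String value, String redirectTo) {
            this.value = value;
            this.redirectTo = redirectTo;
        }

        static Reply ok(String value) { return new Reply(value, null); }
        static Reply redirect(String leader) { return new Reply(null, leader); }
    }

    private final AtomicReference<String> cachedLeader;

    RedirectSketch(String initialLeader) {
        this.cachedLeader = new AtomicReference<>(initialLeader);
    }

    String currentLeader() {
        return cachedLeader.get();
    }

    /**
     * Sends to the cached leader. On a redirect, records the new leader and
     * tries again, up to maxRetries additional attempts.
     */
    String call(Function<String, Reply> send, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Reply reply = send.apply(cachedLeader.get());
            if (reply.redirectTo == null) {
                return reply.value;
            }
            cachedLeader.set(reply.redirectTo); // next attempt targets the new leader
        }
        throw new IllegalStateException("retry budget exhausted");
    }
}
```

Keeping the leader in an `AtomicReference` lets concurrent callers share one view of the leader without locking.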
Phi-accrual failure detection
Phi-Accrual Failure Detector: dynamic probability (Φ) instead of static timeouts
Instead of a fixed heartbeat timeout, LoomCache uses PhiAccrualFailureDetector to compute a suspicion level from
observed heartbeat inter-arrival times. Higher phi → more suspicion → node eventually marked down. This adapts to
changing network conditions — fewer false positives under bursty latency.
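To make the idea concrete, here is a simplified sketch of a phi calculation. It is not LoomCache's PhiAccrualFailureDetector: the class `PhiSketch` is hypothetical, and it assumes heartbeat intervals follow an exponential distribution, so that phi = elapsed / (mean × ln 10). A real detector may use a different distribution and window policy.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical phi estimator over a sliding window of heartbeat intervals. */
final class PhiSketch {
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMs = -1;
    private long sumMs = 0;

    PhiSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat arrival and updates the sliding window. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            sumMs += interval;
            if (intervalsMs.size() > windowSize) {
                sumMs -= intervalsMs.removeFirst(); // drop the oldest sample
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /**
     * Suspicion level: -log10 of the probability that a heartbeat is still
     * outstanding after the elapsed time, under the exponential assumption.
     */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0; // not enough history to suspect anything
        }
        double mean = Math.max(1.0, (double) sumMs / intervalsMs.size());
        double elapsed = Math.max(0, nowMs - lastHeartbeatMs);
        return elapsed / (mean * Math.log(10.0));
    }
}
```

Because the mean tracks observed intervals, the same silence produces a lower phi on a slow network than on a fast one, which is the adaptive behavior the text describes. A caller would compare `phi(now)` against a threshold of its choosing.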
Near cache as a soft buffer
```java
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .nearCacheEnabled(true)
    .nearCacheTtl(Duration.ofSeconds(30))
    .nearCacheMaxSize(10_000)
    .build();
```

During brief outages, recently accessed keys stay available locally. Server-push invalidation (NEAR_CACHE_INVALIDATE) keeps the cache fresh under normal conditions; InvalidationSequenceTracker discards late or out-of-order messages.
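One way to discard late or out-of-order invalidations is to remember the highest sequence number seen per source and ignore anything at or below it. The sketch below assumes each invalidation carries a source identifier and a monotonically increasing sequence number; `SequenceFilterSketch` and that message shape are hypothetical, not the actual InvalidationSequenceTracker.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical filter that accepts only strictly newer sequence numbers per source. */
final class SequenceFilterSketch {
    private final ConcurrentMap<String, Long> lastSeen = new ConcurrentHashMap<>();

    /**
     * Returns true if `sequence` is newer than anything seen from `source`,
     * and records it. Duplicates and older messages return false.
     */
    boolean accept(String source, long sequence) {
        boolean[] accepted = {false};
        // compute() runs atomically per key, so concurrent callers cannot
        // both accept the same or an older sequence number.
        lastSeen.compute(source, (key, previous) -> {
            if (previous == null || sequence > previous) {
                accepted[0] = true;
                return sequence;
            }
            return previous;
        });
        return accepted[0];
    }
}
```

A near cache would evict the affected key only when `accept` returns true, so a delayed message cannot be mistaken for a fresh one.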
Server-side backpressure
- BackpressureController gates the command queue; when the queue is full, clients get RESPONSE_SERVER_BUSY and back off.
- RateLimiter (loom-server/.../network) and PerClientRateLimiter (loom-server/.../ratelimit) cap global and per-client request rates.
- ConnectionHealthMonitor sends TOO_MANY_CONNECTIONS to IPs exceeding their limit.
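The queue-gating behavior can be illustrated with a bounded queue whose non-blocking `offer` fails when the queue is full. This is a sketch under that assumption only: `BackpressureSketch` and its `Result` enum are hypothetical and do not reflect how BackpressureController is actually implemented.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical command gate: a full queue rejects instead of blocking. */
final class BackpressureSketch {
    enum Result { ACCEPTED, SERVER_BUSY }

    private final BlockingQueue<Runnable> queue;

    BackpressureSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Enqueues the command, or reports SERVER_BUSY without waiting. */
    Result submit(Runnable command) {
        return queue.offer(command) ? Result.ACCEPTED : Result.SERVER_BUSY;
    }

    /** Runs one queued command; returns false if the queue was empty. */
    boolean runNext() {
        Runnable next = queue.poll();
        if (next == null) {
            return false;
        }
        next.run();
        return true;
    }
}
```

Rejecting immediately keeps server threads from piling up behind a full queue and leaves pacing to the client's backoff.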
Operator degradation matrix
Use this one-pager during incidents. It describes the expected steady behavior; if observed behavior differs, preserve logs and WAL/snapshot files before restarting nodes.
| Condition | What clients see | Expected data behavior | Operator action | Metrics and evidence |
|---|---|---|---|---|
| One follower lost in a 3-node cluster (quorum-1) | Writes and linearizable reads continue through the leader; some clients may reconnect or receive redirects. | Majority quorum remains, acknowledged writes stay durable, but there is no spare failure budget until the member returns. | Restore or replace the member from the backup/restore runbook before doing maintenance on another node. | Cluster status/member liveness, leader term, loomcache.raft.follower_lag_ms, connection errors. |
| Quorum lost | Mutating commands fail or time out with quorum/unavailable errors; clients back off and retry until retry budget is exhausted. | No new writes are committed. This is fail-closed to avoid split-brain. Already acknowledged writes remain in the committed log. | Restore enough members to regain majority. If machines are gone, follow the quorum-loss restore path in the persistence guide. | Missing leader or leader stepping down, quorum-unavailable logs, stagnant commit index. |
| Network partition, majority/minority split | Majority side continues; minority side cannot commit writes and may redirect, fail requests, or lose leader lease. | Raft pre-vote and majority commit prevent the minority from acknowledging conflicting writes. | Route clients to the majority side, repair networking, then let followers catch up before declaring recovery. | Membership view mismatch, phi suspicion, leader changes, follower lag, client redirect/error spikes. |
| Leader crash or controlled handoff | Brief write pause; clients retry after RESPONSE_REDIRECT once a new leader is visible. | Committed entries survive; uncommitted in-flight operations may need client retry/idempotency. | Avoid manual restarts during election. Confirm a single leader and stable term before resuming maintenance. | loomcache.raft.elections.total, current term, commit index movement, client retry counters. |
| Snapshot install to a lagging follower or new member | Cluster stays online if quorum is healthy; the installing member may be slow or unavailable for local reads. | Leader continues committing with the remaining quorum. The follower applies the snapshot, then resumes log replication. | Do not remove additional voters while install is active. Watch duration and disk throughput; investigate if install stalls. | loomcache.raft.snapshot_install_seconds, snapshot load/save logs, follower lag and match index. |
| Development partition migration or member add/remove | In non-production sharded certification runs, normal reads/writes may continue while P99 rises. Production rejects migration opcodes and migration ACKs until target Raft/WAL durable ACK and consensus ownership cutover are implemented. | Certification migrations stream data and deduplicate chunks; the current ACK is not a target durability proof. Production stays fail-closed instead of moving ownership. | Keep production on the single-group path. Use NO_MIGRATION for maintenance, and do not enable sharding for production traffic. | loomcache.partition.migration.duration, .bytes, .keys, loomcache.partition.backup_promotion.count, dashboard topology. |
| Cluster state NO_MIGRATION | Reads and writes continue; ownership movement is paused. | Existing owners continue serving. New topology changes do not rebalance data until migration resumes. | Use for short maintenance windows only. Return to ACTIVE after maintenance so backup placement can heal. | Operational state in management topology, migration counters idle while membership changed. |
| Cluster state PASSIVE | Writes are rejected; read-only checks may still be used for validation. | No new mutating commands should be accepted. | Use for controlled shutdown or recovery validation. Return to ACTIVE only after confirming quorum and persistence health. | Rejection logs mentioning PASSIVE, topology operational state. |
| Sustained backpressure or rate limiting | Clients receive RESPONSE_SERVER_BUSY, retry with jitter, or see request timeouts after their retry budget. | Safety is unchanged; latency SLO no longer applies until queues drain. | Reduce client QPS, add capacity, inspect slow-operation and WAL fsync metrics, and avoid raising queues without heap headroom. | loomcache.command.queue_wait_ns, server-busy responses, slow-operation detector records, WAL fsync logs. |
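The quorum-related rows in the matrix follow from simple majority arithmetic, sketched here with a hypothetical `QuorumSketch` helper. It assumes every cluster member is a voter, which the matrix implies but does not state.

```java
/** Hypothetical helper for Raft-style majority arithmetic. */
final class QuorumSketch {
    /** Smallest number of voters that forms a majority. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** How many voters can be lost while a majority still exists. */
    static int tolerableFailures(int voters) {
        return voters - majority(voters);
    }

    /** Whether writes can commit with the given number of reachable voters. */
    static boolean canCommit(int voters, int reachable) {
        return reachable >= majority(voters);
    }
}
```

For a 3-node cluster, `majority(3)` is 2 and `tolerableFailures(3)` is 1, so losing one follower leaves the cluster writable with no remaining failure budget, and losing a second node means `canCommit(3, 1)` is false.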
Discovery resilience
CompositeDiscovery falls through multiple strategies (StaticDiscovery, DnsDiscovery, MulticastDiscovery, EnvironmentDiscovery, FileBasedDiscovery), with DiscoveryHealthChecker periodically probing cached addresses. Discovery caches survive transient DNS failures.
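A fall-through chain with a cached fallback could be structured as below. This is only a sketch: `CompositeDiscoverySketch` is hypothetical, each strategy is modeled as a plain `Supplier`, and it assumes the first strategy that returns a non-empty list wins, which the text does not confirm.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical discovery chain that falls back to the last known addresses. */
final class CompositeDiscoverySketch {
    private final List<Supplier<List<String>>> strategies;
    private volatile List<String> lastKnown = List.of();

    CompositeDiscoverySketch(List<Supplier<List<String>>> strategies) {
        this.strategies = List.copyOf(strategies);
    }

    /** Tries each strategy in order; returns cached addresses if all fail. */
    List<String> discover() {
        for (Supplier<List<String>> strategy : strategies) {
            try {
                List<String> found = strategy.get();
                if (found != null && !found.isEmpty()) {
                    lastKnown = List.copyOf(found); // refresh the cache
                    return lastKnown;
                }
            } catch (RuntimeException e) {
                // This strategy failed; fall through to the next one.
            }
        }
        return lastKnown; // every strategy failed, so serve the cached view
    }
}
```

Returning the cached list when every strategy fails is what lets a client keep its seed addresses through a short DNS outage.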
Raft foundations
- Pre-vote: prevents partitioned nodes from inflating their term and disrupting a stable leader when they rejoin.
- Leader lease: the leader can answer linearizable reads without a quorum round-trip as long as the lease holds.
- Snapshot + WAL: crash recovery loads the latest snapshot and replays the WAL past its index.
- Joint-consensus membership changes: add or remove servers through CONFIG_ADD_SERVER / CONFIG_REMOVE_SERVER without downtime.
- Graceful shutdown: GracefulShutdownCoordinator drains work before handing off leadership via RAFT_TIMEOUT_NOW.
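The snapshot-plus-WAL recovery step above can be sketched as follows. The types are hypothetical: `RecoverySketch.Entry` models a log record as an index plus a key and value (a null value standing for a delete), which is an assumption and not LoomCache's on-disk format.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical crash-recovery routine: snapshot first, then newer WAL entries. */
final class RecoverySketch {
    /** Hypothetical WAL record; a null value means the key was deleted. */
    static final class Entry {
        final long index;
        final String key;
        final String value;

        Entry(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Rebuilds state from the snapshot, then replays only the WAL entries
     * whose index is greater than the snapshot's last included index.
     */
    static Map<String, String> recover(Map<String, String> snapshotState,
                                       long snapshotIndex,
                                       List<Entry> wal) {
        Map<String, String> state = new HashMap<>(snapshotState);
        for (Entry entry : wal) {
            if (entry.index <= snapshotIndex) {
                continue; // already reflected in the snapshot
            }
            if (entry.value == null) {
                state.remove(entry.key);
            } else {
                state.put(entry.key, entry.value);
            }
        }
        return state;
    }
}
```

Skipping entries at or below the snapshot index is what makes replay safe when the WAL still contains records that the snapshot already covers.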