Resilience & Fault Tolerance
In distributed systems, failures are inevitable. Networks drop packets, JVMs pause for garbage collection, and entire availability zones go offline. LoomCache is designed with active and passive resilience mechanisms to survive the chaos.
Resilience Circuit Breaker
LoomCache automatically isolates failing partitions. Connection pools fail fast to prevent cascading timeouts across the cluster.
- CLOSED: traffic flows freely to the node.
- OPEN: the node is isolated; requests immediately fail fast.
- HALF_OPEN: node health is tested with limited probe traffic.
Per-Peer Circuit Breakers
When the cluster detects that a node is not responding to heartbeats (using a Phi-Accrual failure detector), the connection pool actively trips the circuit breaker for that specific route.
LoomCache implements per-peer circuit breaker isolation, meaning a single failing node cannot cascade failures to healthy routes. Each peer connection has its own independent breaker with three states:
| State | Behavior |
|---|---|
| CLOSED | Normal operation. Failures counted. |
| OPEN | All requests instantly rejected. Timer starts. |
| HALF_OPEN | Limited probe requests (default: 3). Success → CLOSED, failure → OPEN. |
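The state table above can be sketched as a small per-peer state machine. The class and method names below (`PeerCircuitBreaker`, `allowRequest`, `onFailure`) are illustrative, not LoomCache's actual API:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative per-peer breaker sketch; not LoomCache's internal classes.
public class PeerCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration recoveryTimeout;
    private final int halfOpenMaxRequests;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int halfOpenProbes = 0;
    private Instant openedAt;

    public PeerCircuitBreaker(int failureThreshold, Duration recoveryTimeout,
                              int halfOpenMaxRequests) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
        this.halfOpenMaxRequests = halfOpenMaxRequests;
    }

    /** True if a request may be sent to this peer right now. */
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            // After the recovery timeout, let limited probe traffic through.
            if (Instant.now().isAfter(openedAt.plus(recoveryTimeout))) {
                state = State.HALF_OPEN;
                halfOpenProbes = 0;
            } else {
                return false; // fail fast while OPEN
            }
        }
        if (state == State.HALF_OPEN) {
            return halfOpenProbes++ < halfOpenMaxRequests;
        }
        return true;
    }

    public synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void onFailure() {
        // A HALF_OPEN probe failure reopens immediately; otherwise count
        // consecutive failures until the threshold trips the breaker.
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
            consecutiveFailures = 0;
        }
    }
}
```

Because each peer route owns its own breaker instance, a dead node trips only its own breaker while healthy routes keep serving.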
Configuration
```java
CircuitBreakerConfig.builder()
    .enabled(true)
    .failureThreshold(5)       // 5 consecutive failures → OPEN
    .recoveryTimeoutMs(30000)  // 30s before HALF_OPEN
    .halfOpenMaxRequests(3)    // 3 probes in HALF_OPEN
    .build();
```

This mechanism provides critical fail-fast semantics. Instead of thread pools saturating while waiting on TCP timeouts for a dead node, the LoomClient instantly rejects the request, protecting the upstream application.
Phi-Accrual Failure Detection
LoomCache uses Phi-Accrual Failure Detectors (inspired by Akka) instead of simple heartbeat timeouts. The detector tracks inter-arrival times of heartbeats and computes a suspicion level (φ) based on statistical deviation:
- φ < threshold: Node is healthy
- φ ≥ threshold: Node is suspected failed
This approach adapts automatically to network conditions — fewer false positives in high-latency environments (WAN/cloud) compared to fixed-timeout detection.
| Setting | Default | Description |
|---|---|---|
| `phiAccrualThreshold` | 8.0 | Suspicion threshold. Higher = fewer false positives. |
| `enableAdaptiveThreshold` | true | Auto-adjust for observed latency variance. |
Tuning: LAN environments typically work well with the default threshold of 8.0 and adaptive adjustment disabled; WAN/cloud deployments benefit from 8.5–9.0 with adaptive enabled.
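How φ is derived can be illustrated with the normal-CDF approximation used by Akka-style phi-accrual detectors. This is a standalone sketch under the assumption of normally distributed heartbeat intervals, not LoomCache's internal code:

```java
// Illustrative phi computation for a phi-accrual detector.
// phi = -log10(P(a heartbeat arrives even later than it already is)),
// with the normal CDF approximated by a logistic function.
public class PhiCalculator {
    public static double phi(double timeSinceLastMs, double meanMs, double stdDevMs) {
        double y = (timeSinceLastMs - meanMs) / stdDevMs;
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = (timeSinceLastMs > meanMs)
                ? e / (1.0 + e)        // right tail: arrival is overdue
                : 1.0 - 1.0 / (1.0 + e); // left of the mean: very likely alive
        return -Math.log10(pLater);
    }
}
```

Intuitively, φ = 1 corresponds to roughly a 10% chance the silence is a false alarm, φ = 2 to about 1%, and so on, which is why a threshold of 8.0 makes false positives vanishingly rare while still adapting as the observed mean and variance shift.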
Client Retry with Exponential Backoff
The LoomClient handles transient failures transparently:
- Exponential backoff: 100ms base → 5s cap with ±25% jitter
- Max retries: 3 (configurable)
- Leader election handling: Retries every 100ms for up to 15 seconds
- RESPONSE_REDIRECT: Cached leader address updated transparently
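The documented backoff policy (100 ms base, 5 s cap, ±25% jitter) can be sketched as a pure delay calculation. The class and constant names below are illustrative, not part of the client API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative delay computation matching the documented policy:
// 100 ms base, doubled per attempt, capped at 5 s, with ±25% jitter.
public class RetryBackoff {
    static final long BASE_MS = 100;
    static final long CAP_MS = 5_000;

    public static long delayMs(int attempt) {
        // Exponential growth, capped; the shift is clamped to avoid overflow.
        long exp = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 20));
        // Uniform jitter in [0.75, 1.25) spreads out retry storms.
        double jitter = 0.75 + ThreadLocalRandom.current().nextDouble() * 0.5;
        return (long) (exp * jitter);
    }
}
```

The jitter matters as much as the cap: without it, many clients that failed at the same instant would retry in lockstep and hammer the recovering node in synchronized waves.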
```java
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .maxRetries(3)
    .retryBaseDelay(Duration.ofMillis(100))
    .build();
```

Connection Pool Recovery
While the circuit breaker is OPEN, LoomCache runs background health probes (the HALF_OPEN state). Once a node comes back online and stabilizes, traffic is seamlessly allowed back through the primary connection pool routes.
The connection pool itself uses LIFO ordering per node (5–20 connections default) with idle eviction after 5 minutes, ensuring recently-verified connections are reused first.
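The LIFO reuse and idle-eviction behavior can be sketched as follows; the type and method names here are assumptions for illustration, since LoomCache's internal pool types are not public API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative LIFO connection pool with idle eviction.
public class LifoPool<C> {
    private record Entry<C>(C conn, Instant returnedAt) {}

    private final Deque<Entry<C>> idle = new ArrayDeque<>();
    private final Duration idleTtl = Duration.ofMinutes(5);

    // LIFO: the most recently returned (and thus most recently verified)
    // connection is handed out first.
    public synchronized C acquire() {
        Entry<C> e = idle.pollFirst();
        return e == null ? null : e.conn();
    }

    public synchronized void release(C conn) {
        idle.addFirst(new Entry<>(conn, Instant.now()));
    }

    // Connections idle past the TTL accumulate at the cold end of the
    // deque, so eviction only ever inspects that end.
    public synchronized int evictIdle() {
        int evicted = 0;
        Instant cutoff = Instant.now().minus(idleTtl);
        while (!idle.isEmpty() && idle.peekLast().returnedAt().isBefore(cutoff)) {
            idle.pollLast();
            evicted++;
        }
        return evicted;
    }
}
```

LIFO ordering is the design choice that makes recovery smooth: a freshly probed connection sits at the top of the stack, while stale ones sink to the bottom and age out.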
Discovery Resilience
The discovery subsystem includes multiple fault-tolerance features:
- Retry backoff: Exponential backoff (100ms → 500ms) for DNS/discovery failures
- TTL-aware caching: Retain discovered addresses during transient DNS outages
- Health checking: Background TCP connectivity probes on cached addresses (every 30s)
- Composite discovery: Fall through multiple strategies (static → DNS → multicast)
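The composite fall-through in the last bullet can be sketched with a minimal strategy interface; the names here are illustrative, not LoomCache's discovery SPI:

```java
import java.util.List;

// Illustrative discovery strategy; returns host:port addresses,
// or an empty list when the strategy fails or finds nothing.
interface DiscoveryStrategy {
    List<String> discover();
}

class CompositeDiscovery implements DiscoveryStrategy {
    private final List<DiscoveryStrategy> strategies;

    CompositeDiscovery(List<DiscoveryStrategy> strategies) {
        this.strategies = strategies;
    }

    // Try each strategy in order (e.g. static -> DNS -> multicast)
    // and return the first non-empty result.
    @Override
    public List<String> discover() {
        for (DiscoveryStrategy s : strategies) {
            List<String> found = s.discover();
            if (!found.isEmpty()) {
                return found;
            }
        }
        return List.of();
    }
}
```

Ordering the strategies from cheapest and most deterministic (static seed lists) to broadest (multicast) keeps discovery fast in the common case while still tolerating a total DNS outage.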
Raft-Level Resilience
The Raft consensus layer itself provides foundational resilience:
- Pre-vote: Prevents partitioned nodes from disrupting stable clusters with stale terms
- Leader lease: Serves linearizable reads without quorum round-trips
- Snapshot + WAL replay: Full crash recovery from disk
- Dynamic membership: Add/remove nodes without downtime via configuration changes
- Graceful shutdown: 8-phase drain with leadership transfer via TimeoutNow RPC