Resilience & Fault Tolerance

In distributed systems, failures are guaranteed. Networks drop packets, JVMs pause for garbage collection, and entire availability zones go offline. LoomCache is designed with active and passive resilience mechanisms to survive the chaos.

Resilience Circuit Breaker

LoomCache automatically isolates failing partitions. Connection pools fail fast to prevent cascading timeouts across the cluster.

  • CLOSED (Normal): traffic flows freely to the node.
  • OPEN (Failing): node isolated; requests immediately fail fast.
  • HALF-OPEN (Recovery): node health tested with limited probe traffic.

Phi-Accrual Failure Detector

Dynamic suspicion probability (Φ) instead of static timeouts.

[Interactive diagram: an observer node tracks a target node's heartbeats, plotting the suspicion level Φ against the threshold (8.0); the target reads ALIVE while the network is healthy.]

When the cluster detects that a node is not responding to heartbeats (using a Phi-Accrual failure detector), the connection pool actively trips the circuit breaker for that specific route.

LoomCache implements per-peer circuit breaker isolation, meaning a single failing node cannot cascade failures to healthy routes. Each peer connection has its own independent breaker with three states:

| State | Behavior |
|---|---|
| CLOSED | Normal operation. Failures are counted. |
| OPEN | All requests instantly rejected. Recovery timer starts. |
| HALF_OPEN | Limited probe requests (default: 3). Success → CLOSED, failure → OPEN. |

```java
CircuitBreakerConfig.builder()
    .enabled(true)
    .failureThreshold(5)       // 5 consecutive failures → OPEN
    .recoveryTimeoutMs(30000)  // 30 s before HALF_OPEN
    .halfOpenMaxRequests(3)    // 3 probes in HALF_OPEN
    .build();
```

This mechanism provides critical fail-fast semantics. Instead of thread pools saturating while waiting on TCP timeouts for a dead node, the LoomClient instantly rejects the request, protecting the upstream application.
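The three-state transition logic described above can be sketched as a minimal state machine. The class and method names here are illustrative, not the actual LoomCache internals:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal per-peer circuit breaker sketch; names are illustrative.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;      // consecutive failures before OPEN
    private final Duration recoveryTimeout;  // wait before probing (HALF_OPEN)
    private final int halfOpenMaxRequests;   // successful probes needed to close

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int halfOpenSuccesses = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration recoveryTimeout, int halfOpenMaxRequests) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
        this.halfOpenMaxRequests = halfOpenMaxRequests;
    }

    /** Returns true if a request may be sent to this peer. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(recoveryTimeout))) {
            state = State.HALF_OPEN;       // timer expired: start probing
            halfOpenSuccesses = 0;
        }
        return state != State.OPEN;        // OPEN rejects instantly (fail fast)
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= halfOpenMaxRequests) {
            state = State.CLOSED;          // enough successful probes: recover
        }
        consecutiveFailures = 0;
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;            // trip: isolate this peer only
            openedAt = Instant.now();
        }
    }

    synchronized State state() { return state; }
}
```

Because each peer gets its own instance, tripping one breaker leaves routes to healthy nodes untouched.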

LoomCache uses Phi-Accrual Failure Detectors (inspired by Akka) instead of simple heartbeat timeouts. The detector tracks inter-arrival times of heartbeats and computes a suspicion level (φ) based on statistical deviation:

  • φ < threshold: Node is healthy
  • φ ≥ threshold: Node is suspected failed

This approach adapts automatically to network conditions — fewer false positives in high-latency environments (WAN/cloud) compared to fixed-timeout detection.
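The φ value is essentially the negative base-10 logarithm of the probability that a heartbeat this late would still arrive, given the observed inter-arrival distribution. A sketch of that calculation, using the logistic approximation of the normal CDF popularized by Akka's implementation (this is not LoomCache's internal code, and the class name is hypothetical):

```java
// Sketch of the phi calculation over a normal inter-arrival model.
class PhiAccrual {
    private final double mean;   // mean heartbeat inter-arrival time (ms)
    private final double stdDev; // standard deviation of inter-arrivals (ms)

    PhiAccrual(double mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Suspicion level for a given time since the last heartbeat. */
    double phi(double timeSinceLastMs) {
        double y = (timeSinceLastMs - mean) / stdDev;
        // Logistic approximation of the normal upper tail P(X > t).
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = (timeSinceLastMs > mean) ? e / (1.0 + e)
                                                 : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(pLater);
    }
}
```

With a 1 s mean and 100 ms deviation, a heartbeat that is right on time yields φ ≈ 0.3, while one a full second late drives φ far past the 8.0 threshold, so noisy-but-alive links score low and truly silent nodes score high.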

| Setting | Default | Description |
|---|---|---|
| phiAccrualThreshold | 8.0 | Suspicion threshold. Higher = fewer false positives. |
| enableAdaptiveThreshold | true | Auto-adjust for observed latency variance. |

Tuning: LAN environments use threshold 8.0 with adaptive disabled. WAN/cloud deployments benefit from 8.5–9.0 with adaptive enabled.

The LoomClient handles transient failures transparently:

  • Exponential backoff: 100ms base → 5s cap with ±25% jitter
  • Max retries: 3 (configurable)
  • Leader election handling: Retries every 100ms for up to 15 seconds
  • RESPONSE_REDIRECT: Cached leader address updated transparently

```java
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .maxRetries(3)
    .retryBaseDelay(Duration.ofMillis(100))
    .build();
```
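The documented backoff schedule (100 ms base, 5 s cap, ±25% jitter) can be sketched as follows; this is an illustration of the stated parameters, not LoomClient's internal code:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative backoff schedule: exponential growth from a 100 ms base,
// capped at 5 s, with ±25% jitter to avoid synchronized retry storms.
class RetryBackoff {
    static final long BASE_MS = 100;
    static final long CAP_MS = 5_000;

    /** Delay before retry attempt n (0-based), with jitter applied. */
    static long delayMs(int attempt) {
        long exp = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 30)); // 100, 200, 400, ...
        double jitter = 0.75 + ThreadLocalRandom.current().nextDouble() * 0.5; // 0.75..1.25
        return (long) (exp * jitter);
    }
}
```

Jitter matters here: without it, every client that observed the same failure would retry at the same instant and re-overload the recovering node.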

While a circuit breaker is OPEN, LoomCache runs background health probes (the HALF_OPEN state). Once the node comes back online and stabilizes, traffic is seamlessly allowed back through the primary connection pool routes.

The connection pool itself uses LIFO ordering per node (5–20 connections default) with idle eviction after 5 minutes, ensuring recently-verified connections are reused first.
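LIFO reuse with idle eviction can be sketched with a deque: returned connections are pushed onto the head, checkouts pop from the head, and stale entries are trimmed from the tail. This is a simplified illustration, not the actual pool implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of LIFO connection reuse with idle eviction; types are illustrative.
class LifoPool<C> {
    record Entry<T>(T conn, Instant returnedAt) {}

    private final Deque<Entry<C>> idle = new ArrayDeque<>();
    private final Duration maxIdle;

    LifoPool(Duration maxIdle) { this.maxIdle = maxIdle; }

    synchronized void release(C conn) {
        idle.push(new Entry<>(conn, Instant.now())); // most recent on top
    }

    /** Hands out the most recently returned connection, evicting stale ones. */
    synchronized C acquire() {
        // Entries deeper in the stack are older; evict from the bottom.
        while (!idle.isEmpty() &&
               Duration.between(idle.peekLast().returnedAt(), Instant.now())
                       .compareTo(maxIdle) > 0) {
            idle.pollLast(); // idle too long: drop (closing omitted here)
        }
        Entry<C> e = idle.poll();
        return e == null ? null : e.conn(); // null → caller opens a new connection
    }
}
```

The LIFO choice means the connection handed out was the one most recently proven healthy, while long-idle connections, which are the most likely to have been silently dropped by firewalls or NAT, age out at the bottom.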

The discovery subsystem includes multiple fault-tolerance features:

  • Retry backoff: Exponential backoff (100ms → 500ms) for DNS/discovery failures
  • TTL-aware caching: Retain discovered addresses during transient DNS outages
  • Health checking: Background TCP connectivity probes on cached addresses (every 30s)
  • Composite discovery: Fall through multiple strategies (static → DNS → multicast)
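The composite fall-through behavior can be sketched as trying each strategy in order and returning the first non-empty result. Interface and class names here are hypothetical, not the actual discovery SPI:

```java
import java.util.List;

// Illustrative composite discovery: try strategies in order
// (static → DNS → multicast) and use the first that yields addresses.
interface Discovery {
    List<String> discover();
}

class CompositeDiscovery implements Discovery {
    private final List<Discovery> strategies;

    CompositeDiscovery(List<Discovery> strategies) { this.strategies = strategies; }

    @Override
    public List<String> discover() {
        for (Discovery d : strategies) {
            try {
                List<String> addrs = d.discover();
                if (!addrs.isEmpty()) return addrs; // first non-empty wins
            } catch (RuntimeException e) {
                // Strategy failed (e.g. transient DNS outage): fall through.
            }
        }
        return List.of();
    }
}
```

A failing or empty strategy never aborts discovery outright; it simply yields to the next one in the chain.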

The Raft consensus layer itself provides foundational resilience:

  • Pre-vote: Prevents partitioned nodes from disrupting stable clusters with stale terms
  • Leader lease: Serves linearizable reads without quorum round-trips
  • Snapshot + WAL replay: Full crash recovery from disk
  • Dynamic membership: Add/remove nodes without downtime via configuration changes
  • Graceful shutdown: 8-phase drain with leadership transfer via TimeoutNow RPC