Resilience & Fault Tolerance

In distributed systems, failures are guaranteed. Networks drop packets, JVMs pause for garbage collection, and entire availability zones go offline. LoomCache is designed with active and passive resilience mechanisms to survive the chaos.

Resilience Circuit Breaker

LoomCache automatically isolates failing partitions. Connection pools fail fast to prevent cascading timeouts across the cluster.

  • CLOSED (Normal): traffic flows freely to the node.
  • OPEN (Failing): node isolated; requests immediately fail fast.
  • HALF-OPEN (Recovery): node health tested with limited probe traffic.

Phi-Accrual Failure Detector

Dynamic suspicion probability (Φ) instead of static timeouts.

[Interactive diagram: an observer node tracks a target node's heartbeats, plotting the suspicion level Φ against the threshold (8.0); the target reads ALIVE while the network is healthy.]

When the cluster detects that a node is not responding to heartbeats (using a Phi-Accrual failure detector), the connection pool actively trips the circuit breaker for that specific route.

LoomCache implements per-peer circuit breaker isolation, meaning a single failing node cannot cascade failures to healthy routes. Each peer connection has its own independent breaker with three states:

| State | Behavior |
|---|---|
| CLOSED | Normal operation. Failures are counted. |
| OPEN | All requests instantly rejected. Recovery timer starts. |
| HALF_OPEN | Limited probe requests (default: 3). Success → CLOSED, failure → OPEN. |

```java
CircuitBreakerConfig.builder()
    .enabled(true)
    .failureThreshold(5)       // 5 consecutive failures → OPEN
    .recoveryTimeoutMs(30000)  // 30 s before HALF_OPEN
    .halfOpenMaxRequests(3)    // 3 probes in HALF_OPEN
    .build();
```

This mechanism provides critical fail-fast semantics. Instead of thread pools saturating while waiting on TCP timeouts for a dead node, the LoomClient instantly rejects the request, protecting the upstream application.
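The three-state transition logic described above can be sketched as a minimal state machine. The class and method names here are illustrative, not the actual LoomCache internals:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal per-peer circuit breaker sketch; names are illustrative.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;      // consecutive failures before OPEN
    private final Duration recoveryTimeout;  // wait before probing (HALF_OPEN)
    private final int halfOpenMaxRequests;   // successful probes needed to close

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int halfOpenSuccesses = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration recoveryTimeout, int halfOpenMaxRequests) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
        this.halfOpenMaxRequests = halfOpenMaxRequests;
    }

    /** Returns true if a request may be sent to this peer. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(recoveryTimeout))) {
            state = State.HALF_OPEN;       // timer expired: start probing
            halfOpenSuccesses = 0;
        }
        return state != State.OPEN;        // OPEN rejects instantly (fail fast)
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= halfOpenMaxRequests) {
            state = State.CLOSED;          // enough successful probes: recover
        }
        consecutiveFailures = 0;
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;            // trip: isolate this peer only
            openedAt = Instant.now();
        }
    }

    synchronized State state() { return state; }
}
```

Because each peer gets its own instance, tripping one breaker leaves routes to healthy nodes untouched.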

LoomCache uses Phi-Accrual Failure Detectors (inspired by Akka) instead of simple heartbeat timeouts. The detector tracks inter-arrival times of heartbeats and computes a suspicion level (φ) based on statistical deviation:

  • φ < threshold: Node is healthy
  • φ ≥ threshold: Node is suspected failed

This approach adapts automatically to network conditions — fewer false positives in high-latency environments (WAN/cloud) compared to fixed-timeout detection.
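The φ value is essentially the negative base-10 logarithm of the probability that a heartbeat this late would still arrive, given the observed inter-arrival distribution. A sketch of that calculation, using the logistic approximation of the normal CDF popularized by Akka's implementation (this is not LoomCache's internal code, and the class name is hypothetical):

```java
// Sketch of the phi calculation over a normal inter-arrival model.
class PhiAccrual {
    private final double mean;   // mean heartbeat inter-arrival time (ms)
    private final double stdDev; // standard deviation of inter-arrivals (ms)

    PhiAccrual(double mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Suspicion level for a given time since the last heartbeat. */
    double phi(double timeSinceLastMs) {
        double y = (timeSinceLastMs - mean) / stdDev;
        // Logistic approximation of the normal upper tail P(X > t).
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = (timeSinceLastMs > mean) ? e / (1.0 + e)
                                                 : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(pLater);
    }
}
```

With a 1 s mean and 100 ms deviation, a heartbeat that is right on time yields φ ≈ 0.3, while one a full second late drives φ far past the 8.0 threshold, so noisy-but-alive links score low and truly silent nodes score high.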

| Setting | Default | Description |
|---|---|---|
| phiAccrualThreshold | 8.0 | Suspicion threshold. Higher = fewer false positives. |
| enableAdaptiveThreshold | true | Auto-adjust for observed latency variance. |

Tuning: LAN environments use threshold 8.0 with adaptive disabled. WAN/cloud deployments benefit from 8.5–9.0 with adaptive enabled.

The LoomClient handles transient failures transparently:

  • Exponential backoff: 100ms base → 5s cap with ±25% jitter
  • Max retries: 3 (configurable)
  • Leader election handling: Retries every 100ms for up to 15 seconds
  • RESPONSE_REDIRECT: Cached leader address updated transparently

```java
LoomClient client = LoomClient.builder()
    .addSeed("127.0.0.1:5701")
    .maxRetries(3)
    .retryBaseDelay(Duration.ofMillis(100))
    .build();
```
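The documented backoff schedule (100 ms base, 5 s cap, ±25% jitter) can be sketched as follows; this is an illustration of the stated parameters, not LoomClient's internal code:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative backoff schedule: exponential growth from a 100 ms base,
// capped at 5 s, with ±25% jitter to avoid synchronized retry storms.
class RetryBackoff {
    static final long BASE_MS = 100;
    static final long CAP_MS = 5_000;

    /** Delay before retry attempt n (0-based), with jitter applied. */
    static long delayMs(int attempt) {
        long exp = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 30)); // 100, 200, 400, ...
        double jitter = 0.75 + ThreadLocalRandom.current().nextDouble() * 0.5; // 0.75..1.25
        return (long) (exp * jitter);
    }
}
```

Jitter matters here: without it, every client that observed the same failure would retry at the same instant and re-overload the recovering node.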

While a circuit breaker is OPEN, LoomCache runs background health probes (the HALF_OPEN state). Once the node comes back online and stabilizes, traffic is seamlessly allowed back through the primary connection pool routes.

The connection pool itself uses LIFO ordering per node (5–20 connections default) with idle eviction after 5 minutes, ensuring recently-verified connections are reused first.
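LIFO reuse with idle eviction can be sketched with a deque: returned connections are pushed onto the head, checkouts pop from the head, and stale entries are trimmed from the tail. This is a simplified illustration, not the actual pool implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of LIFO connection reuse with idle eviction; types are illustrative.
class LifoPool<C> {
    record Entry<T>(T conn, Instant returnedAt) {}

    private final Deque<Entry<C>> idle = new ArrayDeque<>();
    private final Duration maxIdle;

    LifoPool(Duration maxIdle) { this.maxIdle = maxIdle; }

    synchronized void release(C conn) {
        idle.push(new Entry<>(conn, Instant.now())); // most recent on top
    }

    /** Hands out the most recently returned connection, evicting stale ones. */
    synchronized C acquire() {
        // Entries deeper in the stack are older; evict from the bottom.
        while (!idle.isEmpty() &&
               Duration.between(idle.peekLast().returnedAt(), Instant.now())
                       .compareTo(maxIdle) > 0) {
            idle.pollLast(); // idle too long: drop (closing omitted here)
        }
        Entry<C> e = idle.poll();
        return e == null ? null : e.conn(); // null → caller opens a new connection
    }
}
```

The LIFO choice means the connection handed out was the one most recently proven healthy, while long-idle connections, which are the most likely to have been silently dropped by firewalls or NAT, age out at the bottom.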

The discovery subsystem includes multiple fault-tolerance features:

  • Retry backoff: Exponential backoff (100ms → 500ms) for DNS/discovery failures
  • TTL-aware caching: Retain discovered addresses during transient DNS outages
  • Health checking: Background TCP connectivity probes on cached addresses (every 30s)
  • Composite discovery: Fall through multiple strategies (static → DNS → multicast)
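The composite fall-through behavior can be sketched as trying each strategy in order and returning the first non-empty result. Interface and class names here are hypothetical, not the actual discovery SPI:

```java
import java.util.List;

// Illustrative composite discovery: try strategies in order
// (static → DNS → multicast) and use the first that yields addresses.
interface Discovery {
    List<String> discover();
}

class CompositeDiscovery implements Discovery {
    private final List<Discovery> strategies;

    CompositeDiscovery(List<Discovery> strategies) { this.strategies = strategies; }

    @Override
    public List<String> discover() {
        for (Discovery d : strategies) {
            try {
                List<String> addrs = d.discover();
                if (!addrs.isEmpty()) return addrs; // first non-empty wins
            } catch (RuntimeException e) {
                // Strategy failed (e.g. transient DNS outage): fall through.
            }
        }
        return List.of();
    }
}
```

A failing or empty strategy never aborts discovery outright; it simply yields to the next one in the chain.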

The Raft consensus layer itself provides foundational resilience:

  • Pre-vote: Prevents partitioned nodes from disrupting stable clusters with stale terms
  • Leader lease: Serves linearizable reads without quorum round-trips
  • Snapshot + WAL replay: Full crash recovery from disk
  • Dynamic membership: Add/remove nodes without downtime via configuration changes
  • Graceful shutdown: 8-phase drain with leadership transfer via TimeoutNow RPC