
Chaos Testing

LoomCache ships a self-contained Java chaos-testing framework under loom-server/src/test/java/com/loomcache/server/chaos. There is no external dependency or separate runtime — histories, checkers, and nemeses are all pure Java.

A run:

  1. Generates operations: concurrent clients drive reads/writes/CAS against the cluster (ChaosWorkload produces Register / Counter / Queue / Set / Mixed workloads; ChaosClient / ChaosRealClient drive them).
  2. Injects faults: ChaosNemesis is a sealed fault family covering symmetric partition, node isolation, message reorder, kill node / kill leader / pause node, clock skew, slow disk (message-send delay), memory pressure, and CPU contention, composable via ChaosNemesis.Combined.
  3. Records a history: ChaosHistory captures invocation and completion timestamps.
  4. Verifies consistency: WglLinearizabilityChecker (WGL = Wing-Gong-Lowe search) or the per-model ChaosChecker checkers validate the history.

  • ChaosTestHarness: wires workload + nemesis + cluster into a single executable test.
  • ChaosCluster / ChaosRealCluster: both start real CacheNode clusters (Raft + TCP) with fault hooks.
  • ChaosClient / ChaosRealClient: clients that drive the generated operations against the cluster.
  • ChaosWorkload: op generators for Register / Counter / Queue / Set / Mixed workloads.
  • ChaosNemesis: sealed fault family (see Methodology for the fault list).
  • ChaosHistory / ChaosReport: history and reporting.
  • ChaosChecker / WglLinearizabilityChecker: ChaosChecker exposes the per-model checkers; WglLinearizabilityChecker is the WGL (Wing-Gong-Lowe) search over register / counter / queue / set.
  • model/: Operation, CounterModel, QueueModel, LockModel, SetModel.
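
To make the history-recording step concrete, here is a minimal sketch of a thread-safe recorder that stamps each operation at invocation and at completion. It is not the ChaosHistory API: HistorySketch, Recorder, Entry, invoke, and complete are hypothetical names chosen for illustration, and the sketch assumes Java 17 or later.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names and shapes are assumptions, not the real ChaosHistory.
final class HistorySketch {

    /** One finished operation with its invocation and completion timestamps. */
    record Entry(long opId, int clientId, String op, String result,
                 long invokedAtNanos, long completedAtNanos) {}

    /** Thread-safe recorder: invoke() opens an operation, complete() closes it. */
    static final class Recorder {
        private record Pending(int clientId, String op, long invokedAtNanos) {}

        private final AtomicLong nextId = new AtomicLong();
        private final ConcurrentHashMap<Long, Pending> pending = new ConcurrentHashMap<>();
        private final CopyOnWriteArrayList<Entry> completed = new CopyOnWriteArrayList<>();

        /** Stamps the invocation time and returns an id to pass to complete(). */
        long invoke(int clientId, String op) {
            long id = nextId.getAndIncrement();
            pending.put(id, new Pending(clientId, op, System.nanoTime()));
            return id;
        }

        /** Stamps the completion time and moves the operation into the history. */
        void complete(long opId, String result) {
            Pending p = pending.remove(opId);
            if (p == null) {
                throw new IllegalStateException("unknown or already completed op " + opId);
            }
            completed.add(new Entry(opId, p.clientId(), p.op(), result,
                    p.invokedAtNanos(), System.nanoTime()));
        }

        List<Entry> completed() { return List.copyOf(completed); }

        int pendingCount() { return pending.size(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Recorder recorder = new Recorder();
        Thread[] clients = new Thread[4];
        for (int c = 0; c < clients.length; c++) {
            final int clientId = c;
            clients[c] = new Thread(() -> {
                for (int i = 0; i < 3; i++) {
                    long id = recorder.invoke(clientId, "write(" + i + ")");
                    recorder.complete(id, "ok"); // a real client would call the cluster in between
                }
            });
            clients[c].start();
        }
        for (Thread t : clients) t.join();
        // 4 clients x 3 ops each
        System.out.println(recorder.completed().size() + " completed, "
                + recorder.pendingCount() + " pending");
    }
}
```

Keeping both timestamps matters because a linearizability check needs the real-time interval of each operation, not just the order in which results arrived.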

There is no per-scenario file tree under tests/. The harness is exercised by exactly two test classes, both in loom-server test sources:

  • ChaosFrameworkEnhancedTest (no chaos tag) — drives the framework primitives: WGL register/counter/queue/set checks (positive, violation, and malformed-input cases), the per-model ChaosChecker checkers, every ChaosWorkload generator, ChaosHistory recording/concurrency, ChaosReport summary/latency/fault-timeline output, the ChaosNemesis fault types against a recording cluster double, and one end-to-end ChaosTestHarness run that starts a real 3-node CacheNode cluster.
  • tests/RealClusterLinearizabilityTest (@Tag("chaos")) — starts a real 3-node ChaosRealCluster and asserts register linearizability via WglLinearizabilityChecker over concurrent leader-local / per-node map operations.
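
As an illustration of testing faults against a recording cluster double, the sketch below pairs a cut-down sealed fault family with a double that only logs the calls it receives. Everything here is an assumption made for the example: the Cluster interface, the inject method, and the record shapes are hypothetical and do not reflect the real ChaosNemesis or ChaosCluster signatures. The one detail taken from this page is that the clock-skew fault is simulated by pausing a node. Java 17 or later is assumed for sealed interfaces and records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the real ChaosNemesis hierarchy.
final class NemesisSketch {

    /** Assumed fault hooks exposed by a cluster. */
    interface Cluster {
        void partition(List<Integer> sideA, List<Integer> sideB);
        void killNode(int nodeId);
        void pauseNode(int nodeId);
    }

    /** Sealed family: the compiler knows every fault type. */
    sealed interface Nemesis permits SymmetricPartition, KillNode, ClockSkew, Combined {
        void inject(Cluster cluster);
    }

    record SymmetricPartition(List<Integer> sideA, List<Integer> sideB) implements Nemesis {
        @Override public void inject(Cluster cluster) { cluster.partition(sideA, sideB); }
    }

    record KillNode(int nodeId) implements Nemesis {
        @Override public void inject(Cluster cluster) { cluster.killNode(nodeId); }
    }

    /** Simulates clock-skew effects by pausing the node; no system clock is touched. */
    record ClockSkew(int nodeId) implements Nemesis {
        @Override public void inject(Cluster cluster) { cluster.pauseNode(nodeId); }
    }

    /** Composite fault: injects each part in order. */
    record Combined(List<Nemesis> parts) implements Nemesis {
        @Override public void inject(Cluster cluster) {
            for (Nemesis part : parts) part.inject(cluster);
        }
    }

    /** Recording double: performs no faults, only remembers what was requested. */
    static final class RecordingCluster implements Cluster {
        final List<String> calls = new ArrayList<>();
        @Override public void partition(List<Integer> a, List<Integer> b) {
            calls.add("partition " + a + " | " + b);
        }
        @Override public void killNode(int nodeId) { calls.add("kill " + nodeId); }
        @Override public void pauseNode(int nodeId) { calls.add("pause " + nodeId); }
    }

    public static void main(String[] args) {
        Nemesis combined = new Combined(List.of(
                new SymmetricPartition(List.of(1), List.of(2, 3)),
                new KillNode(3),
                new ClockSkew(2)));
        RecordingCluster cluster = new RecordingCluster();
        combined.inject(cluster);
        cluster.calls.forEach(System.out::println);
    }
}
```

A double like this lets a test assert which hooks a fault invoked, and in what order, without starting any nodes.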

ChaosNemesis is a sealed interface; all faults compose via ChaosNemesis.Combined:

  • Network: SymmetricPartition, IsolateNode, MessageReorder.
  • Process: KillNode, KillLeader, PauseNode.
  • Resource / timing: ClockSkew (simulates clock-skew effects by calling cluster.pauseNode(); it does not manipulate the system clock), SlowDisk (message-send delay), MemoryPressure, CpuContention.
  • The real-cluster runs (ChaosTestHarness / ChaosRealCluster) exercise actual Raft and TCP via real CacheNode instances, which run the WAL and snapshot machinery as part of normal startup.
  • The framework-primitive tests feed hand-built histories straight into the checkers — they verify checker/history logic without touching the network stack.
  • WglLinearizabilityChecker is a Wing-Gong-Lowe linearization search supporting the register, counter, queue, and set data types only (it rejects any other type) and carries its own internal model state. The search is bounded by MAX_SEARCH_STATES = 500_000; when that limit is reached the checker conservatively reports a violation rather than returning a heuristic pass.
  • The per-model ChaosChecker checkers cover the rest: LinearizableRegister, LinearizableCounter (CounterModel), LinearizableQueue (QueueModel), LinearizableSet (SetModel), and MutualExclusion (LockModel, fence-token + double-lock checks).
  • RealClusterLinearizabilityTest lives in loom-server test sources (not loom-integration-tests); it boots three real CacheNode instances per test, reserving fork-scoped TCP ports to stay parallel-safe under forkCount=64 / threadCount=4.
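
The bounded search described above can be illustrated with a much-simplified register checker. This is a sketch in the Wing-Gong style with a memoized (linearized set, register value) cache, not the WglLinearizabilityChecker implementation: the class and method names, the initial register value of 0, the 62-operation cap, and the separate VIOLATION_SEARCH_LIMIT verdict are all assumptions made for the example. Java 17 or later is assumed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a bounded linearization search over a register history.
final class WglSketch {
    enum Kind { WRITE, READ }
    enum Verdict { LINEARIZABLE, VIOLATION, VIOLATION_SEARCH_LIMIT }

    /** A completed operation; for READ, value is the value the client observed. */
    record Op(Kind kind, int value, long invokedAt, long completedAt) {}

    private final List<Op> ops;
    private final int maxStates;
    private final Set<String> visited = new HashSet<>(); // memo of explored dead ends
    private int explored;
    private boolean limitHit;

    private WglSketch(List<Op> ops, int maxStates) {
        this.ops = ops;
        this.maxStates = maxStates;
    }

    /** Checks the history against a register that starts at 0 (an assumption). */
    static Verdict check(List<Op> history, int maxStates) {
        if (history.size() > 62) throw new IllegalArgumentException("sketch handles at most 62 ops");
        WglSketch search = new WglSketch(history, maxStates);
        if (search.search(0L, 0)) return Verdict.LINEARIZABLE;
        // Conservative: a truncated search is reported as a violation, never as a pass.
        return search.limitHit ? Verdict.VIOLATION_SEARCH_LIMIT : Verdict.VIOLATION;
    }

    /** done is a bitmask of already linearized ops; registerValue is the model state. */
    private boolean search(long done, int registerValue) {
        if (done == (1L << ops.size()) - 1) return true;       // every op placed
        if (limitHit) return false;
        if (!visited.add(done + ":" + registerValue)) return false; // known dead end
        if (++explored > maxStates) { limitHit = true; return false; }
        for (int i = 0; i < ops.size(); i++) {
            if ((done & (1L << i)) != 0 || !mayGoNext(i, done)) continue;
            Op op = ops.get(i);
            if (op.kind() == Kind.WRITE) {
                if (search(done | (1L << i), op.value())) return true;
            } else if (op.value() == registerValue) {           // read must see current value
                if (search(done | (1L << i), registerValue)) return true;
            }
        }
        return false;
    }

    /** Op i may be linearized next only if no other pending op completed before i began. */
    private boolean mayGoNext(int i, long done) {
        for (int j = 0; j < ops.size(); j++) {
            boolean pending = (done & (1L << j)) == 0;
            if (j != i && pending && ops.get(j).completedAt() < ops.get(i).invokedAt()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Op> stale = List.of(new Op(Kind.WRITE, 1, 0, 1), new Op(Kind.READ, 0, 2, 3));
        List<Op> overlapping = List.of(new Op(Kind.WRITE, 1, 0, 5), new Op(Kind.READ, 0, 1, 2));
        System.out.println(check(stale, 500_000));       // VIOLATION
        System.out.println(check(overlapping, 500_000)); // LINEARIZABLE
        System.out.println(check(overlapping, 1));       // VIOLATION_SEARCH_LIMIT
    }
}
```

The last call shows the conservative behavior: the history is linearizable, but with a budget of one state the search cannot prove it, so the sketch reports a violation instead of guessing.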
# Framework-primitive + one end-to-end harness run (no chaos tag, runs by default):
mvn -pl loom-server test -Dut.forkCount=64 -Dut.threadCount=4 -Dit.forkCount=64 -Dit.threadCount=4 -Dtest=ChaosFrameworkEnhancedTest
# Real 3-node cluster register-linearizability run. RealClusterLinearizabilityTest is @Tag("chaos"),
# which the default unit lane excludes (ut.excludedGroups=benchmark,chaos), so opt in with -Dgroups=chaos
# and override excludedGroups to keep only benchmark excluded:
mvn -pl loom-server test -Dut.forkCount=64 -Dut.threadCount=4 -Dit.forkCount=64 -Dit.threadCount=4 -Dgroups=chaos -Dut.excludedGroups=benchmark -Dtest=RealClusterLinearizabilityTest

The framework-primitive checks complete in seconds. The real-cluster runs (the ChaosTestHarness run inside ChaosFrameworkEnhancedTest, and RealClusterLinearizabilityTest) boot real CacheNode instances and elect a Raft leader, so they take longer; the wait for each client's operations to finish is capped at 30 s.
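
One way to bound the wait on each client is a timed Future.get, as in the sketch below. This is only an illustration of a join ceiling under assumed names (JoinCeilingSketch, joinAll, run); the harness may implement its 30 s ceiling differently.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of joining client tasks with a per-client time ceiling.
final class JoinCeilingSketch {

    /** Waits for each client up to the ceiling; returns the total op count. */
    static int joinAll(List<Future<Integer>> clients, Duration ceiling) {
        int totalOps = 0;
        for (Future<Integer> client : clients) {
            try {
                totalOps += client.get(ceiling.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                client.cancel(true); // interrupt the stuck client
                throw new IllegalStateException("client exceeded join ceiling of " + ceiling, e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("client failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining client", e);
            }
        }
        return totalOps;
    }

    /** Starts the given number of stand-in clients and joins them under the ceiling. */
    static int run(int clients, int opsPerClient, long clientDelayMillis, Duration ceiling) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int c = 0; c < clients; c++) {
                futures.add(pool.submit(() -> {
                    Thread.sleep(clientDelayMillis); // stand-in for driving real operations
                    return opsPerClient;
                }));
            }
            return joinAll(futures, ceiling);
        } finally {
            pool.shutdownNow(); // never leave client threads behind
        }
    }

    public static void main(String[] args) {
        System.out.println(run(3, 5, 10, Duration.ofSeconds(30))); // 15
    }
}
```

A ceiling like this turns a hung client (for example, one stuck behind a partition) into a test failure instead of a test that never finishes.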

ChaosReport emits a human-readable summary plus a full history trace for failing runs. Replay the trace through WglLinearizabilityChecker to reproduce and debug locally.

For the broader correctness story (Raft invariants, durability, near-cache coherence), see the architecture overview.