
Performance Tuning

LoomCache has not been independently benchmarked by a third party. Treat the release SLOs below as acceptance targets for production-like validation, then measure with your workload before committing an application-facing SLO.

[Figure: Java Virtual Threads (Project Loom). Millions of lightweight virtual threads (VT) are multiplexed over a few OS carrier threads.]

LoomCache write latency is bound by Raft majority commit and WAL durability, not by a backup-ack shortcut. The baseline production target for a healthy cluster is:

| Target | Topology and workload | Objective |
|---|---|---|
| MAP_PUT write latency | 3 members, same region/AZ, SSD/NVMe WAL, 256-byte values, 10 warm clients, steady 1,000 accepted writes/sec for 10 minutes | P99 at most 100 ms, P99.9 at most 250 ms, error rate at most 0.1%, and zero lost acknowledged writes |
| MAP_GET leader-read latency | Same 3-member cluster, 10 warm clients, 5,000 reads/sec against a warmed 10,000-key map | P99 at most 25 ms and error rate at most 0.1% |
| 80/20 mixed workload | Same 3-member cluster, 80% MAP_GET, 20% MAP_PUT, 2,500 ops/sec | P99 at most 75 ms and error rate at most 0.1% |

The write SLO is valid only while the cluster is ACTIVE, no member is partitioned, no snapshot install is in progress, WAL fsync latency is within the storage target, and command queues are not backpressured. During leader election, partition healing, full snapshot transfer, or sustained backpressure, use the degradation matrix instead of the steady-state SLO.

Track the SLO with these metrics:

  • loomcache.raft.commit_latency_ms for Raft commit latency.
  • WAL fsync logs plus loomcache.raft.fsync_batch_size and loomcache.raft.snapshot_save_seconds for storage tail symptoms.
  • loomcache.command.queue_wait_ns and server-busy responses for saturation.
  • Client-side operation latency histograms from the application or load driver.
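As a rough illustration of the client-side check, the sketch below computes nearest-rank percentiles from load-driver latency samples and compares them with the MAP_PUT targets above. The class and method names are hypothetical and not part of LoomCache; a real load driver would more likely use a histogram library than a sorted array.

```java
import java.util.Arrays;

/** Hypothetical helper (not a LoomCache API): checks samples against the MAP_PUT write SLO. */
final class WriteSloCheck {
    // Targets copied from the MAP_PUT row of the SLO table.
    static final double P99_LIMIT_MS = 100.0;
    static final double P999_LIMIT_MS = 250.0;
    static final double MAX_ERROR_RATE = 0.001; // 0.1%

    /** Nearest-rank percentile over a non-empty sample; p is in (0, 100]. */
    static double percentile(double[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when P99, P99.9, and the error rate are all within the write targets. */
    static boolean meetsWriteSlo(double[] latenciesMs, long errors, long attempts) {
        double errorRate = attempts == 0 ? 0.0 : (double) errors / attempts;
        return percentile(latenciesMs, 99.0) <= P99_LIMIT_MS
                && percentile(latenciesMs, 99.9) <= P999_LIMIT_MS
                && errorRate <= MAX_ERROR_RATE;
    }
}
```

The "zero lost acknowledged writes" objective cannot be derived from latency samples; it needs a separate read-back verification pass.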

The repository includes tagged benchmark suites under loom-integration-tests/src/test/java/com/loomcache/it/benchmark and loom-integration-tests/src/test/java/com/loomcache/it/performance. Use them as a repeatable baseline/trend gate:

mvn -pl loom-integration-tests \
    -Dtags=benchmark \
    -DparallelExcludedTags=performance,stress,chaos,flaky,serial \
    -Dserial.skipITs=true \
    -Dit.forkCount=1 \
    -Dit.threadCount=1 \
    -Dit.heap=4g \
    verify

Important: the Maven integration-test harness sets -Dloomcache.wal.disableFsync=true so it is useful for regression trends, not final production SLO evidence. For release acceptance, run the same workload shape against a deployed 3-member cluster with production WAL settings, TLS/auth settings, JVM heap, disks, and network placement. Record the hardware, JVM flags, value size, client count, achieved QPS, P50/P95/P99/P99.9, error rate, and the maximum WAL fsync latency next to the release notes.

The server requires Java 25+ with --enable-preview. The shipped Dockerfile uses:

--enable-preview
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:+FlightRecorder
-XX:FlightRecorderOptions=stackdepth=64
-Xms512m -Xmx512m

Tuning advice:

  • Keep MaxGCPauseMillis below the Raft heartbeatIntervalMs to avoid spurious elections. It is a soft goal for G1, not a hard limit, so confirm observed pause times in GC logs or JFR.
  • For larger heaps (> 8 GiB) add -XX:+AlwaysPreTouch to avoid first-touch page faults.
  • Virtual threads don’t benefit from bigger stacks — leave defaults.
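The first tuning rule can be checked at startup. The sketch below reads the JVM input arguments through the standard RuntimeMXBean and compares the pause goal with a heartbeat interval passed in by the caller. GcPauseGuard is a hypothetical name, not a LoomCache class, and the fallback of 200 ms is G1's default pause goal when the flag is absent.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical startup check (not a LoomCache API). */
final class GcPauseGuard {
    private static final String FLAG = "-XX:MaxGCPauseMillis=";
    private static final long G1_DEFAULT_PAUSE_GOAL_MS = 200;

    /** Returns the last MaxGCPauseMillis value on the command line, or G1's default. */
    static long maxGcPauseMillis(List<String> jvmArgs) {
        long goal = G1_DEFAULT_PAUSE_GOAL_MS;
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                goal = Long.parseLong(arg.substring(FLAG.length()));
            }
        }
        return goal;
    }

    /** True when the pause goal is strictly below the Raft heartbeat interval. */
    static boolean pauseGoalBelowHeartbeat(List<String> jvmArgs, long heartbeatIntervalMs) {
        return maxGcPauseMillis(jvmArgs) < heartbeatIntervalMs;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // 2000 is the documented heartbeatIntervalMs default; pass your configured value.
        System.out.println("pause goal below heartbeat: " + pauseGoalBelowHeartbeat(jvmArgs, 2000));
    }
}
```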

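TcpServer uses a virtual-thread-per-connection model. When a virtual thread blocks on socket I/O, the JVM unmounts it from its carrier thread so the carrier can run other virtual threads. Blocking file operations such as fsync generally keep the virtual thread mounted, and the scheduler compensates by temporarily adding carrier threads. Either way, hand-rolled reactive glue is unnecessary: straight-line blocking code handles thousands of connections per node.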
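To make the model concrete, here is a minimal line-echo server written in the same virtual-thread-per-connection style. It is only a sketch of the pattern using standard JDK APIs; VirtualThreadEchoServer is a hypothetical class and does not reflect how TcpServer is actually implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical example: one virtual thread per accepted connection. */
final class VirtualThreadEchoServer implements AutoCloseable {
    private final ServerSocket serverSocket;
    private final ExecutorService connections = Executors.newVirtualThreadPerTaskExecutor();

    VirtualThreadEchoServer(int port) throws IOException {
        this.serverSocket = new ServerSocket(port); // port 0 picks a free port
        Thread.ofVirtual().name("acceptor").start(this::acceptLoop);
    }

    int port() {
        return serverSocket.getLocalPort();
    }

    private void acceptLoop() {
        while (!serverSocket.isClosed()) {
            try {
                Socket socket = serverSocket.accept(); // blocking call, plain straight-line code
                connections.execute(() -> handle(socket));
            } catch (IOException e) {
                return; // accept() fails once the server socket is closed
            }
        }
    }

    private static void handle(Socket socket) {
        try (socket;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     socket.getOutputStream(), true, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) { // blocks only this virtual thread
                out.println(line);
            }
        } catch (IOException ignored) {
            // Client disconnected; nothing to clean up beyond try-with-resources.
        }
    }

    @Override
    public void close() throws IOException {
        serverSocket.close();
        connections.shutdownNow();
    }
}
```

Each connection handler reads and writes with ordinary blocking calls; no callbacks or selector loops are needed.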

Set on ClusterConfig:

| Property | Default | Notes |
|---|---|---|
| heartbeatIntervalMs | 2000 | Leader → follower heartbeats and peer pings. |
| heartbeatTimeoutMs | 6000 | A peer unseen this long is considered gone. |
| idempotencyTtlMs | 60_000 | Dedup cache retention; raise for slow clients. |

Spring Boot: loomcache.server.raft.election-timeout-ms, loomcache.server.raft.heartbeat-interval-ms.

Internal Raft defaults (see RaftNode.java) include pre-vote, randomized election timeouts, leader lease, and ReadIndex — all on by default.

On LoomClient.Builder:

| Setting | Default | Notes |
|---|---|---|
| connectionTimeout | 5 s | Minimum 100 ms. |
| requestTimeout | 15 s | Per-call. |
| maxRetries | 3 | Non-negative. |
| retryBaseDelay | 100 ms | Exponential × 2 with ±25 % jitter, capped at 5 s. |
| nearCacheEnabled | true | Server push + TTL. |
| nearCacheTtl | 30 s | Fallback TTL. |
| nearCacheMaxSize | 10 000 | Client-local LRU cap; not server max-idle parity. |
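The retryBaseDelay row describes a common backoff shape. The sketch below is one way to compute it, assuming the cap is applied before jitter and the jittered result is clamped to the cap again; the actual client may order these steps differently. RetryBackoff is a hypothetical name, not the LoomClient implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical backoff calculator modelled on the retryBaseDelay description. */
final class RetryBackoff {
    static final long BASE_DELAY_MS = 100;   // retryBaseDelay default
    static final long MAX_DELAY_MS = 5_000;  // documented cap
    static final double JITTER = 0.25;       // ±25 %

    /**
     * Delay before retry number {@code attempt} (0-based).
     * {@code jitterUnit} is in [-1, 1] and is injected so tests can be deterministic.
     */
    static long delayMs(int attempt, double jitterUnit) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be non-negative");
        }
        double exponential = BASE_DELAY_MS * Math.pow(2, attempt); // 100, 200, 400, ...
        double capped = Math.min(exponential, MAX_DELAY_MS);
        double jittered = capped * (1.0 + JITTER * jitterUnit);
        // Assumption: jitter never pushes the delay above the cap.
        return Math.min(Math.round(jittered), MAX_DELAY_MS);
    }

    /** Same as above with random jitter drawn uniformly from [-1, 1). */
    static long delayMs(int attempt) {
        return delayMs(attempt, ThreadLocalRandom.current().nextDouble(-1.0, 1.0));
    }
}
```

With the defaults and maxRetries = 3, the un-jittered delays are 100 ms, 200 ms, and 400 ms, so the cap only matters when maxRetries or the base delay is raised.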

Pool tuning lives in ConnectionPool / MultiplexedConnectionPool.

The near cache is a client-side LRU with server-push invalidation (NEAR_CACHE_INVALIDATE) and sequence tracking (InvalidationSequenceTracker). Disable it for write-heavy maps, because every write invalidates all subscribers. This is a local cache policy only; server-side LRU/LFU/FIFO/RANDOM, finite max-entry/max-memory eviction, and max-idle remain unsupported for production until eviction decisions are Raft-applied and proven through WAL/snapshot/restart tests.
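The following sketch shows the general shape of such a client-local cache: an access-ordered LinkedHashMap capped at a maximum size, a fallback TTL, and an invalidate hook that a push handler could call. NearCacheSketch is a hypothetical class for illustration; it omits sequence tracking and is not the LoomCache near-cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-local LRU with fallback TTL and explicit invalidation. */
final class NearCacheSketch<K, V> {
    private record Timed<T>(T value, long expiresAtMs) {}

    private final long ttlMs;
    private final LinkedHashMap<K, Timed<V>> entries;

    NearCacheSketch(int maxSize, long ttlMs) {
        this.ttlMs = ttlMs;
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxSize; // evict the LRU entry once over the cap
            }
        };
    }

    synchronized void put(K key, V value, long nowMs) {
        entries.put(key, new Timed<>(value, nowMs + ttlMs));
    }

    /** Returns the cached value, or null when absent or past its TTL. */
    synchronized V get(K key, long nowMs) {
        Timed<V> timed = entries.get(key);
        if (timed == null) {
            return null;
        }
        if (nowMs >= timed.expiresAtMs()) {
            entries.remove(key);
            return null;
        }
        return timed.value();
    }

    /** Called when the server pushes an invalidation for this key. */
    synchronized void invalidate(K key) {
        entries.remove(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

The clock is passed in as nowMs so the TTL path can be tested without sleeping; production code would read a monotonic clock instead.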

Size by heap budget first, then entries, listener registrations, query execution metadata, and named data-structure count. The default server JVM in the Docker image uses -Xmx512m; production nodes should set an explicit heap and leave headroom for Raft, WAL buffers, serializers, metrics, TCP connections, and GC.

Use this worksheet per member:

usable_cache_heap = Xmx * 0.60
entry_budget = local_entry_copies * estimated_entry_bytes
listener_budget = listener_registrations * estimated_listener_registration_bytes
query_metadata_budget = bounded_query_working_set_bytes
remaining_heap = usable_cache_heap - entry_budget - listener_budget - query_metadata_budget
safe_instance_cap = floor(remaining_heap / measured_empty_instance_bytes)

Keep the resulting safe_instance_cap at or below DataStructureRegistry.maxInstancesPerType (default 10_000 per data-structure type, exported as loomcache_datastructures_max_instances_per_type). If remaining_heap is negative, reduce entries, listeners, query metadata, or the instance count before increasing traffic. CREATE INDEX and declarative SQL indexes are unsupported/rejected in this release, so production query budgets must not assume index acceleration.
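The worksheet translates directly into code. The sketch below is a hypothetical planning helper (HeapWorksheet is not a LoomCache class) that uses integer arithmetic for the 60% factor and clamps the result to the default maxInstancesPerType of 10,000; all inputs are estimates you supply from staging measurements.

```java
/** Hypothetical per-member sizing helper that mirrors the worksheet. */
final class HeapWorksheet {
    static final long MAX_INSTANCES_PER_TYPE = 10_000; // documented default ceiling

    record Result(long usableCacheHeap, long remainingHeap, long safeInstanceCap) {}

    static Result plan(long xmxBytes,
                       long localEntryCopies,
                       long estimatedEntryBytes,
                       long listenerRegistrations,
                       long estimatedListenerRegistrationBytes,
                       long boundedQueryWorkingSetBytes,
                       long measuredEmptyInstanceBytes) {
        if (measuredEmptyInstanceBytes <= 0) {
            throw new IllegalArgumentException("measuredEmptyInstanceBytes must be positive");
        }
        long usableCacheHeap = Math.multiplyExact(xmxBytes, 60L) / 100; // Xmx * 0.60
        long entryBudget = Math.multiplyExact(localEntryCopies, estimatedEntryBytes);
        long listenerBudget =
                Math.multiplyExact(listenerRegistrations, estimatedListenerRegistrationBytes);
        long remainingHeap =
                usableCacheHeap - entryBudget - listenerBudget - boundedQueryWorkingSetBytes;
        // A negative remainder means the plan does not fit; report a cap of zero.
        long safeInstanceCap = remainingHeap <= 0
                ? 0
                : Math.min(remainingHeap / measuredEmptyInstanceBytes, MAX_INSTANCES_PER_TYPE);
        return new Result(usableCacheHeap, remainingHeap, safeInstanceCap);
    }
}
```

For example, with a 2 GiB heap, 1,200,000 local copies at 944 bytes, 1,000 listener registrations at an assumed 2 KiB each, a 64 MiB query working set, and an assumed 32 KiB per empty instance, the helper reports roughly 82.5 MiB remaining and a safe cap of 2,640 instances.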

DistributedMap tracks estimated live map memory through getCurrentMemoryBytes() and enforces maxMemoryBytesPerMap when configured. The runtime estimate for each map entry is:

| Component | Bytes used by the built-in estimator |
|---|---|
| Entry/container overhead | 48 |
| String key or value | 16 + 3 * char_count |
| Non-string key or value | 64 until measured more precisely |

For cluster sizing, count stored copies, not only logical keys. The production-supported single-Raft-group path is full replication: every member keeps each committed entry. A steady-state cluster keeps about logical_entries * member_count entry copies before temporary migration headroom. Per member, start with:

local_entry_copies = logical_entries * 1.20

The 1.20 factor reserves imbalance and migration headroom. Raise it for skewed keys or long migrations. For a map with 32-character string keys and 256-character string values, the built-in estimate is 48 + (16 + 3*32) + (16 + 3*256) = 944 bytes per local copy before listener fan-out.
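The arithmetic above can be reproduced with a few lines of code. This sketch restates the estimator table and the 1.20 headroom factor; EntrySizeEstimate is a hypothetical helper for planning and is not the code behind getCurrentMemoryBytes().

```java
/** Hypothetical planning helper that restates the built-in estimator table. */
final class EntrySizeEstimate {
    static final long ENTRY_OVERHEAD_BYTES = 48;
    static final long NON_STRING_BYTES = 64;

    /** Estimated bytes for a string key or value: 16 + 3 * char_count. */
    static long stringBytes(int charCount) {
        return 16 + 3L * charCount;
    }

    /** Estimated bytes for one entry with a string key and a string value. */
    static long stringEntryBytes(int keyChars, int valueChars) {
        return ENTRY_OVERHEAD_BYTES + stringBytes(keyChars) + stringBytes(valueChars);
    }

    /** logical_entries * 1.20, rounded up, using integer math to avoid float drift. */
    static long localEntryCopies(long logicalEntries) {
        return (Math.multiplyExact(logicalEntries, 120L) + 99) / 100;
    }

    /** Per-member entry budget in bytes for string-keyed, string-valued maps. */
    static long entryBudgetBytes(long logicalEntries, int keyChars, int valueChars) {
        return Math.multiplyExact(localEntryCopies(logicalEntries),
                stringEntryBytes(keyChars, valueChars));
    }
}
```

For one million logical entries with 32-character keys and 256-character values, this gives 1,200,000 local copies and an entry budget of 1,132,800,000 bytes (about 1.05 GiB) per member.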

Listeners add registration state and delivery work, not per-key storage. Model them separately:

| Listener type | Stored state to budget |
|---|---|
| Remote entry listener | One map-to-peer subscription plus one peer-to-map subscription in DistributedEventListenerManager. |
| Predicate entry listener | Remote entry listener state plus one predicate subscription holding the peer id and predicate object. |
| Embedded map listener | One CopyOnWriteArrayList registration holding the listener reference and optional predicate. |
| Continuous Query Cache listener | A remote listener plus a client-side cached view of the matching entries. |
| Topic subscriber | One subscriber entry per subscription, plus optional filtered-subscriber state and executor state when multi-threading is enabled. |

Use heap-delta measurements in staging for estimated_listener_registration_bytes; the dominant cost is usually the application listener or predicate capture, not the registry entry itself. Listener fan-out also multiplies mutation CPU and network writes, so capacity tests should include the expected listener count even when heap looks comfortable.

The default maxInstancesPerType = 10_000 is a safety ceiling, not a promise that every heap can run 10,000 active maps, queues, topics, sets, ringbuffers, and CRDTs with entries and listeners. Empty-instance allowance for that ceiling looks like this before entries/listeners/query working sets:

| Heap (Xmx) | 60% usable cache heap | Allowance per empty instance at 10,000 instances |
|---|---|---|
| 512 MiB | 307 MiB | about 31 KiB |
| 2 GiB | 1.2 GiB | about 126 KiB |
| 8 GiB | 4.8 GiB | about 503 KiB |

Measure empty-instance heap delta for the specific data structures you use, then cap names to the lower of the measured safe_instance_cap and 10_000. When running an embedded node, lower the guardrail with DataStructureRegistry.setMaxInstancesPerType(...) during bootstrap. When running the stock server, enforce application-level naming quotas and alert on loomcache_datastructures_max_instances_per_type together with data structure count, memory, and listener-count metrics.
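The allowance figures in the table come from dividing 60% of the heap by the instance ceiling. A small sketch of that calculation, useful for heap sizes the table does not list (EmptyInstanceAllowance is a hypothetical helper name):

```java
/** Hypothetical helper: heap allowance per empty instance before entries and listeners. */
final class EmptyInstanceAllowance {
    /** Returns (Xmx * 0.60) / instances in bytes, using integer arithmetic. */
    static long allowanceBytes(long xmxBytes, long instances) {
        if (instances <= 0) {
            throw new IllegalArgumentException("instances must be positive");
        }
        return Math.multiplyExact(xmxBytes, 60L) / 100 / instances;
    }

    /** Same value rounded to the nearest KiB, matching the table's "about N KiB" style. */
    static long allowanceKiB(long xmxBytes, long instances) {
        return Math.round(allowanceBytes(xmxBytes, instances) / 1024.0);
    }
}
```

Compare the result with your measured empty-instance heap delta: if the measurement exceeds the allowance, the 10,000-instance ceiling is not reachable on that heap even with no entries.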

Server-side load controls:

  • ReadCoalescingFilter + RequestCoalescer deduplicate concurrent reads per key on the server.
  • BackpressureController emits RESPONSE_SERVER_BUSY when the command queue fills; clients back off.
  • RateLimiter and PerClientRateLimiter cap QPS globally and per client.

Metrics to watch:

  • loomcache.raft.commit_latency_ms: Raft health.
  • WAL fsync logs plus loomcache.raft.fsync_batch_size / loomcache.raft.snapshot_save_seconds: disk bottlenecks.
  • loomcache.command.queue_wait_ns: backpressure proximity.
  • loom.connection.pool.waiters: pool size too small.
  • tls.cert.expiration.days: rotate before certExpirationCriticalDays.
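The backpressure behaviour attributed to BackpressureController in the list above follows a familiar pattern: admit commands into a bounded queue and reject immediately when it is full, rather than blocking the caller. The sketch below illustrates only that pattern; BoundedCommandQueue and its Admission enum are hypothetical and are not LoomCache classes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded command queue that rejects instead of blocking when full. */
final class BoundedCommandQueue {
    enum Admission { ACCEPTED, SERVER_BUSY }

    private final BlockingQueue<Runnable> queue;

    BoundedCommandQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking admission: a full queue maps to a busy response for the client. */
    Admission submit(Runnable command) {
        return queue.offer(command) ? Admission.ACCEPTED : Admission.SERVER_BUSY;
    }

    /** Runs the next queued command, if any; returns false when the queue is empty. */
    boolean runNext() {
        Runnable command = queue.poll();
        if (command == null) {
            return false;
        }
        command.run();
        return true;
    }

    int depth() {
        return queue.size();
    }
}
```

Rejecting early keeps queue wait time bounded, which is why queue-wait metrics and busy responses are useful leading indicators of saturation.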

For Spring Boot and the Kubernetes manifests, scrape https://<node>:9090/actuator/prometheus. Direct CacheNodeMain deployments with the standalone metrics listener can scrape http://<node>:9090/metrics. Sample Grafana dashboards live in grafana/.