Performance Tuning
LoomCache has not been independently benchmarked by a third party. Treat the release SLOs below as acceptance targets for production-like validation, then measure with your workload before committing an application-facing SLO.
[Figure: Java virtual threads (Project Loom). Millions of lightweight virtual threads are multiplexed over a few OS carrier threads (OS Thread 1, OS Thread 2).]
Release SLO Targets
LoomCache write latency is bound by Raft majority commit and WAL durability, not by a backup-ack shortcut. The baseline production target for a healthy cluster is:
| Target | Topology and workload | Objective |
|---|---|---|
| MAP_PUT write latency | 3 members, same region/AZ, SSD/NVMe WAL, 256-byte values, 10 warm clients, steady 1,000 accepted writes/sec for 10 minutes | P99 at most 100 ms, P99.9 at most 250 ms, error rate at most 0.1%, and zero lost acknowledged writes |
| MAP_GET leader-read latency | Same 3-member cluster, 10 warm clients, 5,000 reads/sec against a warmed 10,000-key map | P99 at most 25 ms and error rate at most 0.1% |
| 80/20 mixed workload | Same 3-member cluster, 80% MAP_GET, 20% MAP_PUT, 2,500 ops/sec | P99 at most 75 ms and error rate at most 0.1% |
The write SLO is valid only while the cluster is ACTIVE, no member is partitioned, no snapshot install is in progress,
WAL fsync latency is within the storage target, and command queues are not backpressured. During leader election,
partition healing, full snapshot transfer, or sustained backpressure, use the degradation matrix instead of the
steady-state SLO.
Track the SLO with these metrics:
- loomcache.raft.commit_latency_ms for Raft commit latency.
- WAL fsync logs plus loomcache.raft.fsync_batch_size and loomcache.raft.snapshot_save_seconds for storage tail symptoms.
- loomcache.command.queue_wait_ns and server-busy responses for saturation.
- Client-side operation latency histograms from the application or load driver.
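The client-side histogram is the piece that decides whether a run met the targets. A minimal sketch of that check for the MAP_PUT row follows, assuming the load driver records one latency sample in milliseconds per accepted write. SloCheckSketch and its method names are hypothetical and not part of LoomCache; the thresholds are copied from the table above.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of LoomCache: checks load-driver samples against the MAP_PUT targets. */
final class SloCheckSketch {

    /** Nearest-rank percentile; basisPoints is the percentile times 100 (P99.9 is 9990). */
    static long percentile(long[] sortedMillis, int basisPoints) {
        if (sortedMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        // Integer ceiling of (basisPoints / 10000) * n avoids floating-point rank errors.
        long rank = ((long) basisPoints * sortedMillis.length + 9_999) / 10_000;
        return sortedMillis[(int) Math.max(rank, 1) - 1];
    }

    /** True when P99 <= 100 ms, P99.9 <= 250 ms, and the error rate is at most 0.1%. */
    static boolean meetsMapPutSlo(long[] latenciesMillis, long errors, long attempts) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        boolean errorRateOk = errors * 1_000 <= attempts; // errors / attempts <= 0.001
        return percentile(sorted, 9_900) <= 100
                && percentile(sorted, 9_990) <= 250
                && errorRateOk;
    }

    public static void main(String[] args) {
        long[] samples = new long[1_000];
        Arrays.fill(samples, 20L); // every write took 20 ms
        System.out.println("MAP_PUT SLO met: " + meetsMapPutSlo(samples, 0, 1_000));
    }
}
```

The check covers latency and error rate only. The "zero lost acknowledged writes" objective needs a separate read-back verification of every acknowledged key.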
Baseline Benchmark
The repository includes tagged benchmark suites under loom-integration-tests/src/test/java/com/loomcache/it/benchmark
and loom-integration-tests/src/test/java/com/loomcache/it/performance. Use them as a repeatable baseline/trend gate:
    mvn -pl loom-integration-tests \
      -Dtags=benchmark \
      -DparallelExcludedTags=performance,stress,chaos,flaky,serial \
      -Dserial.skipITs=true \
      -Dit.forkCount=1 \
      -Dit.threadCount=1 \
      -Dit.heap=4g \
      verify

Important: the Maven integration-test harness sets -Dloomcache.wal.disableFsync=true so it is useful for regression
trends, not final production SLO evidence. For release acceptance, run the same workload shape against a deployed
3-member cluster with production WAL settings, TLS/auth settings, JVM heap, disks, and network placement. Record the
hardware, JVM flags, value size, client count, achieved QPS, P50/P95/P99/P99.9, error rate, and the maximum WAL fsync
latency next to the release notes.
The server requires Java 25+ with --enable-preview. The shipped Dockerfile uses:
    --enable-preview
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=100
    -XX:+FlightRecorder
    -XX:FlightRecorderOptions=stackdepth=64
    -Xms512m -Xmx512m

Tuning advice:
- Keep MaxGCPauseMillis below the Raft heartbeatIntervalMs to avoid spurious elections.
- For larger heaps (> 8 GiB) add -XX:+AlwaysPreTouch to avoid first-touch page faults.
- Virtual threads don’t benefit from bigger stacks; leave defaults.
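The first rule can be checked at startup. The sketch below reads the running JVM's arguments through the standard RuntimeMXBean and compares the pause goal with a heartbeat interval supplied by the caller. GcPauseCheckSketch is a hypothetical helper, and the 2000 ms value in main is only the documented ClusterConfig default, not something read from a live node.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical startup check, not part of LoomCache. */
final class GcPauseCheckSketch {
    private static final String FLAG = "-XX:MaxGCPauseMillis=";

    /** Returns the last -XX:MaxGCPauseMillis value in the given JVM arguments, if the flag is present. */
    static OptionalLong maxGcPauseMillis(List<String> jvmArgs) {
        OptionalLong found = OptionalLong.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                found = OptionalLong.of(Long.parseLong(arg.substring(FLAG.length())));
            }
        }
        return found;
    }

    /** True when the pause goal is set explicitly and is strictly below the heartbeat interval. */
    static boolean pauseGoalBelowHeartbeat(List<String> jvmArgs, long heartbeatIntervalMs) {
        OptionalLong pause = maxGcPauseMillis(jvmArgs);
        return pause.isPresent() && pause.getAsLong() < heartbeatIntervalMs;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        long heartbeatIntervalMs = 2_000; // assumption: documented default, pass the real value in practice
        System.out.println("pause goal below heartbeat: "
                + pauseGoalBelowHeartbeat(jvmArgs, heartbeatIntervalMs));
    }
}
```

MaxGCPauseMillis is a G1 pause goal rather than a hard limit, so passing this check does not guarantee that every pause stays under the heartbeat interval; GC logs remain the evidence.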
Virtual threads
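TcpServer uses a virtual-thread-per-connection model. When a virtual thread blocks on socket I/O, the JVM
unmounts it from its carrier thread, which is then free to run other virtual threads. Blocking file operations such as
fsync can instead keep the carrier occupied while the JDK scheduler temporarily adds carrier parallelism to compensate.
Hand-rolled reactive glue is therefore unnecessary: straight-line blocking code handles thousands of connections per node.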
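To illustrate the model, here is a minimal echo server written in the same straight-line blocking style. It is not TcpServer's code; VirtualThreadEchoSketch and the line-based echo protocol are assumptions made for the example. It uses only standard JDK APIs (Executors.newVirtualThreadPerTaskExecutor, ServerSocket).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration of virtual-thread-per-connection; not LoomCache's TcpServer. */
final class VirtualThreadEchoSketch {

    /** Echoes each received line back to the peer using plain blocking reads and writes. */
    static void handle(Socket socket) {
        try (socket;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) { // blocking read; the virtual thread parks here
                out.println(line);
            }
        } catch (IOException e) {
            // Peer went away; try-with-resources has already closed the socket.
        }
    }

    /** Starts a loopback server on an ephemeral port, sends one line, and returns the echoed reply. */
    static String roundTrip(String message) {
        // The executor is declared first so the server socket closes first,
        // which unblocks accept() before the executor waits for its tasks.
        try (ExecutorService perConnection = Executors.newVirtualThreadPerTaskExecutor();
             ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            perConnection.submit(() -> {
                while (!server.isClosed()) {
                    Socket accepted = server.accept();          // one virtual thread accepts
                    perConnection.submit(() -> handle(accepted)); // one virtual thread per connection
                }
                return null;
            });
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true, StandardCharsets.UTF_8)) {
                out.println(message);
                return in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

The handler contains no callbacks or futures. Each connection costs one virtual thread, which is cheap enough that the accept loop never needs a bounded worker pool.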
Raft timing
Set on ClusterConfig:
| Property | Default | Notes |
|---|---|---|
| heartbeatIntervalMs | 2000 | Leader → follower heartbeats and peer pings. |
| heartbeatTimeoutMs | 6000 | A peer unseen this long is considered gone. |
| idempotencyTtlMs | 60_000 | Dedup cache retention; raise for slow clients. |
Spring Boot: loomcache.server.raft.election-timeout-ms, loomcache.server.raft.heartbeat-interval-ms.
Internal Raft defaults (see RaftNode.java) include pre-vote, randomized election timeouts, leader lease, and
ReadIndex — all on by default.
Client
On LoomClient.Builder:
| Setting | Default | Notes |
|---|---|---|
| connectionTimeout | 5 s | Minimum 100 ms. |
| requestTimeout | 15 s | Per-call. |
| maxRetries | 3 | Non-negative. |
| retryBaseDelay | 100 ms | Exponential × 2 with ±25 % jitter, capped at 5 s. |
| nearCacheEnabled | true | Server push + TTL. |
| nearCacheTtl | 30 s | Fallback TTL. |
| nearCacheMaxSize | 10 000 | Client-local LRU cap; not server max-idle parity. |
Pool tuning lives in ConnectionPool / MultiplexedConnectionPool.
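The retryBaseDelay row describes a schedule that is easy to tabulate when choosing requestTimeout and maxRetries. The sketch below is one reading of that row, not the client's actual code. It assumes that the first retry waits the base delay, that jitter is applied before the 5 s cap, and that the jitter source is a uniform value in [0, 1). RetryBackoffSketch is a hypothetical name.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical model of the documented retry schedule; not LoomClient's implementation. */
final class RetryBackoffSketch {
    static final Duration CAP = Duration.ofSeconds(5);

    /**
     * Delay before retry number attempt (1-based).
     * jitterUnit is in [0, 1): 0 maps to -25%, 0.5 to no jitter, values near 1 to +25%.
     */
    static Duration delay(Duration base, int attempt, double jitterUnit) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt is 1-based");
        }
        // Doubling per attempt; the shift is clamped so typical base delays cannot overflow.
        long exponentialMillis = base.toMillis() << Math.min(attempt - 1, 20);
        double factor = 0.75 + 0.5 * jitterUnit;
        long jitteredMillis = Math.round(exponentialMillis * factor);
        // Assumption: the cap is applied after jitter, so no delay exceeds 5 s.
        return Duration.ofMillis(Math.min(jitteredMillis, CAP.toMillis()));
    }

    /** Same schedule with random jitter. */
    static Duration delay(Duration base, int attempt) {
        return delay(base, attempt, ThreadLocalRandom.current().nextDouble());
    }

    public static void main(String[] args) {
        Duration base = Duration.ofMillis(100); // documented retryBaseDelay default
        for (int attempt = 1; attempt <= 3; attempt++) { // documented maxRetries default
            System.out.println("retry " + attempt + ": " + delay(base, attempt).toMillis() + " ms");
        }
    }
}
```

Under these assumptions and the defaults, the three retries add at most 125 + 250 + 500 ms of waiting, which is small next to the 15 s requestTimeout that each attempt can consume.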
Near cache
Client-side LRU with server-push invalidation (NEAR_CACHE_INVALIDATE) and sequence tracking
(InvalidationSequenceTracker). Disable for write-heavy maps — every write invalidates all subscribers. This is a
local cache policy only; server-side LRU/LFU/FIFO/RANDOM, finite max-entry/max-memory eviction, and max-idle remain
unsupported for production until eviction decisions are Raft-applied and proven through WAL/snapshot/restart tests.
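For readers who want a mental model of "client-local LRU plus fallback TTL plus push invalidation", the sketch below builds one from a LinkedHashMap in access order. It is not the LoomCache near cache and leaves out sequence tracking entirely. NearCacheSketch and its methods are hypothetical, and time is passed in explicitly to keep the behaviour deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of a client-local near cache; not LoomCache's implementation. */
final class NearCacheSketch<K, V> {
    private record Timed<T>(T value, long expiresAtMillis) {}

    private final long ttlMillis;
    private final LinkedHashMap<K, Timed<V>> entries;

    NearCacheSketch(int maxSize, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least-recently-used entry first in iteration order.
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxSize; // evict the LRU entry once the cap is exceeded
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        entries.put(key, new Timed<>(value, nowMillis + ttlMillis));
    }

    /** Returns the cached value, or null on a miss or once the fallback TTL has passed. */
    synchronized V get(K key, long nowMillis) {
        Timed<V> hit = entries.get(key);
        if (hit == null) {
            return null;
        }
        if (nowMillis >= hit.expiresAtMillis()) {
            entries.remove(key);
            return null;
        }
        return hit.value();
    }

    /** What a server-push invalidation for this key would do locally. */
    synchronized void invalidate(K key) {
        entries.remove(key);
    }

    public static void main(String[] args) {
        NearCacheSketch<String, String> cache = new NearCacheSketch<>(10_000, 30_000);
        cache.put("user:1", "alice", 0);
        cache.invalidate("user:1"); // a write elsewhere pushes an invalidation
        System.out.println(cache.get("user:1", 1)); // miss, so the client would go to the server
    }
}
```

The model shows why write-heavy maps gain little: each write triggers invalidate on every subscriber, so entries are dropped before they can serve many reads.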
Capacity sizing
Size by heap budget first, then entries, listener registrations, query execution metadata, and named data-structure
count. The default server JVM in the Docker image uses -Xmx512m; production nodes should set an explicit heap and
leave headroom for Raft, WAL buffers, serializers, metrics, TCP connections, and GC.
Use this worksheet per member:
    usable_cache_heap = Xmx * 0.60
    entry_budget = local_entry_copies * estimated_entry_bytes
    listener_budget = listener_registrations * estimated_listener_registration_bytes
    query_metadata_budget = bounded_query_working_set_bytes
    remaining_heap = usable_cache_heap - entry_budget - listener_budget - query_metadata_budget
    safe_instance_cap = floor(remaining_heap / measured_empty_instance_bytes)

Keep the resulting safe_instance_cap at or below DataStructureRegistry.maxInstancesPerType (default 10_000 per
data-structure type, exported as loomcache_datastructures_max_instances_per_type). If remaining_heap is negative,
reduce entries, listeners, query metadata, or the instance count before increasing traffic. CREATE INDEX and
declarative SQL indexes are unsupported/rejected in this release, so production query budgets must not assume index
acceleration.
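The worksheet translates directly into a few lines of arithmetic. The sketch below is a hypothetical calculator (CapacityWorksheetSketch is not a LoomCache class), and the inputs in main are illustrative numbers, not measurements: 2 KiB per listener registration, a 64 MiB query working set, and 32 KiB per empty instance are assumptions that should be replaced with staging heap deltas.

```java
/** Hypothetical calculator for the per-member worksheet; all inputs in main are illustrative. */
final class CapacityWorksheetSketch {
    static final long MAX_INSTANCES_PER_TYPE = 10_000; // documented default ceiling

    /** usable_cache_heap = Xmx * 0.60, in integer arithmetic. */
    static long usableCacheHeap(long xmxBytes) {
        return xmxBytes * 60 / 100;
    }

    /** remaining_heap after entries, listeners, and query metadata; negative means over budget. */
    static long remainingHeap(long xmxBytes,
                              long localEntryCopies, long estimatedEntryBytes,
                              long listenerRegistrations, long estimatedListenerRegistrationBytes,
                              long queryMetadataBudgetBytes) {
        long entryBudget = localEntryCopies * estimatedEntryBytes;
        long listenerBudget = listenerRegistrations * estimatedListenerRegistrationBytes;
        return usableCacheHeap(xmxBytes) - entryBudget - listenerBudget - queryMetadataBudgetBytes;
    }

    /** floor(remaining_heap / measured_empty_instance_bytes), clamped to [0, maxInstancesPerType]. */
    static long safeInstanceCap(long remainingHeapBytes, long measuredEmptyInstanceBytes) {
        if (remainingHeapBytes <= 0) {
            return 0; // reduce entries, listeners, or query metadata first
        }
        return Math.min(remainingHeapBytes / measuredEmptyInstanceBytes, MAX_INSTANCES_PER_TYPE);
    }

    public static void main(String[] args) {
        long xmx = 2L * 1024 * 1024 * 1024;          // -Xmx2g
        long remaining = remainingHeap(xmx,
                1_000_000, 944,                      // local copies and bytes per copy (string example)
                1_000, 2_048,                        // assumption: 2 KiB per listener registration
                64L * 1024 * 1024);                  // assumption: 64 MiB query working set
        System.out.println("remaining_heap    = " + remaining);
        System.out.println("safe_instance_cap = " + safeInstanceCap(remaining, 32 * 1024));
    }
}
```

With these illustrative inputs the cap lands below 10,000, which is the situation where an application-level naming quota matters more than the registry ceiling.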
Per-key entry model
DistributedMap tracks estimated live map memory through getCurrentMemoryBytes() and enforces
maxMemoryBytesPerMap when configured. The runtime estimate for each map entry is:
| Component | Bytes used by the built-in estimator |
|---|---|
| Entry/container overhead | 48 |
| String key or value | 16 + 3 * char_count |
| Non-string key or value | 64 until measured more precisely |
For cluster sizing, count stored copies, not only logical keys. The production-supported single-Raft-group path is
full replication: every member keeps each committed entry. A steady-state cluster keeps about
logical_entries * member_count entry copies before temporary migration headroom. Per member, start with:
    local_entry_copies = logical_entries * 1.20

The 1.20 factor reserves imbalance and migration headroom. Raise it for skewed keys or long migrations. For a map
with 32-character string keys and 256-character string values, the built-in estimate is
48 + (16 + 3*32) + (16 + 3*256) = 944 bytes per local copy before listener fan-out.
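The same estimate can be reproduced in code when sizing several maps with different key and value lengths. This is a sketch of the table's arithmetic only; EntryEstimateSketch is a hypothetical name and does not call into DistributedMap.

```java
/** Hypothetical re-implementation of the documented estimator arithmetic for sizing spreadsheets. */
final class EntryEstimateSketch {
    static final long ENTRY_OVERHEAD_BYTES = 48;
    static final long NON_STRING_BYTES = 64; // placeholder the estimator uses until measured

    /** String key or value: 16 + 3 * char_count. */
    static long stringBytes(int charCount) {
        return 16 + 3L * charCount;
    }

    /** Estimated bytes for one local copy of an entry with a string key and a string value. */
    static long stringEntryBytes(int keyChars, int valueChars) {
        return ENTRY_OVERHEAD_BYTES + stringBytes(keyChars) + stringBytes(valueChars);
    }

    /** local_entry_copies = logical_entries * 1.20, in integer arithmetic. */
    static long localEntryCopies(long logicalEntries) {
        return logicalEntries * 120 / 100;
    }

    public static void main(String[] args) {
        long perCopy = stringEntryBytes(32, 256);
        long copies = localEntryCopies(1_000_000);
        System.out.println("bytes per local copy: " + perCopy);
        System.out.println("entry budget per member: " + copies * perCopy + " bytes");
    }
}
```

For one million logical entries of that shape, the per-member entry budget is 1,200,000 copies at 944 bytes each, roughly 1.06 GiB, which already exceeds the 60% usable share of a 1 GiB heap.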
Per-listener model
Listeners add registration state and delivery work, not per-key storage. Model them separately:
| Listener type | Stored state to budget |
|---|---|
| Remote entry listener | One map-to-peer subscription plus one peer-to-map subscription in DistributedEventListenerManager. |
| Predicate entry listener | Remote entry listener state plus one predicate subscription holding the peer id and predicate object. |
| Embedded map listener | One CopyOnWriteArrayList registration holding the listener reference and optional predicate. |
| Continuous Query Cache listener | A remote listener plus a client-side cached view of the matching entries. |
| Topic subscriber | One subscriber entry per subscription, plus optional filtered-subscriber state and executor state when multi-threading is enabled. |
Use heap-delta measurements in staging for estimated_listener_registration_bytes; the dominant cost is usually the
application listener or predicate capture, not the registry entry itself. Listener fan-out also multiplies mutation CPU
and network writes, so capacity tests should include the expected listener count even when heap looks comfortable.
Data-structure count versus RAM
The default maxInstancesPerType = 10_000 is a safety ceiling, not a promise that every heap can run 10,000 active
maps, queues, topics, sets, ringbuffers, and CRDTs with entries and listeners. Empty-instance allowance for that ceiling
looks like this before entries/listeners/query working sets:
| Heap (Xmx) | 60% usable cache heap | Allowance per empty instance at 10,000 instances |
|---|---|---|
| 512 MiB | 307 MiB | about 31 KiB |
| 2 GiB | 1.2 GiB | about 126 KiB |
| 8 GiB | 4.8 GiB | about 503 KiB |
Measure empty-instance heap delta for the specific data structures you use, then cap names to the lower of the
measured safe_instance_cap and 10_000. When running an embedded node, lower the guardrail with
DataStructureRegistry.setMaxInstancesPerType(...) during bootstrap. When running the stock server, enforce
application-level naming quotas and alert on loomcache_datastructures_max_instances_per_type together with data
structure count, memory, and listener-count metrics.
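The allowance column is simply 60% of the heap divided by the instance ceiling. A small sketch that reproduces the three rows, and can be rerun for other heap sizes or a lowered ceiling, follows. EmptyInstanceAllowanceSketch is a hypothetical helper; it does not read any LoomCache configuration.

```java
/** Hypothetical helper that reproduces the empty-instance allowance table. */
final class EmptyInstanceAllowanceSketch {

    /** Allowance per empty instance in KiB, rounded to the nearest whole KiB. */
    static long allowanceKiB(long xmxBytes, int instances) {
        long usableCacheHeap = xmxBytes * 60 / 100; // same 60% share as the worksheet
        return Math.round(usableCacheHeap / (double) instances / 1024.0);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long[] heaps = {512 * mib, 2048 * mib, 8192 * mib};
        for (long xmx : heaps) {
            System.out.println((xmx / mib) + " MiB heap: about "
                    + allowanceKiB(xmx, 10_000) + " KiB per empty instance");
        }
    }
}
```

Comparing the result with the measured empty-instance heap delta shows quickly whether the 10,000 ceiling is reachable on a given heap before any entries or listeners are added.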
Coalescing & backpressure
- ReadCoalescingFilter + RequestCoalescer deduplicate concurrent reads per key on the server.
- BackpressureController emits RESPONSE_SERVER_BUSY when the command queue fills; clients back off.
- RateLimiter and PerClientRateLimiter cap QPS globally and per client.
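Read coalescing in general means that concurrent requests for the same key share one in-flight load. The sketch below shows that technique with a ConcurrentHashMap of futures. It is a generic illustration and makes no claim about how ReadCoalescingFilter or RequestCoalescer are implemented; ReadCoalescerSketch, its method, and the loader and executor parameters are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Generic per-key read coalescing; not LoomCache's RequestCoalescer. */
final class ReadCoalescerSketch<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** Returns the in-flight future for key if one exists, otherwise starts a single new load. */
    CompletableFuture<V> get(K key, Function<K, V> loader, Executor executor) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing; // join the read that is already running
        }
        try {
            executor.execute(() -> {
                try {
                    V value = loader.apply(key);
                    inFlight.remove(key, fresh); // later callers start a new load
                    fresh.complete(value);
                } catch (RuntimeException e) {
                    inFlight.remove(key, fresh);
                    fresh.completeExceptionally(e);
                }
            });
        } catch (RuntimeException rejected) {
            // The executor refused the task; do not leave a dead future in the map.
            inFlight.remove(key, fresh);
            fresh.completeExceptionally(rejected);
        }
        return fresh;
    }

    public static void main(String[] args) {
        ReadCoalescerSketch<String, String> coalescer = new ReadCoalescerSketch<>();
        // Runnable::run executes the load inline, so this call completes immediately.
        String value = coalescer.get("k", key -> "value-for-" + key, Runnable::run).join();
        System.out.println(value);
    }
}
```

The entry is removed before the future completes, so a caller arriving after completion always triggers a new load instead of reading a result that may already be stale.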
Metrics worth alerting on
- loomcache.raft.commit_latency_ms: Raft health.
- WAL fsync logs plus loomcache.raft.fsync_batch_size / loomcache.raft.snapshot_save_seconds: disk bottlenecks.
- loomcache.command.queue_wait_ns: backpressure proximity.
- loom.connection.pool.waiters: pool size too small.
- tls.cert.expiration.days: rotate before certExpirationCriticalDays.
For Spring Boot and the Kubernetes manifests, scrape https://<node>:9090/actuator/prometheus. Direct
CacheNodeMain deployments with the standalone metrics listener can scrape http://<node>:9090/metrics. Sample
Grafana dashboards live in grafana/.
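Before wiring dashboards, it can help to confirm what a node actually exports. The sketch below fetches a scrape endpoint with the JDK HttpClient and prints the sample lines whose names start with a prefix. MetricsScrapeSketch is hypothetical, the URL is taken from the first command-line argument, and the loomcache_ prefix is an assumption based on the exported name loomcache_datastructures_max_instances_per_type mentioned earlier; other metrics may be exported under different names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

/** Hypothetical smoke check for a metrics endpoint; not shipped with LoomCache. */
final class MetricsScrapeSketch {

    /** Keeps sample lines (not "# HELP" or "# TYPE" comments) whose metric name starts with prefix. */
    static List<String> samplesWithPrefix(String expositionText, String prefix) {
        return expositionText.lines()
                .filter(line -> !line.startsWith("#"))
                .filter(line -> line.startsWith(prefix))
                .toList();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the scrape URL of one node, for example its standalone /metrics endpoint.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        samplesWithPrefix(body, "loomcache_").forEach(System.out::println);
    }
}
```

The HTTPS actuator endpoint may additionally need a trust store and credentials configured on the HttpClient, which this sketch omits.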