Monitoring & Observability
LoomCache ships several observability surfaces: Micrometer → Prometheus, JMX MXBeans, an in-process dashboard, a distributed tracer with an optional OpenTelemetry bridge, a slow-operation detector, and a slow-query log.
Metrics
Spring Boot deployments expose Prometheus on the Actuator endpoint. The Kubernetes manifests publish service port
9090 to that HTTPS listener, so scrape https://<node>:9090/actuator/prometheus when using those manifests.
Direct CacheNodeMain deployments can instead enable the standalone metrics listener on
http://<node>:9090/metrics.
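For ad hoc checks outside Prometheus, either endpoint can be read with a few lines of Java. The sketch below is a hypothetical helper, not part of LoomCache: it fetches a URL with the JDK HTTP client and parses the plain Prometheus text exposition format into a map. It assumes the default JVM trust store accepts the node certificate when the HTTPS endpoint is used, and it ignores OpenMetrics extensions such as exemplars.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a LoomCache class): reads Prometheus text exposition output. */
final class PrometheusTextParser {

    /** Fetches the raw exposition text, e.g. from http://<node>:9090/metrics. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Maps each series (name plus any {labels}) to its sample value. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Label values may contain spaces, so look for the separator after the last '}'.
            int close = trimmed.lastIndexOf('}');
            int split = trimmed.indexOf(' ', close + 1);
            if (split < 0) {
                continue;
            }
            String series = trimmed.substring(0, split);
            // The first token is the value; an optional timestamp may follow.
            String[] rest = trimmed.substring(split + 1).trim().split("\\s+");
            try {
                samples.put(series, parseValue(rest[0]));
            } catch (NumberFormatException malformed) {
                // Ignore lines this simple parser does not understand.
            }
        }
        return samples;
    }

    private static double parseValue(String token) {
        switch (token) {
            case "+Inf":
            case "Inf":
                return Double.POSITIVE_INFINITY;
            case "-Inf":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(token); // also accepts "NaN"
        }
    }
}
```

Calling `PrometheusTextParser.parse(PrometheusTextParser.fetch(url))` returns the series in the order the node exported them, which is usually enough to confirm that a listener is up and emitting the expected families.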
Scrape config
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'loomcache'
        scheme: https
        metrics_path: '/actuator/prometheus'
        static_configs:
          - targets:
              - 'loomcache-node1:9090'
              - 'loomcache-node2:9090'
              - 'loomcache-node3:9090'

Alert rules
Example Prometheus alerting rules ship at prometheus/loomcache.rules.yml. Add the file to Prometheus with:

    rule_files:
      - prometheus/loomcache.rules.yml

The sample rules cover missing/multiple Raft leaders, replication lag, state-apply errors, low cache hit rate, WAL/snapshot validation failures, TCP send errors, rejected handshakes, slow operations, and contained boundary exceptions.
Core metric families
- Requests: request counts, latencies, and errors, broken down by opcode.
- Raft: current term, leader changes, append-entries latency, log sizes, commit latency.
- Persistence: WAL append rate, fsync latency, snapshot duration.
- Cluster: node count, partitioned peers, membership churn.
- Connections: pool utilization, TCP errors, per-client rate-limiter decisions.
- Data structures: size, memory, and listener counts per named instance, plus
  loomcache_datastructures_max_instances_per_type, the Prometheus-exported per-type instance cap.
- Partition migration: migration bytes/keys/duration plus departed-owner backup-promotion count/time.
- Slow operations: active slow-detector scopes plus detected count and duration histograms.
- TLS: tls.cert.expiration.days; alert when it drops below certExpirationCriticalDays.
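To make the TLS threshold concrete, the sketch below shows one way a days-until-expiry value can be derived and compared against a critical threshold. It is an illustration only: the class and method names are hypothetical, and how LoomCache itself computes tls.cert.expiration.days is not described here. In practice the `notAfter` instant would come from `X509Certificate.getNotAfter().toInstant()`.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not a LoomCache class): days-until-expiry arithmetic. */
final class CertExpiryCheck {

    /** Whole days between now and the certificate's notAfter instant (negative once expired). */
    static long daysUntilExpiry(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).toDays();
    }

    /** True when the remaining lifetime is below the critical threshold, in days. */
    static boolean isCritical(Instant notAfter, Instant now, long criticalDays) {
        return daysUntilExpiry(notAfter, now) < criticalDays;
    }
}
```

Passing the clock in as a parameter keeps the check deterministic, which makes the alert boundary easy to unit test.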
JMX MXBeans
- LoomCacheMXBean (loom-server/.../metrics/LoomCacheMXBean.java): per-node cache stats and Raft health.
- LoomClusterMXBean (loom-server/.../metrics/LoomClusterMXBean.java): cluster membership, operational state, Raft group summaries, and aggregate cache counters from the member's live cluster view.
- JCacheStatisticsMXBean, JCacheConfigMXBean: JCache (JSR-107) scaffolding surfaces.
Enable remote JMX in JAVA_OPTS to read these MXBeans from JConsole, VisualVM, or Flight Recorder.
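MXBeans can also be read programmatically. The sketch below uses only the standard `javax.management` API against the in-process platform MBean server; the helper class is hypothetical and not part of LoomCache. The JMX domain under which LoomCache registers its beans is not stated on this page, so the object-name pattern to pass in is an assumption to verify in JConsole first.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper (not a LoomCache class): reads MBeans from the local platform server. */
final class JmxAttributeReader {

    /** Lists registered object names matching a pattern such as "java.lang:*". */
    static Set<ObjectName> find(String pattern) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.queryNames(new ObjectName(pattern), null);
        } catch (JMException e) {
            throw new IllegalStateException("Bad object name pattern: " + pattern, e);
        }
    }

    /** Reads a single attribute from one MBean. */
    static Object read(String objectName, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.getAttribute(new ObjectName(objectName), attribute);
        } catch (JMException e) {
            throw new IllegalStateException("Cannot read " + attribute + " from " + objectName, e);
        }
    }
}
```

For example, `JmxAttributeReader.read("java.lang:type=Runtime", "VmName")` returns the JVM name from a standard platform bean. Reading from another process would instead go through `JMXConnectorFactory` and an `MBeanServerConnection`.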
Health
HealthCheckServer (loom-server/.../health/HealthCheckServer.java) serves liveness / readiness endpoints suitable
for Kubernetes probes. The Spring Boot starter additionally exposes LoomHealthIndicator via the Spring Actuator.
Dashboard
RestApiServer + RestApiRouter mount:
- /dashboard: HTML dashboard (served by ManagementApiHandler + DashboardHtml).
- /api/v1/management/*: JSON management endpoints backing the dashboard.
The dashboard includes a cluster view backed by /api/v1/management/topology: cluster ID, local member, leader, Raft
term/commit index, operational state, member liveness, addresses, and member attributes. Use Grafana/Prometheus for
longer-term retention, alerting, and multi-cluster dashboards.
Grafana
Ready-to-import dashboard JSON files ship in grafana/:
- grafana/loomcache-overview.json: cluster health, hit rate, command latency, TCP activity, and Raft summary panels.
- grafana/loomcache-raft.json: leader status, elections, replication lag, follower match indexes, and snapshot/WAL panels.
- grafana/loomcache-data-structures.json: entries, hits, misses, evictions, WAL/fsync, and data-structure trends.
grafana/provisioning/dashboards/loomcache-dashboards.yaml can be mounted into Grafana’s provisioning directory when
dashboards should be installed automatically instead of imported through the UI.
Client Statistics
Clients can opt into periodic member telemetry with LoomClient.Builder.clientStatisticsEnabled(true) or
loomcache.client.statistics.enabled=true in Spring Boot. Each upload carries JVM/process, connection, and near-cache
counters in a compact binary payload; members retain the latest snapshot per client id for local management surfaces.
Distributed tracing
loom-server/.../tracing:
- SpanManager: span lifecycle.
- TraceSampler: configurable sampling.
- InMemoryTraceExporter: default exporter; useful for tests.
- TraceAnalyzer: after-the-fact analysis.
- TracingInterceptor: adds spans around incoming commands.
- TraceContextPropagator: propagates trace context across Raft messages and Kafka-like queues.
An optional OpenTelemetry bridge lives in loom-server/.../observability
(OpenTelemetryBridge, ReflectiveOpenTelemetryBridge). It is enabled automatically when the OpenTelemetry API is on the
classpath.
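Classpath-based activation of this kind is commonly done with a reflective probe. The sketch below shows the general technique with a hypothetical helper class; it is not the code of ReflectiveOpenTelemetryBridge, whose internals are not described here. The probed class name, `io.opentelemetry.api.OpenTelemetry`, is the entry point of the OpenTelemetry API artifact.

```java
/** Hypothetical helper (not a LoomCache class): probes for an optional dependency. */
final class OptionalDependency {

    /** Returns true when the named class can be located without initializing it. */
    static boolean isPresent(String className) {
        try {
            // initialize=false avoids running static initializers just to detect presence.
            Class.forName(className, false, OptionalDependency.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

A caller would check `OptionalDependency.isPresent("io.opentelemetry.api.OpenTelemetry")` once at startup and fall back to a no-op bridge when it returns false, so the server has no hard compile-time dependency on the OpenTelemetry API.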
Slow-query log
SlowQueryLog (loom-server/.../query/SlowQueryLog.java) captures SQL queries that exceed the configured threshold.
It is useful when tuning the SQL engine or adding indexes.
Slow-operation detector
SlowOperationDetector (loom-server/.../observability/SlowOperationDetector.java) is an opt-in diagnostic plugin for
latency incidents. It wraps both direct TCP command handling and the optional pipelined executor, samples active
operation threads, records operations above the configured threshold, and retains a bounded in-memory ring of recent
records with sampled stack frames.
Properties/YAML nodes can enable it with:

    loomcache.diagnostics.slow-operation.enabled=true
    loomcache.diagnostics.slow-operation.threshold-ms=10000
    loomcache.diagnostics.slow-operation.sample-interval-ms=1000
    loomcache.diagnostics.slow-operation.max-records=1024
    loomcache.diagnostics.slow-operation.max-stack-frames=32

Spring Boot embedded nodes use the equivalent loomcache.server.diagnostics.slow-operation.* properties.
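The interaction between threshold-ms and max-records can be pictured as a bounded ring: only operations over the threshold are kept, and the oldest record is dropped once the cap is reached. The sketch below is a simplified, hypothetical model of that behavior, not the SlowOperationDetector implementation. It omits thread sampling and stack frames, and it assumes the threshold comparison is strict.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model (not a LoomCache class) of a bounded slow-operation record ring. */
final class SlowOperationRing {

    static final class Entry {
        final String operation;
        final long durationMs;

        Entry(String operation, long durationMs) {
            this.operation = operation;
            this.durationMs = durationMs;
        }
    }

    private final long thresholdMs;
    private final int maxRecords;
    private final Deque<Entry> records = new ArrayDeque<>();

    SlowOperationRing(long thresholdMs, int maxRecords) {
        if (maxRecords <= 0) {
            throw new IllegalArgumentException("maxRecords must be positive");
        }
        this.thresholdMs = thresholdMs;
        this.maxRecords = maxRecords;
    }

    /** Keeps the operation only if it ran longer than the threshold (strictness is an assumption). */
    synchronized boolean record(String operation, long durationMs) {
        if (durationMs <= thresholdMs) {
            return false;
        }
        if (records.size() == maxRecords) {
            records.removeFirst(); // evict the oldest record to stay within the cap
        }
        records.addLast(new Entry(operation, durationMs));
        return true;
    }

    /** Returns a copy of the retained records, oldest first. */
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(records);
    }
}
```

With the values above, `new SlowOperationRing(10000, 1024)` would retain at most the 1024 most recent operations that ran longer than ten seconds, so memory use stays bounded during a long incident.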
Repeated warning suppression
FrequentLogSuppressor (loom-server/.../logging/FrequentLogSuppressor.java) keeps recovery logs readable during CP
subsystem reset fallout. The first warning for a repeated condition is emitted immediately; repeats inside the
suppression window are counted and summarized on the next emitted warning. It is currently applied to expired-session
CP resource cleanup and stale fencing-token validation warnings.
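The emit-then-summarize behavior described here can be modeled in a few lines. The sketch below is a hypothetical illustration of the technique, not the FrequentLogSuppressor source: the class name, the key-based API, and the summary wording are all assumptions. Time is passed in explicitly so the window logic is easy to test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model (not a LoomCache class) of windowed warning suppression. */
final class WindowedLogSuppressor {

    private static final class State {
        long lastEmittedMs;
        int suppressed;
    }

    private final long windowMs;
    private final Map<String, State> states = new HashMap<>();

    WindowedLogSuppressor(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Returns the text to log now, or empty when the warning falls inside the window. */
    synchronized Optional<String> offer(String key, String message, long nowMs) {
        State state = states.get(key);
        if (state == null) {
            // First occurrence of this condition: emit immediately.
            state = new State();
            state.lastEmittedMs = nowMs;
            states.put(key, state);
            return Optional.of(message);
        }
        if (nowMs - state.lastEmittedMs < windowMs) {
            state.suppressed++; // count the repeat, emit nothing
            return Optional.empty();
        }
        // Window elapsed: emit again and summarize what was swallowed in between.
        String text = state.suppressed == 0
                ? message
                : message + " (suppressed " + state.suppressed + " similar warnings)";
        state.lastEmittedMs = nowMs;
        state.suppressed = 0;
        return Optional.of(text);
    }
}
```

A caller would log only when `offer(...)` returns a value, using a stable key per condition (for example, one key for stale fencing-token validation) so unrelated warnings do not suppress each other.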
Cluster health monitor
ClusterHealthMonitor (loom-server/.../cluster) tracks heartbeat timing, phi-accrual scores, and node status. The
/api/v1/cluster/health REST endpoint (and Spring Boot /api/cluster/health) exposes its current view.
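For readers unfamiliar with phi-accrual scoring, the sketch below shows the idea in its simplest form. It is a hypothetical model, not the ClusterHealthMonitor implementation: it assumes heartbeat inter-arrival times follow an exponential distribution, so the probability that a heartbeat is still to come after `t` milliseconds is `exp(-t / mean)` and `phi = -log10(exp(-t / mean)) = t / (mean * ln 10)`. Production detectors often fit a normal distribution instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model (not a LoomCache class) of an exponential phi-accrual score. */
final class PhiAccrualSketch {

    private final int windowSize;
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long intervalSumMs;
    private long lastHeartbeatMs = -1;

    PhiAccrualSketch(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Records a heartbeat and updates the sliding window of inter-arrival times. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            intervalSumMs += interval;
            if (intervalsMs.size() > windowSize) {
                intervalSumMs -= intervalsMs.removeFirst();
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** Suspicion level: grows linearly with silence, scaled by the mean heartbeat interval. */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0; // not enough history to judge
        }
        // Floor the mean at 1 ms to avoid dividing by zero on identical timestamps.
        double meanMs = Math.max(1.0, (double) intervalSumMs / intervalsMs.size());
        double elapsedMs = Math.max(0, nowMs - lastHeartbeatMs);
        return elapsedMs / (meanMs * Math.log(10.0));
    }
}
```

Under this model a phi of 1 means roughly a 10% chance that the silence is just a late heartbeat, and a phi of 2 roughly 1%, which is why a threshold on phi adapts to each peer's observed heartbeat rhythm better than a fixed timeout.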