Monitoring & Observability

LoomCache ships several observability surfaces: Micrometer → Prometheus, JMX MXBeans, an in-process dashboard, a distributed tracer with an optional OpenTelemetry bridge, a slow-operation detector, and a slow-query log.

Spring Boot deployments expose Prometheus on the Actuator endpoint. The Kubernetes manifests publish service port 9090 to that HTTPS listener, so scrape https://<node>:9090/actuator/prometheus when using those manifests. Direct CacheNodeMain deployments can instead enable the standalone metrics listener on http://<node>:9090/metrics.

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'loomcache'
    scheme: https
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'loomcache-node1:9090'
          - 'loomcache-node2:9090'
          - 'loomcache-node3:9090'

Example Prometheus alerting rules ship at prometheus/loomcache.rules.yml. Add the file to Prometheus with:

rule_files:
  - prometheus/loomcache.rules.yml

The sample rules cover missing/multiple Raft leaders, replication lag, state-apply errors, low cache hit rate, WAL/snapshot validation failures, TCP send errors, rejected handshakes, slow operations, and contained boundary exceptions.

Exported metrics fall into the following groups:

  • Requests — request counts / latencies / errors, broken down by opcode.
  • Raft — current term, leader changes, append-entries latency, log sizes, commit latency.
  • Persistence — WAL append rate, fsync latency, snapshot duration.
  • Cluster — node count, partitioned peers, membership churn.
  • Connections — pool utilization, TCP errors, per-client rate-limiter decisions.
  • Data structures — size, memory, listener counts per named instance, and loomcache_datastructures_max_instances_per_type, the Prometheus-exported per-type instance cap.
  • Partition migration — migration bytes/keys/duration plus departed-owner backup-promotion count/time.
  • Slow operations — active slow-detector scopes plus detected count and duration histograms.
  • TLS — tls.cert.expiration.days; alert when the value falls below certExpirationCriticalDays.

The following JMX MXBeans are available:

  • LoomCacheMXBean (loom-server/.../metrics/LoomCacheMXBean.java) — per-node cache stats and Raft health.
  • LoomClusterMXBean (loom-server/.../metrics/LoomClusterMXBean.java) — cluster membership, operational state, Raft group summaries, and aggregate cache counters from the member’s live cluster view.
  • JCacheStatisticsMXBean, JCacheConfigMXBean — JCache (JSR-107) scaffolding surfaces.

Enable remote JMX in JAVA_OPTS to inspect these beans from JConsole, VisualVM, or JDK Mission Control (Flight Recorder).
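
MXBean attributes can also be read programmatically. The sketch below shows one way to read a numeric attribute through an MBeanServerConnection. The ObjectName under which LoomCacheMXBean is registered is not stated here, so the example reads the JVM's own java.lang:type=Runtime bean as a stand-in. JmxAttributeReader and its methods are illustrative names, not part of LoomCache.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Illustrative helper (hypothetical name): reads one numeric MBean attribute.
final class JmxAttributeReader {

    // Works with any MBeanServerConnection, local or remote.
    static long readLong(MBeanServerConnection conn, String objectName, String attribute) {
        try {
            Object value = conn.getAttribute(new ObjectName(objectName), attribute);
            return ((Number) value).longValue();
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read " + attribute + " of " + objectName, e);
        }
    }

    // Stand-in target: the platform Runtime MXBean, which every JVM registers.
    static long localUptimeMillis() {
        return readLong(ManagementFactory.getPlatformMBeanServer(),
                "java.lang:type=Runtime", "Uptime");
    }
}
```

Against a remote node, the same readLong call would take the connection returned by JMXConnectorFactory.connect(...).getMBeanServerConnection(), once remote JMX has been enabled with the standard com.sun.management.jmxremote.* system properties.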

HealthCheckServer (loom-server/.../health/HealthCheckServer.java) serves liveness and readiness endpoints suitable for Kubernetes probes. The Spring Boot starter additionally exposes LoomHealthIndicator through Spring Boot Actuator.

RestApiServer and RestApiRouter mount:

  • /dashboard — HTML dashboard (served by ManagementApiHandler + DashboardHtml).
  • /api/v1/management/* — JSON management endpoints backing the dashboard.

The dashboard includes a cluster view backed by /api/v1/management/topology: cluster ID, local member, leader, Raft term/commit index, operational state, member liveness, addresses, and member attributes. Use Grafana/Prometheus for longer-term retention, alerting, and multi-cluster dashboards.

Ready-to-import dashboard JSON files ship in grafana/:

  • grafana/loomcache-overview.json — cluster health, hit rate, command latency, TCP activity, and Raft summary panels.
  • grafana/loomcache-raft.json — leader status, elections, replication lag, follower match indexes, and snapshot/WAL panels.
  • grafana/loomcache-data-structures.json — entries, hits, misses, evictions, WAL/fsync, and data-structure trends.

grafana/provisioning/dashboards/loomcache-dashboards.yaml can be mounted into Grafana’s provisioning directory when dashboards should be installed automatically instead of imported through the UI.

Clients can opt into periodic member telemetry with LoomClient.Builder.clientStatisticsEnabled(true) or loomcache.client.statistics.enabled=true in Spring Boot. Each upload carries JVM/process, connection, and near-cache counters in a compact binary payload; members retain the latest snapshot per client id for local management surfaces.

The distributed tracer's components live in loom-server/.../tracing:

  • SpanManager — span lifecycle.
  • TraceSampler — configurable sampling.
  • InMemoryTraceExporter — default exporter; useful for tests.
  • TraceAnalyzer — after-the-fact analysis.
  • TracingInterceptor — adds spans around incoming commands.
  • TraceContextPropagator — propagates trace context across Raft messages and Kafka-like queues.
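
As an illustration of what configurable sampling can look like, here is a minimal sketch of a ratio-based sampler that decides from the trace ID alone, so every node reaches the same decision for a given trace. It is not the TraceSampler implementation. RatioTraceSampler is a hypothetical name, and the sketch assumes 64-bit trace IDs that are already uniformly distributed.

```java
// Hypothetical ratio-based sampler; not the LoomCache TraceSampler.
final class RatioTraceSampler {
    private final double ratio;
    private final long threshold;

    RatioTraceSampler(double ratio) {
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.ratio = ratio;
        // Map the ratio onto the non-negative long range.
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    // Deterministic: the same trace ID always yields the same decision.
    boolean shouldSample(long traceId) {
        if (ratio >= 1.0) {
            return true; // sample everything
        }
        // Clear the sign bit so the comparison runs over [0, Long.MAX_VALUE].
        return (traceId & Long.MAX_VALUE) < threshold;
    }
}
```

Deciding from the trace ID rather than a fresh random number keeps a trace either fully sampled or fully dropped as it crosses nodes.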

An optional OpenTelemetry bridge lives in loom-server/.../observability (OpenTelemetryBridge, ReflectiveOpenTelemetryBridge). It is enabled automatically when the OTel API is on the classpath.
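
Classpath-based activation of this kind is commonly done with a reflective lookup that never links against the optional library. The sketch below shows that general technique under stated assumptions. OptionalApiDetector is a hypothetical name, and how ReflectiveOpenTelemetryBridge actually performs its check is not described here. io.opentelemetry.api.OpenTelemetry is the entry-point interface of the OpenTelemetry Java API.

```java
// Hypothetical helper illustrating reflective detection of an optional dependency.
final class OptionalApiDetector {

    // Returns true if the class can be located, without initializing it.
    static boolean isPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false; // absent or unusable: treat the integration as disabled
        }
    }

    // Probe for the OpenTelemetry API entry point.
    static boolean openTelemetryApiPresent() {
        return isPresent("io.opentelemetry.api.OpenTelemetry",
                OptionalApiDetector.class.getClassLoader());
    }
}
```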

SlowQueryLog (loom-server/.../query/SlowQueryLog.java) captures SQL queries that exceed the configured threshold. This is useful when tuning the SQL engine or adding indexes.

SlowOperationDetector (loom-server/.../observability/SlowOperationDetector.java) is an opt-in diagnostic plugin for latency incidents. It wraps both direct TCP command handling and the optional pipelined executor, samples active operation threads, records operations above the configured threshold, and retains a bounded in-memory ring of recent records with sampled stack frames.

Nodes configured through properties or YAML can enable it with:

loomcache.diagnostics.slow-operation.enabled=true
loomcache.diagnostics.slow-operation.threshold-ms=10000
loomcache.diagnostics.slow-operation.sample-interval-ms=1000
loomcache.diagnostics.slow-operation.max-records=1024
loomcache.diagnostics.slow-operation.max-stack-frames=32

Spring Boot embedded nodes use the equivalent loomcache.server.diagnostics.slow-operation.* properties.
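
The roles of threshold-ms and max-records can be sketched as a threshold filter in front of a bounded ring that evicts the oldest record first. This is a simplified illustration, not the SlowOperationDetector code. SlowOperationRing and its members are hypothetical names, thread sampling and stack capture are omitted, and "above the threshold" is assumed to mean strictly greater.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of threshold filtering plus bounded retention.
final class SlowOperationRing {

    static final class Entry {
        final String operation;
        final long durationMs;

        Entry(String operation, long durationMs) {
            this.operation = operation;
            this.durationMs = durationMs;
        }
    }

    private final long thresholdMs;
    private final int maxRecords;
    private final Deque<Entry> ring = new ArrayDeque<>();

    SlowOperationRing(long thresholdMs, int maxRecords) {
        if (maxRecords < 1) {
            throw new IllegalArgumentException("maxRecords must be >= 1");
        }
        this.thresholdMs = thresholdMs;
        this.maxRecords = maxRecords;
    }

    // Records the operation only if it ran longer than the threshold.
    synchronized boolean record(String operation, long durationMs) {
        if (durationMs <= thresholdMs) {
            return false;
        }
        if (ring.size() == maxRecords) {
            ring.removeFirst(); // drop the oldest record to stay bounded
        }
        ring.addLast(new Entry(operation, durationMs));
        return true;
    }

    // Returns a copy ordered from oldest to newest.
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(ring);
    }
}
```

Bounding the ring keeps memory use fixed during a long incident, at the cost of losing the oldest records.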

FrequentLogSuppressor (loom-server/.../logging/FrequentLogSuppressor.java) keeps recovery logs readable during CP subsystem reset fallout. The first warning for a repeated condition is emitted immediately; repeats inside the suppression window are counted and summarized on the next emitted warning. It is currently applied to expired-session CP resource cleanup and stale fencing-token validation warnings.
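
The emit-then-summarize behaviour described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the FrequentLogSuppressor source. LogSuppressorSketch, its offer method, and the summary wording are hypothetical, and the caller supplies the clock so the logic stays deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of windowed log suppression with a repeat summary.
final class LogSuppressorSketch {

    private static final class State {
        long lastEmitMs;
        long suppressed;
    }

    private final long windowMs;
    private final Map<String, State> states = new HashMap<>();

    LogSuppressorSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    // Returns the text to log, or null when the warning is suppressed.
    synchronized String offer(String key, String message, long nowMs) {
        State state = states.get(key);
        if (state == null) {
            // First occurrence of this condition: emit immediately.
            state = new State();
            state.lastEmitMs = nowMs;
            states.put(key, state);
            return message;
        }
        if (nowMs - state.lastEmitMs < windowMs) {
            state.suppressed++; // inside the window: count, do not log
            return null;
        }
        long skipped = state.suppressed;
        state.suppressed = 0;
        state.lastEmitMs = nowMs;
        return skipped == 0
                ? message
                : message + " (suppressed " + skipped + " similar warnings)";
    }
}
```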

ClusterHealthMonitor (loom-server/.../cluster) tracks heartbeat timing, phi-accrual scores, and node status. The /api/v1/cluster/health REST endpoint (and Spring Boot /api/cluster/health) exposes its current view.
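
For readers unfamiliar with phi-accrual scoring, the sketch below computes phi under a simplifying assumption: heartbeat intervals are exponentially distributed, so the probability that a heartbeat arrives later than t is exp(-t / mean) and phi = -log10(exp(-t / mean)) = t / (mean · ln 10). The original phi-accrual formulation models intervals with a normal distribution, and which model ClusterHealthMonitor uses is not stated here. PhiAccrualSketch and the threshold parameter are hypothetical.

```java
// Hypothetical phi computation using the exponential approximation.
final class PhiAccrualSketch {

    // elapsedMs: time since the last heartbeat; meanIntervalMs: observed mean interval.
    static double phi(long elapsedMs, double meanIntervalMs) {
        if (elapsedMs < 0 || meanIntervalMs <= 0.0) {
            throw new IllegalArgumentException("elapsedMs must be >= 0 and meanIntervalMs > 0");
        }
        return elapsedMs / (meanIntervalMs * Math.log(10.0));
    }

    // A node is suspected once phi reaches the chosen threshold.
    static boolean isSuspect(long elapsedMs, double meanIntervalMs, double threshold) {
        return phi(elapsedMs, meanIntervalMs) >= threshold;
    }
}
```

Because phi grows continuously with silence, the threshold trades detection speed against false suspicions instead of imposing a fixed timeout.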