Monitoring & Observability

LoomCache ships several observability surfaces: Micrometer → Prometheus, JMX MXBeans, health checks, client statistics, slow-operation detection, and Grafana dashboards.

[Diagram: Observability Pipeline. Inside a LoomCache node, the Raft, Map, and Network subsystems emit metrics (for example loomcache.raft.is_leader, loomcache.cache.hits, and loomcache.tcp.connections.active) into the Micrometer registry. Prometheus scrapes /actuator/prometheus on port 8080 every 15s and feeds Grafana (loomcache-overview.json). The JMX MXBeans (LoomCacheMXBean, LoomClusterMXBean) are read by JConsole / VisualVM over JMX remote, configured through JAVA_OPTS.]

Metrics

Spring Boot deployments

Spring Boot deployments expose Prometheus metrics on the Actuator endpoint. Actuator, including /actuator/prometheus, is served on the Spring Boot web port (server.port, 8080 in the samples) over the same HTTPS listener as the REST API, so scrape https://<node>:8080/actuator/prometheus. The following scrape configuration targets a three-node cluster:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'loomcache'
    scheme: https
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'loomcache-node1:8080'
          - 'loomcache-node2:8080'
          - 'loomcache-node3:8080'

Standalone / embedded deployments

Standalone and embedded deployments should keep node-local observability endpoints private and publish production Prometheus telemetry through a protected Spring Boot management surface or an environment-owned exporter. Do not expose an unauthenticated node-local scrape listener on a routable interface. The direct metrics listener binds to localhost by default; remote scraping requires an explicit metrics bind address, metrics basic auth, and TLS termination or an approved plaintext network segment.

Alert rules

Example Prometheus alerting rules ship at prometheus/loomcache.rules.yml. Add the file to Prometheus with:

rule_files:
  - prometheus/loomcache.rules.yml

The sample rules cover missing/multiple Raft leaders, replication lag, state-apply errors, low cache hit rate, WAL/snapshot validation failures, TCP send errors, rejected handshakes, slow operations, contained boundary exceptions, cross-group 2PC recovery activity and orphan-commit give-ups, post-commit flush failures, and partition backup promotions. The 2PC and partition rules only fire when sharding/cross-group transactions are enabled.
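The missing/multiple-leader rules can be pictured as a count over the per-node loomcache.raft.is_leader gauge. The Python sketch below illustrates that idea only. It assumes the gauge reports 1 on a node that believes it is leader and 0 otherwise; the function name and the returned labels are hypothetical and are not taken from the shipped rules file.

```python
from typing import Mapping


def classify_leadership(is_leader_by_node: Mapping[str, float]) -> str:
    """Classify a cluster from per-node is_leader gauge samples.

    Assumption: each node reports 1 when it believes it is the Raft leader
    and 0 otherwise. The labels returned here are illustrative only.
    """
    leaders = sum(1 for value in is_leader_by_node.values() if value >= 1)
    if leaders == 0:
        return "no_leader"
    if leaders > 1:
        return "multiple_leaders"
    return "ok"
```

An election briefly leaves a cluster without a leader, so a real alert would normally require the condition to persist for some time before firing.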

Core metric families

Requests — request counts / latencies / errors broken down by request type: loomcache.command.parse_seconds, loomcache.command.queue_wait_seconds, loomcache.command.raft_commit_seconds, loomcache.command.apply_seconds.
Raft — loomcache.raft.current_term, loomcache.raft.commit_index, loomcache.raft.last_applied, loomcache.raft.log_size, loomcache.raft.is_leader, loomcache.raft.pending_requests, loomcache.raft.elections.total, loomcache.raft.commands.committed, loomcache.raft.state_apply.errors, loomcache.raft.replication.lag, loomcache.raft.commit_latency_seconds, loomcache.raft.replication_lag_seconds, loomcache.raft.follower_lag_ms (per follower peer), loomcache.raft.follower_match_index (per follower peer).
Persistence — loomcache.raft.snapshot_install_seconds, loomcache.raft.snapshot_save_seconds, loomcache.raft.snapshot_load_seconds, loomcache.raft.snapshot_validation_failures_total, loomcache.raft.snapshot_chunks_total, loomcache.raft.fsync_batch_size, loomcache.wal.crc_validation_failures_total.
Cluster — node count, partitioned peers, membership churn.
Connections — loomcache.tcp.connections.active, loomcache.tcp.messages.sent, loomcache.tcp.messages.received, loomcache.tcp.send.errors.
Data structures — loomcache.cache.entries, loomcache.cache.hits, loomcache.cache.misses, loomcache.cache.evictions (tagged by map; use map="__total" for aggregate dashboards and alerts), and loomcache_datastructures_max_instances_per_type (Prometheus wire form), the per-type instance cap.
Partition migration — loomcache.partition.migration.duration_seconds, loomcache.partition.migration.bytes, loomcache.partition.migration.keys, loomcache.partition.backup_promotion.count, loomcache.partition.backup_promotion.duration_seconds.
CP subsystem — loomcache.cp.active_locks, loomcache.cp.semaphore_permits_available, loomcache.cp.pending_operations.
Transactions / 2PC — loomcache.tx.crossgroup.commit_latency_seconds, loomcache.tx.crossgroup.prepare_duration_seconds, loomcache.tx.crossgroup.decide_duration_seconds, loomcache.tx.crossgroup.aborts (tagged by reason: vote_no, timeout, participant_unreachable, other), loomcache.tx.crossgroup.recoveries (tagged by cause), loomcache.tx.twopc.orphan_commit_giveups (participants that voted COMMIT but exhausted DECIDE_QUERY retries and entered manual-recovery state).
Indexes — loomcache.indexes.skipped_query_count, loomcache.indexes.no_matching_index_query_count, loomcache.indexes.partitions_indexed.
Handshake — loomcache.handshake.attempts, loomcache.handshake.failures, loomcache.handshake.rejected (Prometheus wire form: loomcache_handshake_rejected).
Batch — loomcache.batch.post_commit_flush_failures, tagged by structure (map, set, or queue). This is the alertable surface for a buffered side-effect flush that failed after the atomic batch committed and the client was ACKed: the in-memory and Raft state hold the write, but an external store may now be stale and need reconciliation.
Slow operations — active slow-detector scopes plus detected count and duration histograms.
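As an illustration of how the cache counters combine, the following Python sketch reads a hit ratio out of a Prometheus text scrape using the map="__total" aggregate. The wire names loomcache_cache_hits_total and loomcache_cache_misses_total are assumptions about how the exporter renders loomcache.cache.hits and loomcache.cache.misses, so check a real scrape before relying on them. The line parser is deliberately simplified and the helper names are hypothetical.

```python
import re
from typing import Dict, Optional

# Simplified matcher for one exposition line: name, optional {labels}, value.
# It does not handle "}" inside label values.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def read_sample(text: str, name: str, labels: Dict[str, str]) -> Optional[float]:
    """Return the first sample matching name and labels, or None."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group(1) != name:
            continue
        found = dict(_LABEL.findall(match.group(2) or ""))
        if all(found.get(key) == value for key, value in labels.items()):
            return float(match.group(3))
    return None


def hit_rate(
    text: str,
    map_name: str = "__total",
    hits_metric: str = "loomcache_cache_hits_total",      # assumed wire name
    misses_metric: str = "loomcache_cache_misses_total",  # assumed wire name
) -> Optional[float]:
    """Return hits / (hits + misses) for one map, or None if unavailable."""
    labels = {"map": map_name}
    hits = read_sample(text, hits_metric, labels)
    misses = read_sample(text, misses_metric, labels)
    if hits is None or misses is None or hits + misses == 0:
        return None
    return hits / (hits + misses)
```

This yields the lifetime ratio from a single scrape. Dashboards and alerts usually compare the increase of both counters over a time window instead.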

LoomCache does not currently emit a certificate-expiration Prometheus gauge. The certExpirationWarningDays / certExpirationCriticalDays thresholds on TlsConfig are configuration only — track notAfter lifetimes externally and rotate before the critical window.

JMX MXBeans

LoomCacheMXBean — per-node cache stats and Raft health.
LoomClusterMXBean — cluster membership, operational state, Raft group summaries, and aggregate cache counters from the member’s live cluster view.
JCacheStatisticsMXBean, JCacheConfigMXBean — JCache (JSR-107) scaffolding surfaces.

Enable JMX remote in JAVA_OPTS to connect with JConsole, VisualVM, or Flight Recorder.

Health

Direct-node health

The direct-node health listener serves probe endpoints on port 5702 (HTTP) and optionally port 5703 (HTTPS/TLS). Both GET and HEAD are accepted. Docker/Spring Boot production templates set loomcache.admin.enabled=false and use Actuator health on 8080; enable the direct listener only for a private node-local diagnostics plane.

Path	Description
`/health`	Combined liveness + readiness: `{"status":"UP"}` (200) or `{"status":"DOWN"}` (503).
`/health/live`	Liveness-only probe: `{"status":"UP"}` or `{"status":"DOWN"}`.
`/ready`	Readiness-only probe: same shape as `/health/live`.
`/health/readiness`	Detailed readiness JSON: includes `subsystems.raft`, `subsystems.wal`, `subsystems.wan`, and (when requested over TLS with a verified mTLS client certificate) `subsystems.migrationQueue` and `subsystems.partitionOwners`. Requires TLS — HTTP requests to this path are rejected with `403 Forbidden`.
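A probe client only needs the status code and the status field of the JSON body. The Python sketch below shows one way to interpret a /health response, assuming the 200/UP and 503/DOWN pairing described in the table. The function names are hypothetical, and the sketch is not part of any LoomCache client.

```python
import json
import urllib.error
import urllib.request
from typing import Tuple


def interpret_probe(status_code: int, body: str) -> bool:
    """True only for HTTP 200 with a JSON object whose status is "UP"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON, treat as unhealthy
    return isinstance(payload, dict) and payload.get("status") == "UP"


def fetch_probe(url: str, timeout: float = 2.0) -> Tuple[int, str]:
    """GET a probe URL and return (status_code, body), including for 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the body may still carry {"status":"DOWN"}.
        return error.code, error.read().decode("utf-8")
```

For a node-local listener this could be called as interpret_probe(*fetch_probe("http://127.0.0.1:5702/health")), where the loopback address is an assumption about where the probe runs.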

The Spring Boot starter additionally exposes LoomHealthIndicator via the Spring Actuator (/actuator/health/loomcache, named after the loomcache health-indicator bean); that path accepts HTTP from any allowed client per Spring Security configuration.

Spring Boot REST health endpoints

Path	Auth	Description
`GET /api/cluster/health`	public	Coarse `UP`/`DOWN` readiness status (node running, Raft running, leader known).
`GET /api/cluster/health/details`	ADMIN	Detailed cluster health JSON with per-subsystem states.
`GET /api/cluster/status`	ADMIN	Raft membership, term, leader, and commit index.

Grafana

Ready-to-import dashboard JSON files ship in grafana/:

grafana/loomcache-overview.json — cluster health, hit rate, command latency, TCP activity, and Raft summary panels.
grafana/loomcache-raft.json — leader status, elections, replication lag, follower match indexes, and snapshot/WAL panels.
grafana/loomcache-data-structures.json — entries, hits, misses, evictions, WAL/fsync, and data-structure trends.

grafana/provisioning/dashboards/loomcache-dashboards.yaml can be mounted into Grafana’s provisioning directory when dashboards should be installed automatically instead of imported through the UI. The provisioning config references the dashboard JSON files under …/provisioning/dashboards/loomcache/; when mounting, place the dashboard JSON files in that loomcache/ subdirectory relative to your Grafana provisioning root.

Client Statistics

Clients can opt into periodic member telemetry with LoomClient.Builder.clientStatisticsEnabled(true) or loomcache.client.statistics.enabled=true in Spring Boot. Each upload carries JVM/process, connection, and near-cache counters in a compact binary payload; members retain the latest snapshot per client id for local management surfaces.

Slow-operation detector

SlowOperationDetector is enabled by default (SlowOperationDetectorConfig.DEFAULT_ENABLED = true). It wraps both direct TCP command handling and the optional pipelined executor, samples active operation threads, records operations above the configured threshold, and retains a bounded in-memory ring of recent records with sampled stack frames.

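The defaults are enabled=true, threshold-ms=10000, sample-interval-ms=1000, max-records=1024, and max-stack-frames=32. To disable the detector or override the threshold, set the corresponding properties. The example lists all five keys, with enabled and threshold-ms changed from their defaults: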

loomcache.diagnostics.slow-operation.enabled=false
loomcache.diagnostics.slow-operation.threshold-ms=5000
loomcache.diagnostics.slow-operation.sample-interval-ms=1000
loomcache.diagnostics.slow-operation.max-records=1024
loomcache.diagnostics.slow-operation.max-stack-frames=32

Spring Boot embedded nodes use the equivalent loomcache.server.diagnostics.slow-operation.* properties.
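The threshold and bounded-ring behavior can be sketched in a few lines of Python. This is a simplified illustration, not the SlowOperationDetector implementation: it measures an operation when it completes instead of sampling active threads, and it records no stack frames. The class and method names are hypothetical; the default values mirror threshold-ms and max-records above.

```python
import time
from collections import deque
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Deque, Iterator, List


@dataclass(frozen=True)
class SlowRecord:
    name: str
    duration_ms: float


class SlowOperationLog:
    """Keeps the most recent operations that ran longer than a threshold."""

    def __init__(
        self,
        threshold_ms: float = 10_000,
        max_records: int = 1024,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold_ms = threshold_ms
        # deque(maxlen=...) drops the oldest record once the ring is full.
        self._records: Deque[SlowRecord] = deque(maxlen=max_records)
        self._clock = clock

    @contextmanager
    def scope(self, name: str) -> Iterator[None]:
        """Time the wrapped block and record it if it exceeds the threshold."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            if elapsed_ms > self._threshold_ms:
                self._records.append(SlowRecord(name, elapsed_ms))

    def records(self) -> List[SlowRecord]:
        return list(self._records)
```

Wrapping a handler in `with log.scope("map.put"):` records it only when it runs past the threshold, and the ring never grows beyond max_records entries.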

Repeated warning suppression

Repeated warning suppression keeps recovery logs readable in the aftermath of a CP subsystem reset. The first warning for a repeated condition is emitted immediately; repeats inside the suppression window are counted and then summarized on the next emitted warning. Suppression currently applies to expired-session CP resource cleanup warnings and stale fencing-token validation warnings.
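The emit-then-count behavior described above can be sketched as a small Python class. This is an illustration under assumptions: the class name, the per-key bookkeeping, the window length, and the summary wording are all hypothetical and do not describe LoomCache's actual log output.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RepeatedWarningSuppressor:
    """Emits the first warning per key, then counts repeats inside a window."""

    def __init__(
        self,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        # key -> (time of last emitted warning, repeats suppressed since then)
        self._state: Dict[str, Tuple[float, int]] = {}

    def warn(self, key: str, message: str) -> Optional[str]:
        """Return the line to log, or None when the warning is suppressed."""
        now = self._clock()
        state = self._state.get(key)
        if state is not None and now - state[0] < self._window:
            # Still inside the window: count the repeat, emit nothing.
            self._state[key] = (state[0], state[1] + 1)
            return None
        suppressed = state[1] if state is not None else 0
        self._state[key] = (now, 0)
        if suppressed:
            return f"{message} ({suppressed} similar warnings suppressed)"
        return message
```

A caller would log the returned string when it is not None, keyed by the repeated condition (for example one key per warning type).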

Failure detector

The failure detector tracks per-peer heartbeat inter-arrival timing, computes a phi suspicion score, and flips peer availability when phi crosses the (optionally adaptive) threshold. It exports a failure_detector_phi gauge per peer. The Spring Boot /api/cluster/health endpoint returns a coarse UP/DOWN readiness status (node running, Raft running, leader known); the richer /api/cluster/status endpoint (ADMIN) reports Raft membership, term, leader, and commit index.
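For readers unfamiliar with phi-style suspicion scores, the Python sketch below shows one common formulation. It assumes heartbeat gaps follow an exponential distribution, which gives phi = elapsed / (mean_interval * ln 10). The source does not state which distribution or threshold LoomCache uses, so the model, the fixed threshold of 8.0, the window size, and all names here are assumptions, and the adaptive threshold is not modeled.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiFailureDetector:
    """Per-peer suspicion score from heartbeat inter-arrival times."""

    def __init__(self, threshold: float = 8.0, window: int = 100) -> None:
        self._threshold = threshold  # assumed fixed threshold
        self._intervals: Deque[float] = deque(maxlen=window)
        self._last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self._last_heartbeat is not None:
            self._intervals.append(now - self._last_heartbeat)
        self._last_heartbeat = now

    def phi(self, now: float) -> float:
        """Return -log10 of the probability that a heartbeat is this late."""
        if self._last_heartbeat is None or not self._intervals:
            return 0.0  # no history yet, so no suspicion
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0  # degenerate history; report no suspicion
        elapsed = now - self._last_heartbeat
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean).
        return elapsed / (mean * math.log(10))

    def is_available(self, now: float) -> bool:
        return self.phi(now) < self._threshold
```

With heartbeats arriving once per second, phi reaches 1 after about 2.3 seconds of silence and crosses 8 after about 18.4 seconds, at which point this sketch would mark the peer unavailable.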

For listener and port allocation, see Default Ports. For the health and diagnostics properties, see the Configuration Reference.

LoomCache is an independent open-source project. It is not affiliated with, endorsed by, or sponsored by Hazelcast, Inc. or by any other company whose products are named in this documentation. “Hazelcast” is a trademark of Hazelcast, Inc.; references to it are nominative and describe only migration and comparison. All other product and company names are trademarks of their respective owners and are used for identification purposes only.