Class LoomMetrics

java.lang.Object
com.loomcache.server.metrics.LoomMetrics

public class LoomMetrics extends Object
LoomCache metrics instrumentation using Micrometer.

Exposes metrics across four domains:

  • Raft consensus: term, commit index, log size, election counts, replication lag, follower lag (ms since last ACK), follower match index (replication progress)
  • Command latency breakdown: parse, queue wait, Raft commit, state machine apply (nanoseconds)
  • TCP networking: active connections, messages sent/received, send errors
  • Cache data structures: entry counts, hit/miss ratios, evictions

All metrics are tagged with the node ID for multi-node deployments. Gauges use Supplier-based lambdas to pull live values from their sources. Timers are used for latency measurements and automatically record percentiles and counts.
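
The pull model described above can be sketched with the JDK alone. This is not the Micrometer API: PullGauge, the "node" tag key, and the metric name used in the usage note are hypothetical names chosen for illustration. The point is only that the gauge holds a Supplier and reads its source each time it is sampled, so it never stores a stale copy.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a registry gauge; not the Micrometer API.
final class PullGauge {
    final String name;
    final Map<String, String> tags;
    private final Supplier<? extends Number> source;

    PullGauge(String name, String nodeId, Supplier<? extends Number> source) {
        this.name = name;
        // Assumed tag key "node": every metric carries the node ID.
        this.tags = Map.of("node", nodeId);
        this.source = source;
    }

    // Called at sample time: pulls the current value from the source.
    double value() {
        return source.get().doubleValue();
    }
}
```

For example, new PullGauge("raft.term", "node-1", term::get) built over an AtomicLong reports whatever the AtomicLong holds at the moment value() is called.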

  • Method Details

    • create

      public static LoomMetrics create(io.micrometer.core.instrument.MeterRegistry registry, String nodeId)
      Static factory method to create and initialize LoomMetrics.
      Parameters:
      registry - the MeterRegistry to register metrics with (must not be null)
      nodeId - the node ID for tagging metrics (must not be null)
      Returns:
      a new LoomMetrics instance
      Throws:
      NullPointerException - if registry or nodeId is null
    • registerRaftMetrics

      public void registerRaftMetrics(RaftNodeApi raftNode)
      Register all Raft consensus metrics. These metrics track leader election, log replication, and state machine consistency.
      Parameters:
      raftNode - the RaftNode to instrument (must not be null)
      Throws:
      NullPointerException - if raftNode is null
    • registerCpSubsystemMetrics

      public void registerCpSubsystemMetrics()
      Register CP Subsystem (locks, semaphores) metrics. Tracks active locks, semaphore permits, and pending operations.
    • registerQueryMetrics

      public void registerQueryMetrics()
      Register query metrics for distributed queries and scans.
    • registerIndexMetrics

      public void registerIndexMetrics(DataStructureRegistry dataStructures, @Nullable PartitionRouter partitionRouter)
      Register index observability metrics for query planning.

      Metrics landed:

      • loomcache.indexes.skipped_query_count — queries where a matching index existed but execution scanned
      • loomcache.indexes.no_matching_index_query_count — predicate queries with no matching index
      • loomcache.indexes.partitions_indexed — local partition slots covered by at least one index catalog entry
      Parameters:
      dataStructures - registry holding SQL index metadata
      partitionRouter - partition router, or null in single-group mode
    • recordIndexesSkippedQuery

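      public void recordIndexesSkippedQuery()
      Record a query for which a matching index existed but execution scanned instead (loomcache.indexes.skipped_query_count).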
    • recordNoMatchingIndexQuery

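      public void recordNoMatchingIndexQuery()
      Record a predicate query that had no matching index (loomcache.indexes.no_matching_index_query_count).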
    • registerPipelineMetrics

      public void registerPipelineMetrics()
      Register pipeline metrics for pipelined command execution.
    • registerCommandLatencyMetrics

      public void registerCommandLatencyMetrics()
      Register command latency breakdown metrics. These timers break down the end-to-end latency of write commands into stages:

      • parse_ns: time to deserialize and validate the incoming command
      • queue_wait_ns: time spent waiting in the execution queue
      • raft_commit_ns: time for Raft consensus/replication
      • apply_ns: time to apply the command to the state machine

      Timers track count, total time, mean, and percentiles (p50, p95, p99). This enables bottleneck diagnosis: is the slowness in parsing, queueing, consensus, or the state machine apply?
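
      To illustrate how a per-stage breakdown supports bottleneck diagnosis, here is a minimal JDK-only sketch. StageBreakdown and its methods are hypothetical; the real class records these stages with Micrometer timers, and this helper is not thread-safe.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper that accumulates nanoseconds per write-path stage.
final class StageBreakdown {
    enum Stage { PARSE, QUEUE_WAIT, RAFT_COMMIT, APPLY }

    private final Map<Stage, Long> totalNanos = new EnumMap<>(Stage.class);

    // Add one observation, in nanoseconds, to a stage's running total.
    void record(Stage stage, long nanos) {
        totalNanos.merge(stage, nanos, Long::sum);
    }

    // Sum of all stage totals recorded so far.
    long endToEndNanos() {
        long sum = 0;
        for (long v : totalNanos.values()) {
            sum += v;
        }
        return sum;
    }

    // Stage with the largest accumulated time, or null if nothing was recorded.
    Stage slowest() {
        Stage worst = null;
        long max = -1;
        for (Map.Entry<Stage, Long> e : totalNanos.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                worst = e.getKey();
            }
        }
        return worst;
    }
}
```

      If RAFT_COMMIT dominates the total, the slowness is in consensus rather than in parsing, queueing, or apply.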

    • registerNetworkMetrics

      public void registerNetworkMetrics(TcpServer tcpServer)
      Register all TCP network metrics. These metrics track connection counts, message throughput, and send errors.
      Parameters:
      tcpServer - the TcpServer to instrument (must not be null)
      Throws:
      NullPointerException - if tcpServer is null
    • registerCacheMetrics

      public void registerCacheMetrics(DataStructureRegistry dataStructures)
      Register all cache data structure metrics. These metrics track cache entry counts, hit/miss ratios, and evictions.
      Parameters:
      dataStructures - the DataStructureRegistry (must not be null)
      Throws:
      NullPointerException - if dataStructures is null
    • registerSnapshotMetrics

      public void registerSnapshotMetrics()
      Register snapshot and WAL fsync batching metrics. These metrics track snapshot installation latency, save/load performance, and batch fsync behavior.
    • updateRaftMetrics

      public void updateRaftMetrics(RaftNodeApi raftNode)
      Update Raft election metrics from the RaftNode. Call this periodically or after election events.
      Parameters:
      raftNode - the RaftNode to query (must not be null)
      Throws:
      NullPointerException - if raftNode is null
    • updateFollowerLagMetrics

      public void updateFollowerLagMetrics(RaftNodeApi raftNode)
      Update per-follower replication lag metrics from RaftNode. Should be called periodically (e.g., every heartbeat interval) to track follower lag.

      Replication lag = time elapsed since last successful AppendEntries ACK. Monitor for lag > (5 * heartbeatInterval) to detect degraded followers.
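
      The lag definition and the 5x heartbeat threshold above can be written out as follows. FollowerLagCheck is a hypothetical helper, not part of LoomMetrics, and the millisecond timestamps are assumed to come from the same clock.

```java
import java.time.Duration;

// Hypothetical helper illustrating the lag rule from the description above.
final class FollowerLagCheck {
    // Lag is the time elapsed since the last successful AppendEntries ACK.
    static Duration lag(long lastAckMillis, long nowMillis) {
        return Duration.ofMillis(Math.max(0, nowMillis - lastAckMillis));
    }

    // Degraded when the lag strictly exceeds 5 heartbeat intervals.
    static boolean isDegraded(Duration lag, Duration heartbeatInterval) {
        return lag.compareTo(heartbeatInterval.multipliedBy(5)) > 0;
    }
}
```

      With a 100 ms heartbeat, a follower whose last ACK is 500 ms old is still within the threshold; at 501 ms it would be flagged.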

      Parameters:
      raftNode - the RaftNode to query for follower state (must not be null)
      Throws:
      NullPointerException - if raftNode is null
    • updateFollowerMatchIndexMetrics

      public void updateFollowerMatchIndexMetrics(RaftNodeApi raftNode)
      Update per-follower match index metrics from RaftNode. Should be called periodically to track replication progress on each follower.

      Match index = highest log index known to be replicated on this peer. A peer with low match index relative to commit index is lagging behind.
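
      As a small illustration of reading match index against commit index, the sketch below computes how many log entries each peer is behind. MatchIndexLag and its input map are hypothetical; how the real class obtains match indexes from RaftNodeApi is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: compares each peer's match index with the commit index.
final class MatchIndexLag {
    // Entries each peer still needs to reach the commit index (never negative).
    static Map<String, Long> entriesBehind(Map<String, Long> matchIndexByPeer, long commitIndex) {
        Map<String, Long> behind = new LinkedHashMap<>();
        matchIndexByPeer.forEach((peer, matchIndex) ->
                behind.put(peer, Math.max(0L, commitIndex - matchIndex)));
        return behind;
    }
}
```

      A peer whose value stays well above zero across successive samples is the one lagging behind.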

      Parameters:
      raftNode - the RaftNode to query for match index state (must not be null)
      Throws:
      NullPointerException - if raftNode is null
    • updateNetworkMetrics

      public void updateNetworkMetrics(TcpServer tcpServer)
      Update TCP network metrics from the TcpServer. Call this periodically or after significant network events.
      Parameters:
      tcpServer - the TcpServer to query (must not be null)
      Throws:
      NullPointerException - if tcpServer is null
    • updateCacheMetrics

      public void updateCacheMetrics(@Nullable DistributedMap<?,?> map)
      Update cache metrics from the DistributedMap. Call this periodically or after cache operations.
      Parameters:
      map - the DistributedMap to query (can be null)
    • removePeerMetrics

      public void removePeerMetrics(String peerId)
      Remove all metric gauges for a peer that has left the cluster. Removes entries from the backing maps and deregisters the corresponding Micrometer Gauge instances to prevent memory leaks and metric pollution.
      Parameters:
      peerId - the ID of the peer to remove metrics for (must not be null)
      Throws:
      NullPointerException - if peerId is null
    • cleanupStalePeerMetrics

      public void cleanupStalePeerMetrics(Set<String> activePeers)
      Remove metrics for peers that are no longer in the active cluster membership. This is a safety-net sweep intended to be called periodically (e.g., every 5 minutes) to catch any peers whose removal was missed by the direct removePeerMetrics() calls.
      Parameters:
      activePeers - the current set of active cluster member IDs (must not be null)
      Throws:
      NullPointerException - if activePeers is null
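
      The safety-net sweep can be pictured as a set difference over the backing map. The sketch below is an assumption about the shape of that logic, using a hypothetical PeerGaugeSweeper class and a single lagByPeer map; the real method would also deregister the matching gauges from the registry.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a stale-peer sweep over one backing map.
final class PeerGaugeSweeper {
    // Backing map: peer ID -> latest lag value read by that peer's gauge.
    final Map<String, Long> lagByPeer = new ConcurrentHashMap<>();

    // Drop every tracked peer not in activePeers; returns the IDs that were removed.
    Set<String> sweep(Set<String> activePeers) {
        Set<String> removed = new TreeSet<>();
        // ConcurrentHashMap iterators tolerate removal during iteration.
        for (String peerId : lagByPeer.keySet()) {
            if (!activePeers.contains(peerId)) {
                lagByPeer.remove(peerId);
                removed.add(peerId);
            }
        }
        return removed;
    }
}
```

      Returning the removed IDs is a choice made for this sketch so that a caller can log what the sweep caught.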
    • registerTransactionMetrics

      public void registerTransactionMetrics()
      Register the cross-group 2PC latency histograms and abort/recovery counters described in BLK-2026-04-22-007 Day 9. Call from CacheNode's metrics-wiring phase once the TwoPhaseCoordinator and Participant exist.

      Metrics landed:

      • loomcache.tx.crossgroup.commit.latency — end-to-end TX duration
      • loomcache.tx.crossgroup.prepare.duration — PREPARE phase wall-time
      • loomcache.tx.crossgroup.decide.duration — DECIDE phase wall-time
      • loomcache.tx.crossgroup.aborts{reason} — aborts by cause (vote_no, timeout, participant_unreachable)
      • loomcache.tx.crossgroup.recoveries{cause} — recoveries by cause (coord_crash, participant_crash)
    • recordCrossGroupAbort

      public void recordCrossGroupAbort(String reason)
      Record a cross-group 2PC abort tagged by reason.
      Parameters:
      reason - one of vote_no, timeout, participant_unreachable
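
      A counter tagged by reason behaves like one counter per reason value. The sketch below models that with the JDK only; AbortCounters is hypothetical, and rejecting reasons outside the documented set is an assumption of this sketch (the documentation does not say how the real method treats unknown values).

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-reason abort counters; not the LoomMetrics implementation.
final class AbortCounters {
    // Reason values listed in the parameter description above.
    static final Set<String> REASONS = Set.of("vote_no", "timeout", "participant_unreachable");

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Assumption: unknown reasons are rejected to keep tag cardinality bounded.
    void record(String reason) {
        if (!REASONS.contains(reason)) {
            throw new IllegalArgumentException("unknown abort reason: " + reason);
        }
        counts.computeIfAbsent(reason, r -> new LongAdder()).increment();
    }

    // Current count for a reason; 0 if it was never recorded.
    long count(String reason) {
        LongAdder adder = counts.get(reason);
        return adder == null ? 0 : adder.sum();
    }
}
```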
    • recordCrossGroupRecovery

      public void recordCrossGroupRecovery(String cause)
      Record a cross-group 2PC recovery tagged by cause.
      Parameters:
      cause - one of coord_crash, participant_crash
    • registerPartitionMigrationMetrics

      public void registerPartitionMigrationMetrics()
      Register partition-migration metrics for the BLK-2026-04-22-007 Day 6-8 data-ship protocol. The counters increment per chunk applied; the timer records the total migration duration per slot.

      Metrics landed:

      • loomcache.partition.migration.duration — per-slot wall-time
      • loomcache.partition.migration.bytes — bytes shipped (monotonic)
      • loomcache.partition.migration.keys — keys migrated (monotonic)
    • registerBackupPromotionMetrics

      public void registerBackupPromotionMetrics()
      Register metrics for partition ownership promotions that follow the departure of an owner.

      Metrics landed:

      • loomcache.partition.backup_promotion.count — promoted slot count
      • loomcache.partition.backup_promotion.duration — promotion wall-time
    • recordBackupPromotion

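      public void recordBackupPromotion(int promotedSlotCount, Duration elapsed)
      Record a backup promotion.
      Parameters:
      promotedSlotCount - the number of slots promoted
      elapsed - the promotion wall-time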
    • registerHandshakeMetrics

      public void registerHandshakeMetrics()
      Register protocol-handshake metrics. Increment handshakeAttempts on every handshake initiation, handshakeFailures on any IO or parse error, and handshakeRejected specifically when the peer's version fails the compatibility matrix.

      Metrics landed:

      • loomcache.handshake.attempts — count of handshake initiations
      • loomcache.handshake.failures — count of handshake IO/parse errors
      • loomcache.handshake.rejected — count of incompatible-peer rejections
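
      The increment rules above can be sketched as follows. HandshakeCounters and its Outcome enum are hypothetical, and the sketch assumes a version rejection is counted only under rejected and not also under failures; the documentation does not state whether the two overlap.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counters mirroring attempts / failures / rejected.
final class HandshakeCounters {
    enum Outcome { OK, IO_OR_PARSE_ERROR, INCOMPATIBLE_VERSION }

    final AtomicLong attempts = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
    final AtomicLong rejected = new AtomicLong();

    // Call when a handshake is initiated, before the outcome is known.
    void onAttempt() {
        attempts.incrementAndGet();
    }

    // Call once the handshake finishes.
    void onOutcome(Outcome outcome) {
        switch (outcome) {
            case IO_OR_PARSE_ERROR:
                failures.incrementAndGet();
                break;
            case INCOMPATIBLE_VERSION:
                // Assumption: counted separately from failures.
                rejected.incrementAndGet();
                break;
            default:
                break; // OK: only the attempt is counted
        }
    }
}
```

      Counting the attempt before the outcome means attempts minus failures minus rejected approximates successful plus in-flight handshakes.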