Class LoomMetrics
Exposes metrics across four domains:
- Raft consensus: term, commit index, log size, election counts, replication lag, follower lag (ms since last ACK), follower match index (replication progress)
- Command latency breakdown: parse, queue wait, Raft commit, state machine apply (nanoseconds)
- TCP networking: active connections, messages sent/received, send errors
- Cache data structures: entry counts, hit/miss ratios, evictions
All metrics are tagged with the node ID for multi-node deployments. Gauges use Supplier-based lambdas to pull live values from their sources. Timers are used for latency measurements and automatically record percentiles and counts.
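The pull model described above can be illustrated without any metrics library. The following is a minimal sketch, not the class's actual implementation: `PullGauge`, the tag key `node`, and the metric name `raft.commit_index` are hypothetical names chosen for illustration. The point is that the gauge stores a Supplier and evaluates it only when read, so a reader always sees the current value of the source.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics-library gauge: it stores a Supplier and
// evaluates it only when read, so the reported value is always current.
final class PullGauge {
    private final String name;
    private final Map<String, String> tags;
    private final Supplier<Number> source;

    PullGauge(String name, String nodeId, Supplier<Number> source) {
        this.name = name;
        this.tags = Map.of("node", nodeId); // tag key "node" is an assumption
        this.source = source;
    }

    String name() { return name; }

    Map<String, String> tags() { return tags; }

    // Called at scrape time; pulls the live value from the source.
    double value() { return source.get().doubleValue(); }
}

final class PullGaugeDemo {
    public static void main(String[] args) {
        AtomicLong commitIndex = new AtomicLong(10); // stand-in for live Raft state
        PullGauge gauge = new PullGauge("raft.commit_index", "node-1", commitIndex::get);
        System.out.println(gauge.value()); // reads the value at this moment
        commitIndex.set(42);
        System.out.println(gauge.value()); // reads again, sees the update
    }
}
```

Because nothing is copied at registration time, no separate update call is needed for a gauge of this kind; the source object only has to stay reachable.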
-
Method Summary
Modifier and Type | Method | Description
void | cleanupStalePeerMetrics(Set<String> activePeers) | Remove metrics for peers that are no longer in the active cluster membership.
static LoomMetrics | create(MeterRegistry registry, String nodeId) | Static factory method to create and initialize LoomMetrics.
void | recordBackupPromotion(int promotedSlotCount, Duration elapsed) |
void | recordCrossGroupAbort(String reason) | Record a cross-group 2PC abort tagged by reason.
void | recordCrossGroupRecovery(String cause) | Record a cross-group 2PC recovery tagged by cause.
void | recordIndexesSkippedQuery() |
void | recordNoMatchingIndexQuery() |
void | registerBackupPromotionMetrics() | Register metrics for partition ownership promotions after a departed owner.
void | registerCacheMetrics(DataStructureRegistry dataStructures) | Register all cache data structure metrics.
void | registerCommandLatencyMetrics() | Register command latency breakdown metrics.
void | registerCpSubsystemMetrics() | Register CP Subsystem (locks, semaphores) metrics.
void | registerHandshakeMetrics() | Register protocol-handshake metrics.
void | registerIndexMetrics(DataStructureRegistry dataStructures, @Nullable PartitionRouter partitionRouter) | Register index observability metrics for query planning.
void | registerNetworkMetrics(TcpServer tcpServer) | Register all TCP network metrics.
void | registerPartitionMigrationMetrics() | Register partition-migration metrics for the BLK-2026-04-22-007 Day 6-8 data-ship protocol.
void | registerPipelineMetrics() | Register pipeline metrics for pipelined command execution.
void | registerQueryMetrics() | Register query metrics for distributed queries and scans.
void | registerRaftMetrics(RaftNodeApi raftNode) | Register all Raft consensus metrics.
void | registerSnapshotMetrics() | Register snapshot and WAL fsync batching metrics.
void | registerTransactionMetrics() | Register cross-group 2PC latency histograms + abort/recovery counters described in BLK-2026-04-22-007 Day 9.
void | removePeerMetrics(String peerId) | Remove all metrics gauges for a peer that has left the cluster.
void | updateCacheMetrics(@Nullable DistributedMap<?, ?> map) | Update cache metrics from the DistributedMap.
void | updateFollowerLagMetrics(RaftNodeApi raftNode) | Update per-follower replication lag metrics from RaftNode.
void | updateFollowerMatchIndexMetrics(RaftNodeApi raftNode) | Update per-follower match index metrics from RaftNode.
void | updateNetworkMetrics(TcpServer tcpServer) | Update TCP network metrics from the TcpServer.
void | updateRaftMetrics(RaftNodeApi raftNode) | Update Raft election metrics from the RaftNode.
-
Method Details
-
create
public static LoomMetrics create(io.micrometer.core.instrument.MeterRegistry registry, String nodeId)
Static factory method to create and initialize LoomMetrics.
- Parameters:
registry - the MeterRegistry to register metrics with (must not be null)
nodeId - the node ID for tagging metrics (must not be null)
- Returns:
a new LoomMetrics instance
- Throws:
NullPointerException - if registry or nodeId is null
-
registerRaftMetrics
Register all Raft consensus metrics. These metrics track leader election, log replication, and state machine consistency.
- Parameters:
raftNode - the RaftNode to instrument (must not be null)
- Throws:
NullPointerException - if raftNode is null
-
registerCpSubsystemMetrics
public void registerCpSubsystemMetrics()
Register CP Subsystem (locks, semaphores) metrics. Tracks active locks, semaphore permits, and pending operations.
-
registerQueryMetrics
public void registerQueryMetrics()
Register query metrics for distributed queries and scans.
-
registerIndexMetrics
public void registerIndexMetrics(DataStructureRegistry dataStructures, @Nullable PartitionRouter partitionRouter)
Register index observability metrics for query planning.
Metrics landed:
- loomcache.indexes.skipped_query_count: queries where a matching index existed but execution scanned
- loomcache.indexes.no_matching_index_query_count: predicate queries with no matching index
- loomcache.indexes.partitions_indexed: local partition slots covered by at least one index catalog entry
- Parameters:
dataStructures - registry holding SQL index metadata
partitionRouter - partition router, or null in single-group mode
-
recordIndexesSkippedQuery
public void recordIndexesSkippedQuery()
-
recordNoMatchingIndexQuery
public void recordNoMatchingIndexQuery()
-
registerPipelineMetrics
public void registerPipelineMetrics()
Register pipeline metrics for pipelined command execution.
-
registerCommandLatencyMetrics
public void registerCommandLatencyMetrics()
Register command latency breakdown metrics. These timers break down the end-to-end latency of write commands into stages:
- parse_ns: time to deserialize and validate incoming command
- queue_wait_ns: time spent waiting in execution queue
- raft_commit_ns: time for Raft consensus/replication
- apply_ns: time to apply to state machine
Timers track count, total time, mean, and percentiles (p50, p95, p99). This enables bottleneck diagnosis: is slowness due to network, consensus, or state machine?
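To make the bottleneck-diagnosis idea concrete, here is a small library-free sketch of a per-command stage breakdown. `StageBreakdown` and its methods are hypothetical and are not part of LoomMetrics; the stage names simply mirror the four timers listed above. It accumulates nanoseconds per stage and reports which stage dominates.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical per-command recorder; the stages mirror the timers listed above.
final class StageBreakdown {
    enum Stage { PARSE, QUEUE_WAIT, RAFT_COMMIT, APPLY }

    private final Map<Stage, Long> nanos = new EnumMap<>(Stage.class);

    // Accumulate elapsed nanoseconds for one stage.
    void record(Stage stage, long elapsedNanos) {
        nanos.merge(stage, elapsedNanos, Long::sum);
    }

    // Sum over all recorded stages, i.e. the end-to-end latency seen so far.
    long totalNanos() {
        return nanos.values().stream().mapToLong(Long::longValue).sum();
    }

    // The stage with the largest share of the recorded latency.
    // Throws NoSuchElementException if nothing has been recorded.
    Stage slowest() {
        return nanos.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}
```

In practice each `record` call would wrap a pair of `System.nanoTime()` readings taken at the stage boundaries; a real timer additionally keeps a histogram so that percentiles can be reported.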
-
registerNetworkMetrics
Register all TCP network metrics. These metrics track connection counts, message throughput, and send errors.
- Parameters:
tcpServer - the TcpServer to instrument (must not be null)
- Throws:
NullPointerException - if tcpServer is null
-
registerCacheMetrics
Register all cache data structure metrics. These metrics track cache entry counts, hit/miss ratios, and evictions.
- Parameters:
dataStructures - the DataStructureRegistry (must not be null)
- Throws:
NullPointerException - if dataStructures is null
-
registerSnapshotMetrics
public void registerSnapshotMetrics()
Register snapshot and WAL fsync batching metrics. These metrics track snapshot installation latency, save/load performance, and batch fsync behavior.
-
updateRaftMetrics
Update Raft election metrics from the RaftNode. Call this periodically or after election events.
- Parameters:
raftNode - the RaftNode to query (must not be null)
- Throws:
NullPointerException - if raftNode is null
-
updateFollowerLagMetrics
Update per-follower replication lag metrics from RaftNode. Should be called periodically (e.g., every heartbeat interval) to track follower lag.
Replication lag = time elapsed since last successful AppendEntries ACK. Monitor for lag > (5 * heartbeatInterval) to detect degraded followers.
- Parameters:
raftNode - the RaftNode to query for follower state (must not be null)
- Throws:
NullPointerException - if raftNode is null
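The monitoring rule above (lag greater than five heartbeat intervals) can be written as a one-line predicate. This is only a sketch of that rule; `FollowerLagCheck` and its constant are hypothetical names and nothing here is taken from the LoomMetrics source.

```java
import java.time.Duration;

final class FollowerLagCheck {
    // Multiplier from the guidance above: lag > 5 * heartbeatInterval.
    static final int DEGRADED_MULTIPLIER = 5;

    // lagMs is the time in milliseconds since the follower's last successful
    // AppendEntries ACK; the comparison is strict, matching "lag > ...".
    static boolean isDegraded(long lagMs, Duration heartbeatInterval) {
        return lagMs > DEGRADED_MULTIPLIER * heartbeatInterval.toMillis();
    }
}
```

With a 100 ms heartbeat, for example, a follower is flagged once its lag exceeds 500 ms. An alerting rule would normally also require the condition to hold for some time to avoid flapping on a single missed heartbeat.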
-
updateFollowerMatchIndexMetrics
Update per-follower match index metrics from RaftNode. Should be called periodically to track replication progress on each follower.
Match index = highest log index known to be replicated on this peer. A peer with low match index relative to commit index is lagging behind.
- Parameters:
raftNode - the RaftNode to query for match index state (must not be null)
- Throws:
NullPointerException - if raftNode is null
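As an illustration of reading match index against commit index, the sketch below computes how many log entries each follower is behind. `MatchIndexLag.entriesBehind` is a hypothetical helper, and the peer-to-match-index map is an assumed input shape rather than anything exposed by RaftNodeApi.

```java
import java.util.Map;
import java.util.TreeMap;

final class MatchIndexLag {
    // Entries each follower still needs before it reaches the commit index.
    // A follower at or beyond the commit index is reported as 0 behind.
    static Map<String, Long> entriesBehind(long commitIndex,
                                           Map<String, Long> matchIndexByPeer) {
        Map<String, Long> behind = new TreeMap<>(); // sorted by peer ID for stable output
        matchIndexByPeer.forEach((peer, match) ->
                behind.put(peer, Math.max(0L, commitIndex - match)));
        return behind;
    }
}
```

A gap that stays large or keeps growing across successive samples is the signal of interest; a small transient gap is normal while entries are in flight.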
-
updateNetworkMetrics
Update TCP network metrics from the TcpServer. Call this periodically or after significant network events.
- Parameters:
tcpServer - the TcpServer to query (must not be null)
- Throws:
NullPointerException - if tcpServer is null
-
updateCacheMetrics
Update cache metrics from the DistributedMap. Call this periodically or after cache operations.
- Parameters:
map - the DistributedMap to query (can be null)
-
removePeerMetrics
Remove all metrics gauges for a peer that has left the cluster. Removes entries from the backing maps and deregisters the corresponding Micrometer Gauge instances to prevent memory leaks and metric pollution.
- Parameters:
peerId - the ID of the peer to remove metrics for (must not be null)
- Throws:
NullPointerException - if peerId is null
-
cleanupStalePeerMetrics
Remove metrics for peers that are no longer in the active cluster membership. This is a safety-net sweep intended to be called periodically (e.g., every 5 minutes) to catch any peers whose removal was missed by the direct removePeerMetrics() calls.
- Parameters:
activePeers - the current set of active cluster member IDs (must not be null)
- Throws:
NullPointerException - if activePeers is null
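The safety-net sweep can be pictured as a set difference between the tracked peers and the active membership. The sketch below shows only the backing-map half of that idea; `PeerGaugeTable` is a hypothetical class, and the deregistration of the corresponding gauges from the meter registry is deliberately left out.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PeerGaugeTable {
    // Backing values that per-peer gauges would read from (peer ID -> lag in ms).
    private final Map<String, Long> lagByPeer = new ConcurrentHashMap<>();

    void update(String peerId, long lagMs) {
        lagByPeer.put(peerId, lagMs);
    }

    // Drop every tracked peer that is absent from the active membership.
    // Returns how many entries were removed.
    int sweep(Set<String> activePeers) {
        Objects.requireNonNull(activePeers, "activePeers");
        int before = lagByPeer.size();
        lagByPeer.keySet().retainAll(activePeers);
        return before - lagByPeer.size();
    }

    Set<String> peers() {
        return Set.copyOf(lagByPeer.keySet());
    }
}
```

The sweep is idempotent, so running it on a timer in addition to direct per-peer removal is harmless: a second pass with the same membership removes nothing.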
-
registerTransactionMetrics
public void registerTransactionMetrics()
Register cross-group 2PC latency histograms and abort/recovery counters described in BLK-2026-04-22-007 Day 9. Call from CacheNode's metrics-wiring phase once the TwoPhaseCoordinator and Participant exist.
Metrics landed:
- loomcache.tx.crossgroup.commit.latency: end-to-end TX duration
- loomcache.tx.crossgroup.prepare.duration: PREPARE phase wall-time
- loomcache.tx.crossgroup.decide.duration: DECIDE phase wall-time
- loomcache.tx.crossgroup.aborts{reason}: aborts by cause (vote_no, timeout, participant_unreachable)
- loomcache.tx.crossgroup.recoveries{cause}: recoveries by cause (coord_crash, participant_crash)
-
recordCrossGroupAbort
Record a cross-group 2PC abort tagged by reason.
- Parameters:
reason - one of vote_no, timeout, participant_unreachable
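A counter tagged by reason behaves like one counter per tag value. The following sketch models that with a map of adders; `AbortCounters` is hypothetical, and rejecting reasons outside the documented set is a choice made for this sketch (to keep tag cardinality bounded), not documented behavior of recordCrossGroupAbort.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class AbortCounters {
    // Reason values listed in the parameter description above.
    static final Set<String> REASONS =
            Set.of("vote_no", "timeout", "participant_unreachable");

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Increment the counter for one reason. Unknown reasons are rejected here;
    // that validation is an assumption of this sketch.
    void recordAbort(String reason) {
        if (!REASONS.contains(reason)) {
            throw new IllegalArgumentException("unknown abort reason: " + reason);
        }
        counts.computeIfAbsent(reason, r -> new LongAdder()).increment();
    }

    // Current count for a reason; 0 if it has never been recorded.
    long count(String reason) {
        LongAdder adder = counts.get(reason);
        return adder == null ? 0L : adder.sum();
    }
}
```

The same shape applies to recoveries tagged by cause, with the value set replaced by coord_crash and participant_crash.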
-
recordCrossGroupRecovery
Record a cross-group 2PC recovery tagged by cause.
- Parameters:
cause - one of coord_crash, participant_crash
-
registerPartitionMigrationMetrics
public void registerPartitionMigrationMetrics()
Register partition-migration metrics for the BLK-2026-04-22-007 Day 6-8 data-ship protocol. Counters increment per chunk applied; timer records per-slot total migration duration.
Metrics landed:
- loomcache.partition.migration.duration: per-slot wall-time
- loomcache.partition.migration.bytes: bytes shipped (monotonic)
- loomcache.partition.migration.keys: keys migrated (monotonic)
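The split between per-chunk counters and a per-slot timer can be sketched as follows. `MigrationStats` and its callback names are hypothetical; the actual hook points in the data-ship protocol are not described in this page, so treat the call sites as assumptions.

```java
import java.time.Duration;
import java.util.concurrent.atomic.LongAdder;

final class MigrationStats {
    private final LongAdder bytesShipped = new LongAdder();
    private final LongAdder keysMigrated = new LongAdder();
    private final LongAdder slotsCompleted = new LongAdder();
    private final LongAdder totalSlotNanos = new LongAdder();

    // Called once per applied chunk. Negative deltas are rejected so that
    // both counters stay monotonic.
    void onChunkApplied(long keyCount, long byteCount) {
        if (keyCount < 0 || byteCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        keysMigrated.add(keyCount);
        bytesShipped.add(byteCount);
    }

    // Called once per slot, with the wall-time of that slot's whole migration.
    void onSlotMigrated(Duration elapsed) {
        slotsCompleted.increment();
        totalSlotNanos.add(elapsed.toNanos());
    }

    long bytes() { return bytesShipped.sum(); }

    long keys() { return keysMigrated.sum(); }

    // Mean per-slot duration, or zero if no slot has finished yet.
    Duration meanSlotDuration() {
        long n = slotsCompleted.sum();
        return n == 0 ? Duration.ZERO : Duration.ofNanos(totalSlotNanos.sum() / n);
    }
}
```

Keeping the duration at slot granularity while counting bytes and keys per chunk means throughput can be derived from the counters' rate without timing every chunk.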
-
registerBackupPromotionMetrics
public void registerBackupPromotionMetrics()
Register metrics for partition ownership promotions after a departed owner.
Metrics landed:
- loomcache.partition.backup_promotion.count: promoted slot count
- loomcache.partition.backup_promotion.duration: promotion wall-time
-
recordBackupPromotion
-
registerHandshakeMetrics
public void registerHandshakeMetrics()
Register protocol-handshake metrics. Increment handshakeAttempts on every handshake initiation, handshakeFailures on any IO or parse error, and handshakeRejected specifically when the peer's version fails the compatibility matrix.
Metrics landed:
- loomcache.handshake.attempts: count of handshake initiations
- loomcache.handshake.failures: count of handshake IO/parse errors
- loomcache.handshake.rejected: count of incompatible-peer rejections
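The increment rules above can be summarized in a small sketch. `HandshakeCounters` and the `Outcome` enum are hypothetical, and the sketch assumes that a version rejection is counted only under rejected and not also under failures; the page does not say whether the two overlap.

```java
import java.util.concurrent.atomic.LongAdder;

final class HandshakeCounters {
    // Hypothetical classification of how a handshake ended.
    enum Outcome { OK, IO_OR_PARSE_ERROR, INCOMPATIBLE_VERSION }

    final LongAdder attempts = new LongAdder();
    final LongAdder failures = new LongAdder();
    final LongAdder rejected = new LongAdder();

    // Every handshake counts as an attempt; errors and version rejections are
    // counted separately (assumed to be disjoint in this sketch).
    void onHandshake(Outcome outcome) {
        attempts.increment();
        if (outcome == Outcome.IO_OR_PARSE_ERROR) {
            failures.increment();
        } else if (outcome == Outcome.INCOMPATIBLE_VERSION) {
            rejected.increment();
        }
    }
}
```

Under that assumption, the number of successful handshakes is attempts minus failures minus rejected.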
-