Class CacheNode

java.lang.Object
com.loomcache.server.CacheNode
All Implemented Interfaces:
MessageHandler

public class CacheNode extends Object implements MessageHandler
A single cache cluster node — Consistency-by-default architecture.

Replication Model (v1): Full Replication

All nodes participate in a SINGLE Raft group. Every node holds ALL data. This is similar to etcd and ZooKeeper — simple, strong consistency, at the cost of storage efficiency for very large datasets.

Writes: go through Raft consensus on the leader, replicated to all followers. Reads: served locally on any node (leader uses lease-based quorum reads).

Cluster Isolation

Every node is bound to a ClusterConfig that carries a cluster name/id. Nodes with different cluster ids ignore each other's JOIN requests, which enables parallel integration tests with isolated clusters.

Lifecycle

1. start() — binds TCP port, starts accept loop 2. connectToSeeds — connects to seeds, exchanges cluster ID 3. heartbeat — pings peers every 2s 4. failure detect — marks peers dead after 6s silence 5. stop() — graceful shutdown
  • Field Details

  • Constructor Details

    • CacheNode

      public CacheNode(ClusterConfig config)
      Primary constructor — uses ClusterConfig.
    • CacheNode

      public CacheNode(ClusterConfig config, @Nullable io.micrometer.core.instrument.MeterRegistry meterRegistry)
      Primary constructor with custom MeterRegistry for metrics collection.
      Parameters:
      config - the cluster configuration
      meterRegistry - the MeterRegistry for metrics (or null to use SimpleMeterRegistry)
    • CacheNode

      public CacheNode(String nodeId, String host, int port, int instanceNumber, List<String> seedAddresses)
      Legacy constructor — wraps into ClusterConfig with a random cluster UUID.
  • Method Details

    • getRaftNodeApi

      public RaftNodeApi getRaftNodeApi()
      Narrow view of raftNode for callers that only need the query/submit RaftNodeApi. Used primarily so tests can depend on the interface (and supply mock(RaftNodeApi.class)) instead of the concrete Raft implementation. Prefer this over
      invalid reference
      #getRaftNode()
      when concrete lifecycle methods are not needed.
    • forceLeaderStepDown

      public boolean forceLeaderStepDown()
      Force the local Raft leader to step down for operator-driven failover or maintenance.
      Returns:
      true when this node was the leader and step-down was requested
    • triggerHotBackup

      public HotBackupManager.HotBackupMetadata triggerHotBackup() throws IOException
      Trigger an operator Hot Backup into ClusterConfig.persistenceBackupDir(). The live WAL and snapshot stores are not modified or compacted.
      Returns:
      metadata describing the backup manifest and group snapshot files
      Throws:
      IOException - if writing the backup fails
      IllegalStateException - if no backup directory is configured
    • start

      public void start() throws IOException
      Throws:
      IOException
    • startWithoutRaft

      public void startWithoutRaft() throws IOException
      Start networking, discovery, and background services without arming the Raft election timer yet. Intended for tests that need to form the TCP mesh before starting consensus.
      Throws:
      IOException
    • startRaft

      public void startRaft()
      Start the Raft subsystem after the node has already bootstrapped its networking stack. Safe to call multiple times.
    • shutdownGracefully

      public void shutdownGracefully(Duration drainTimeout)
      Gracefully shut down the cache node with a configurable drain timeout.

      Shutdown sequence:

      1. Set DRAINING state to reject new connections/requests
      2. Wait for in-flight requests to complete (up to drain timeout)
      3. Stop health checker and discovery mechanisms
      4. Step down from Raft leadership (if leader)
      5. Stop Raft consensus engine
      6. Stop other components (metrics, audit logger)
      7. Broadcast LEAVE message to cluster peers
      8. Close all connections and TCP server
      Parameters:
      drainTimeout - the timeout to wait for in-flight requests to complete
    • shutdownForTest

      public void shutdownForTest()
      Stop a test-owned node without broadcasting a cluster leave.

      Integration tests create many short-lived clusters in parallel. Treating every teardown as a production graceful leave makes surviving peers run membership-change and Raft-removal paths after the test already ended, which creates background churn that interferes with neighboring test clusters. Test shutdown is intentionally local: stop accepting work, stop Raft and background components, then close sockets.

    • stop

      public void stop()
      Gracefully shut down the cache node with a default drain timeout of 5 seconds.

      This is a convenience method that calls shutdownGracefully(Duration) with a 5-second drain timeout. It's suitable for most rolling restart scenarios.

    • awaitStarted

      public boolean awaitStarted(long timeout, TimeUnit unit) throws InterruptedException
      Throws:
      InterruptedException
    • raftGroupForKey

      public RaftNodeApi raftGroupForKey(String key)
      Resolve the RaftNode that should propose mutations for the given key.

      Routing semantics:

      • When sharding is disabled (the default, ClusterConfig.shardingEnabled()==false), returns raftNode — the single v1.x consensus group. Callers that route through this helper see byte-for-byte identical behavior to raw raftNode.propose(...).
      • When sharding is enabled, delegates to PartitionRouter.getRaftNodeForKey(String) which hashes the key into one of numGroups partitions and returns the owning RaftGroupManager group (lazily creating it on first access).

      This is the single chokepoint for key-based Raft-group routing. Handler code that needs to propose mutations keyed by a data-structure key should route through this method instead of touching raftNode directly. Reads remain local to the registry and do not go through this helper.

      Per-partition snapshots: snapshots remain at the group level — each Raft group snapshots its own partitions as a single blob via DataStructureRegistry.snapshotAllData(). Per-partition snapshots within a group (which would enable faster rebalancing of individual partitions) are deferred future work.

      Parameters:
      key - the data-structure key to route (must not be null)
      Returns:
      the RaftNode that owns this key — never null
      Since:
      2.0
    • addSecurityInterceptor

      public void addSecurityInterceptor(SecurityInterceptor interceptor)
      Register a global security interceptor invoked around external client operations.
    • removeSecurityInterceptor

      public boolean removeSecurityInterceptor(SecurityInterceptor interceptor)
      Remove a previously registered security interceptor.
    • addSocketInterceptor

      public void addSocketInterceptor(SocketInterceptor interceptor)
      Register a layer-4 socket interceptor for accepted inbound TCP connections.
    • removeSocketInterceptor

      public boolean removeSocketInterceptor(SocketInterceptor interceptor)
      Remove a previously registered layer-4 socket interceptor.
    • addMigrationListener

      public void addMigrationListener(MigrationListener listener)
      Register an operator listener for partition migration lifecycle events.
    • removeMigrationListener

      public boolean removeMigrationListener(MigrationListener listener)
      Remove a previously registered partition migration listener.
    • getOperationalState

      public ClusterState.OperationalState getOperationalState()
      Return the current cluster operational state.
    • getClusterVersion

      public LoomVersion getClusterVersion()
      Return the cluster version gate for rolling-upgrade operations.
    • changeClusterVersion

      public LoomVersion changeClusterVersion(LoomVersion newVersion)
      Change the cluster version gate and return the previous version.
    • resetCpSubsystem

      public ConsistencySubsystem.ResetResult resetCpSubsystem()
      Reset local CP subsystem state without touching AP data structures.
    • listCpGroups

      public List<CacheNode.CpGroupSummary> listCpGroups()
      Lists instantiated CP/Raft groups for operator inspection.
    • listCpSessions

      public List<ConsistencySubsystem.CpSessionInfo> listCpSessions()
      Lists active CP sessions for operator inspection.
    • forceCloseCpSession

      public boolean forceCloseCpSession(String sessionId)
      Force-closes a CP session and releases its tracked resources.
    • transitionOperationalState

      public void transitionOperationalState(ClusterState.OperationalState newState)
      Transition this node's view of the cluster operational state.
    • handleMessage

      public @Nullable Message handleMessage(Message message, ConnectionContext sender)
      Description copied from interface: MessageHandler
      Handle an incoming message and optionally return a response.
      Specified by:
      handleMessage in interface MessageHandler
      Parameters:
      message - the incoming message
      sender - the connection context of the sender
      Returns:
      response message, or null if no response needed
    • onPeerDisconnect

      public void onPeerDisconnect(String peerId)
      Called by TcpServer when a peer connection is closed. Cleans up subscriptions and listener registrations for the disconnected peer.
      Specified by:
      onPeerDisconnect in interface MessageHandler
      Parameters:
      peerId - the peer ID that disconnected
    • parseSeed

      public static CacheNode.ParsedSeed parseSeed(String raw)
      Parse a seed address supporting IPv4 (host:port) and bracketed IPv6 ([2001:db8::1]:port) forms. Raw colon-bearing hosts are rejected to avoid misinterpreting IPv6 literals as host:port.
    • reconnectToSeeds

      public void reconnectToSeeds()
      Re-attempt connections to configured seed nodes. Useful for tests that start nodes sequentially — earlier nodes will have failed to connect to later nodes that were not yet running.
    • isRunning

      public boolean isRunning()
    • connectionCount

      public int connectionCount()
      Returns the number of active TCP connections to peer nodes. Useful for test harnesses that need to verify mesh connectivity before starting Raft consensus.
    • verifiedClusterConnectionCount

      public int verifiedClusterConnectionCount()
      Returns the number of verified cluster-peer connections. Excludes temporary seed/pending identities that are still converging.
    • aliveMemberIds

      public Set<String> aliveMemberIds()