Class PartitionDetector

java.lang.Object
com.loomcache.server.cluster.PartitionDetector

public class PartitionDetector extends Object
Network partition detector for cluster health monitoring.

Reports one of three partition states:

  • NO_PARTITION: All nodes are reachable and healthy
  • PARTIAL: Asymmetric partitions where some nodes can reach others but not vice versa
  • FULL_PARTITION: Complete network split (majority vs minority partitions)

Uses phi-accrual failure detection (based on inter-arrival times of heartbeats) to adaptively detect node failures without fixed timeouts.
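
The idea behind phi-accrual detection can be illustrated with a small stand-alone sketch. It is not the LoomCache implementation: the class name PhiAccrualSketch and its methods are hypothetical, and it assumes exponentially distributed inter-arrival times (a common simplification), so that phi = elapsed / (mean * ln 10). The suspicion level then grows continuously with silence instead of flipping at a fixed timeout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a phi-accrual estimator; not the LoomCache class. */
final class PhiAccrualSketch {
    private final int windowSize;
    private final Deque<Long> intervalsMs = new ArrayDeque<>(); // sliding window
    private long sumMs = 0;
    private long lastHeartbeatMs = -1;

    PhiAccrualSketch(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Records a heartbeat and updates the window of inter-arrival times. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            sumMs += interval;
            if (intervalsMs.size() > windowSize) {
                sumMs -= intervalsMs.removeFirst(); // drop the oldest sample
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /**
     * phi = -log10(P(next heartbeat arrives later than now)).
     * Assumption: exponential inter-arrival times, so P = exp(-elapsed / mean).
     */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0; // not enough data yet: no suspicion
        }
        double mean = (double) sumMs / intervalsMs.size();
        if (mean <= 0.0) {
            return 0.0;
        }
        double elapsed = Math.max(0L, nowMs - lastHeartbeatMs);
        return elapsed / (mean * Math.log(10.0));
    }
}
```

Under this model, a node whose heartbeats have arrived every 1000 ms crosses a threshold of 8.0 after roughly 18.4 seconds of silence (8 * ln 10 * 1000 ms).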

Configuration

Configured via a PartitionDetector.DetectorConfig record; partitionHistoryLimit is an optional override passed to the constructor:
  • phiThreshold: phi value above which a node is suspected dead (default: 8.0)
  • heartbeatIntervalMs: expected interval between heartbeats (used for startup grace and metrics)
  • windowSize: sliding window size for inter-arrival times (default: 100)
  • partitionHistoryLimit: max partition events retained in memory (default: 1024)

Thread Safety

Uses ReentrantReadWriteLock for concurrent access to detector state. All operations are virtual-thread friendly.
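
As a rough illustration of the locking pattern described here, the sketch below guards a map of phi values with a ReentrantReadWriteLock. The class GuardedPhiTable and its fields are hypothetical and do not reflect the detector's internals. Many readers may hold the read lock at once, while a writer gets exclusive access. Locks from java.util.concurrent also allow a blocked virtual thread to unmount from its carrier thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of read/write-locked detector state. */
final class GuardedPhiTable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Double> phiByNode = new HashMap<>();

    /** Writers take the exclusive write lock. */
    void update(String nodeId, double phi) {
        lock.writeLock().lock();
        try {
            phiByNode.put(nodeId, phi);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Readers share the read lock; unknown nodes report 0.0. */
    double get(String nodeId) {
        lock.readLock().lock();
        try {
            return phiByNode.getOrDefault(nodeId, 0.0);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Returns an immutable copy taken under the read lock. */
    Map<String, Double> snapshot() {
        lock.readLock().lock();
        try {
            return Map.copyOf(phiByNode);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```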
  • Constructor Details

    • PartitionDetector

      public PartitionDetector(String nodeId, PartitionDetector.DetectorConfig config)
    • PartitionDetector

      public PartitionDetector(String nodeId, PartitionDetector.DetectorConfig config, int partitionHistoryLimit)
      Create a detector with explicit partition history retention.
      Parameters:
      nodeId - the ID of the local node
      config - detector threshold and heartbeat settings
      partitionHistoryLimit - max partition events retained in memory
    • PartitionDetector

      public PartitionDetector(String nodeId)
      Create a detector with default configuration.
      Parameters:
      nodeId - the ID of the local node
  • Method Details

    • recordHeartbeat

      public void recordHeartbeat(String nodeId)
      Record a heartbeat arrival from a node.

      Updates the phi-accrual detector for that node. If the node is not yet tracked, it is automatically registered.

      Parameters:
      nodeId - the ID of the node sending the heartbeat
    • recordReachability

      public void recordReachability(String observerNodeId, String targetNodeId, boolean reachable)
      Record one directed reachability observation.
      Parameters:
      observerNodeId - node that made the observation
      targetNodeId - node being observed
      reachable - whether observer can reach target
    • recordReachabilityReport

      public void recordReachabilityReport(String observerNodeId, Map<String,Boolean> reachableByNode)
      Record a directed reachability report from one peer.
      Parameters:
      observerNodeId - node that made the observations
      reachableByNode - map of target node id to reachability from the observer
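
Directed observations of this kind are what make asymmetric (PARTIAL) partitions visible. The following is a minimal sketch, assuming observations are kept as a nested map from observer to target. ReachabilityMatrixSketch and asymmetricLinks are hypothetical names, not part of this API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: stores directed reachability and finds one-way links. */
final class ReachabilityMatrixSketch {
    // observer -> (target -> whether the observer can reach the target)
    private final Map<String, Map<String, Boolean>> observations = new HashMap<>();

    void record(String observer, String target, boolean reachable) {
        observations.computeIfAbsent(observer, k -> new HashMap<>()).put(target, reachable);
    }

    void recordReport(String observer, Map<String, Boolean> reachableByNode) {
        reachableByNode.forEach((target, reachable) -> record(observer, target, reachable));
    }

    /** Returns "a->b" for each link where a reaches b but b reports it cannot reach a. */
    Set<String> asymmetricLinks() {
        Set<String> result = new TreeSet<>();
        observations.forEach((a, targets) -> targets.forEach((b, aReachesB) -> {
            Boolean bReachesA = observations.getOrDefault(b, Map.of()).get(a);
            // A missing reverse observation is treated as unknown, not as asymmetric.
            if (aReachesB && Boolean.FALSE.equals(bReachesA)) {
                result.add(a + "->" + b);
            }
        }));
        return result;
    }
}
```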
    • phi

      public double phi(String nodeId)
      Get the current phi value for a node.

      Higher phi values indicate greater suspicion that the node has failed. If the node is not yet tracked, returns 0.0 (no suspicion).

      Parameters:
      nodeId - the node ID
      Returns:
      the phi value (typically 0-15), or 0.0 if node not yet tracked
    • isAvailable

      public boolean isAvailable(String nodeId)
      Check if a node is available (not suspected dead).
      Parameters:
      nodeId - the node ID
      Returns:
      true if the node is available, false if suspected dead
    • getPartitionStatus

      public PartitionDetector.PartitionStatus getPartitionStatus()
      Get the current partition status.

      Analyzes all tracked nodes to determine whether:

        • All nodes are healthy (NO_PARTITION)
        • Some nodes are suspected (PARTIAL or FULL_PARTITION)

      Also tracks partition state changes for history and healing callbacks.

      Returns:
      the current PartitionStatus
    • getPartialPartitionRecoveryPlan

      public PartitionDetector.PartitionRecoveryPlan getPartialPartitionRecoveryPlan()
      Build a max-clique recovery plan from known heartbeat and peer reachability data.
      Returns:
      recovery plan over all known nodes
    • planPartialPartitionRecovery

      public PartitionDetector.PartitionRecoveryPlan planPartialPartitionRecovery(Set<String> clusterMembers)
      Build a max-clique recovery plan over an explicit cluster membership set.
      Parameters:
      clusterMembers - cluster members that must be considered by the plan
      Returns:
      recovery plan over the supplied members plus the local node
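
For intuition about what a max-clique plan computes, the sketch below finds the largest group of nodes in which every pair is connected. It assumes an undirected graph where two nodes are linked only if each can reach the other, and it uses the basic Bron-Kerbosch algorithm. MaxCliqueSketch is a hypothetical name, and the actual contents of PartitionRecoveryPlan are not shown here.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: largest set of mutually reachable nodes. */
final class MaxCliqueSketch {
    /** adjacency maps each node to the nodes it has two-way reachability with. */
    static Set<String> largestClique(Map<String, Set<String>> adjacency) {
        Set<String> best = new HashSet<>();
        expand(new HashSet<>(), new HashSet<>(adjacency.keySet()), new HashSet<>(), adjacency, best);
        return best;
    }

    // Basic Bron-Kerbosch: current is the clique so far, candidates can still
    // extend it, excluded holds nodes already fully explored.
    private static void expand(Set<String> current, Set<String> candidates, Set<String> excluded,
                               Map<String, Set<String>> adjacency, Set<String> best) {
        if (candidates.isEmpty() && excluded.isEmpty()) {
            if (current.size() > best.size()) { // current is a maximal clique
                best.clear();
                best.addAll(current);
            }
            return;
        }
        for (String v : new HashSet<>(candidates)) {
            Set<String> neighbors = adjacency.getOrDefault(v, Set.of());
            Set<String> nextCurrent = new HashSet<>(current);
            nextCurrent.add(v);
            Set<String> nextCandidates = new HashSet<>(candidates);
            nextCandidates.retainAll(neighbors);
            nextCandidates.remove(v); // guard against self-loops in the input
            Set<String> nextExcluded = new HashSet<>(excluded);
            nextExcluded.retainAll(neighbors);
            expand(nextCurrent, nextCandidates, nextExcluded, adjacency, best);
            candidates.remove(v);
            excluded.add(v);
        }
    }
}
```

Maximum clique is NP-hard in general, so an exhaustive search like this one is only practical for small cluster sizes.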
    • getSuspectedNodes

      public Set<String> getSuspectedNodes()
      Get the set of suspected nodes (nodes with high phi values).
      Returns:
      immutable set of suspected node IDs
    • getTrackedNodes

      public Set<String> getTrackedNodes()
      Get all tracked nodes.
      Returns:
      immutable set of tracked node IDs
    • getPartitionHistory

      public List<PartitionDetector.PartitionEvent> getPartitionHistory(int maxEntries)
      Get recent partition history events.
      Parameters:
      maxEntries - maximum number of entries to return (most recent first)
      Returns:
      list of recent PartitionEvent records
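
A bounded, most-recent-first history such as the one described by partitionHistoryLimit and maxEntries could be kept in a deque. This is an illustrative sketch only; BoundedHistorySketch is a hypothetical class, and a generic element type stands in for PartitionEvent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a size-limited event history, newest first. */
final class BoundedHistorySketch<E> {
    private final int limit;
    private final Deque<E> events = new ArrayDeque<>(); // newest at the head

    BoundedHistorySketch(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Adds an event and evicts the oldest one once the limit is exceeded. */
    void add(E event) {
        events.addFirst(event);
        if (events.size() > limit) {
            events.removeLast();
        }
    }

    /** Returns up to maxEntries events, most recent first, as an immutable list. */
    List<E> recent(int maxEntries) {
        List<E> out = new ArrayList<>();
        Iterator<E> it = events.iterator();
        while (it.hasNext() && out.size() < maxEntries) {
            out.add(it.next());
        }
        return List.copyOf(out);
    }
}
```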
    • getPartitionDiagnostics

      public PartitionDetector.PartitionDiagnostics getPartitionDiagnostics()
      Get current partition diagnostics.
      Returns:
      PartitionDiagnostics with current state
    • registerHealingCallback

      public void registerHealingCallback(Consumer<PartitionDetector.PartitionEvent> callback)
      Register a callback to be invoked when a partition heals.
      Parameters:
      callback - consumer function to invoke with PartitionEvent
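
One way to picture healing callbacks is as listeners fired on a transition from a partitioned state back to NO_PARTITION. The sketch below assumes exactly that trigger, which the documentation above does not spell out. HealingNotifierSketch, its Status enum, and the String event are hypothetical stand-ins for the detector, PartitionStatus, and PartitionEvent.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical sketch: fires callbacks when a partition heals. */
final class HealingNotifierSketch {
    enum Status { NO_PARTITION, PARTIAL, FULL_PARTITION }

    private final List<Consumer<String>> callbacks = new CopyOnWriteArrayList<>();
    private Status last = Status.NO_PARTITION;

    void registerHealingCallback(Consumer<String> callback) {
        callbacks.add(callback);
    }

    /** Call with each newly computed status; assumed to run on a single thread. */
    void onStatus(Status current) {
        Status previous = last;
        last = current;
        // Assumption: "healed" means leaving any partitioned state for NO_PARTITION.
        if (previous != Status.NO_PARTITION && current == Status.NO_PARTITION) {
            String event = previous + " -> " + current;
            for (Consumer<String> callback : callbacks) {
                callback.accept(event);
            }
        }
    }
}
```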
    • getDetectorStats

      public PartitionDetector.DetectorStats getDetectorStats()
      Get detector statistics.
      Returns:
      DetectorStats with cumulative statistics
    • reset

      public void reset()
      Reset all detector state. Useful when reinitializing a network or recovering from a partition.
    • getPhiSnapshot

      public Map<String,Double> getPhiSnapshot()
      Get a snapshot of all detector states for monitoring.
      Returns:
      map of nodeId -> phi value
    • toString

      public String toString()
      Overrides:
      toString in class Object