Class PhiAccrualFailureDetector

java.lang.Object
com.loomcache.server.cluster.PhiAccrualFailureDetector

public class PhiAccrualFailureDetector extends Object
Phi-Accrual Failure Detector — adaptive failure detection that calculates a continuous suspicion level (phi) based on inter-arrival times of heartbeats.

Unlike a fixed timeout, the detector adapts to network conditions automatically. Apache Cassandra uses the same approach for gossip-based failure detection.

Key Concepts

  • phi = -log10(P_later(timeSinceLastHeartbeat))
  • phi > threshold → node suspected dead
  • Threshold: 8.0 for LAN, 10-12 for WAN/cloud
  • Adaptive threshold: automatically adjusted based on observed network latency variance

Implementation Details

Based on "The Phi Accrual Failure Detector" paper (Hayashibara et al., 2004). The detector maintains a sliding window of inter-arrival times and models them using an exponential distribution. For each heartbeat received, the time since the previous one is recorded. Phi is calculated as the negative logarithm (base 10) of the probability that a heartbeat would have arrived later than the current elapsed time, had the node been alive.
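The sliding window described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name InterArrivalWindow, its fields, and the explicit nowMs parameter (used instead of a clock so the example is deterministic) are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: a bounded window of heartbeat inter-arrival times. */
public class InterArrivalWindow {
    private final Deque<Double> intervalsMs = new ArrayDeque<>();
    private final int sampleSize;
    private double lastHeartbeatMs = Double.NaN; // NaN until the first heartbeat arrives
    private double sumMs = 0.0;                  // running sum of the samples in the window

    public InterArrivalWindow(int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        this.sampleSize = sampleSize;
    }

    /** Records a heartbeat observed at the given time, in milliseconds. */
    public void heartbeat(double nowMs) {
        if (!Double.isNaN(lastHeartbeatMs)) {
            double interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            sumMs += interval;
            if (intervalsMs.size() > sampleSize) {
                sumMs -= intervalsMs.removeFirst(); // evict the oldest sample
            }
        }
        lastHeartbeatMs = nowMs;
    }

    public int sampleCount() {
        return intervalsMs.size();
    }

    /** Mean of the samples in the window, or 0.0 if there are none. */
    public double meanMs() {
        return intervalsMs.isEmpty() ? 0.0 : sumMs / intervalsMs.size();
    }
}
```

The first heartbeat only sets the reference timestamp, so n heartbeats yield n - 1 samples until the window is full.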

Adaptive Threshold

When enabled, the threshold is adjusted dynamically based on the coefficient of variation (stddev/mean) of inter-arrival times. High variance environments (WAN, cloud) automatically get higher thresholds to reduce false positives. The adaptation is bounded to prevent extreme values.
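One way such a bounded adjustment could look is sketched below. The scaling rule (base × (1 + CV)) and the upper bound of 1.5 × base are assumptions chosen for illustration; the documentation above does not state the actual formula or bounds.

```java
/** Hypothetical sketch of a bounded, variance-driven threshold adjustment. */
public class AdaptiveThresholdSketch {
    /** Assumed upper bound: the threshold never exceeds 1.5x the base value. */
    public static final double MAX_FACTOR = 1.5;

    /** Coefficient of variation (population stddev / mean); 0.0 for empty input. */
    public static double coefficientOfVariation(double[] samplesMs) {
        if (samplesMs.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : samplesMs) {
            sum += s;
        }
        double mean = sum / samplesMs.length;
        if (mean <= 0.0) {
            return 0.0;
        }
        double squaredDiffs = 0.0;
        for (double s : samplesMs) {
            squaredDiffs += (s - mean) * (s - mean);
        }
        return Math.sqrt(squaredDiffs / samplesMs.length) / mean;
    }

    /** Scales the base threshold by (1 + CV), clamped to base * MAX_FACTOR. */
    public static double effectiveThreshold(double base, double[] samplesMs) {
        double scaled = base * (1.0 + coefficientOfVariation(samplesMs));
        return Math.min(scaled, base * MAX_FACTOR);
    }
}
```

Under these assumptions, perfectly regular heartbeats leave the threshold at its base value, while highly irregular ones raise it up to the cap.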

Metrics

Exports the following Micrometer metrics (when registry is provided):
  • failure_detector_phi{peer="node-X"} — current phi value (gauge)
  • failure_detector_suspected_total{peer="node-X"} — suspicion transitions (counter)

Thread Safety

Uses ReentrantReadWriteLock for concurrent access. All operations are virtual-thread friendly and avoid blocking in hot paths. Read locks are preferred for phi() calculations.
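The locking pattern described above might look like the following sketch. The class LockedStats and its fields are hypothetical; only the use of ReentrantReadWriteLock is taken from this documentation.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: writers take the write lock, readers share the read lock. */
public class LockedStats {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long count = 0;
    private double sumMs = 0.0;

    /** Mutating path (analogous to recording a heartbeat): exclusive write lock. */
    public void record(double intervalMs) {
        lock.writeLock().lock();
        try {
            count++;
            sumMs += intervalMs;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Read-only path (analogous to computing phi): shared read lock. */
    public double meanMs() {
        lock.readLock().lock();
        try {
            return count == 0 ? 0.0 : sumMs / count;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Multiple readers can hold the read lock at once, so concurrent reads do not serialize behind each other; only a write excludes them.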
  • Constructor Summary

    Constructors
    Constructor
    Description
    PhiAccrualFailureDetector()
    Create a phi-accrual failure detector with default settings: phi threshold 8.0 (suitable for LAN), sample size 100, adaptive threshold disabled, no metrics.
    PhiAccrualFailureDetector(double phiThreshold)
    Create a phi-accrual failure detector with custom threshold.
    PhiAccrualFailureDetector(double phiThreshold, int sampleSize)
    Create a phi-accrual failure detector with custom configuration (no metrics).
    PhiAccrualFailureDetector(String peerId, double phiThreshold, int sampleSize, boolean enableAdaptiveThreshold, @Nullable io.micrometer.core.instrument.MeterRegistry meterRegistry)
    Create a phi-accrual failure detector with full custom configuration.
    PhiAccrualFailureDetector(String peerId, double threshold, io.micrometer.core.instrument.MeterRegistry registry)
    Create a phi-accrual failure detector with metrics support.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    getEffectiveThreshold()
    Get the effective phi threshold (may be adapted if adaptive thresholding is enabled).
    double
    getMeanInterArrivalMs()
    Get the mean inter-arrival time (in milliseconds).
    double
    getPhiThreshold()
    Get the base (non-adaptive) phi threshold.
    int
    getSampleCount()
    Get the current number of inter-arrival time samples.
    double
    getStdDevInterArrivalMs()
    Get the standard deviation of inter-arrival times (in milliseconds).
    double
    getTimeSinceLastHeartbeatMs()
    Get the elapsed time since the last heartbeat (in milliseconds).
    void
    heartbeat()
    Record the arrival of a heartbeat from the monitored node.
    boolean
    isAdaptiveThresholdEnabled()
    Check if adaptive thresholding is enabled.
    boolean
    isAvailable()
    Determine if the monitored node is currently available.
    double
    phi()
    Calculate the current phi value.
    void
    reset()
    Reset the detector state.

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • PhiAccrualFailureDetector

      public PhiAccrualFailureDetector(String peerId, double phiThreshold, int sampleSize, boolean enableAdaptiveThreshold, @Nullable io.micrometer.core.instrument.MeterRegistry meterRegistry)
      Create a phi-accrual failure detector with full custom configuration.
      Parameters:
      peerId - identifier of the peer being monitored (for logging and metrics)
      phiThreshold - the phi value above which a node is considered suspicious
      sampleSize - the window size for inter-arrival times
      enableAdaptiveThreshold - whether to adaptively adjust threshold based on latency variance
      meterRegistry - optional MeterRegistry for metrics (can be null to disable metrics)
    • PhiAccrualFailureDetector

      public PhiAccrualFailureDetector(double phiThreshold, int sampleSize)
      Create a phi-accrual failure detector with custom configuration (no metrics).
      Parameters:
      phiThreshold - the phi value above which a node is considered suspicious
      sampleSize - the window size for inter-arrival times
    • PhiAccrualFailureDetector

      public PhiAccrualFailureDetector()
      Create a phi-accrual failure detector with default settings: phi threshold 8.0 (suitable for LAN), sample size 100, adaptive threshold disabled, no metrics.
    • PhiAccrualFailureDetector

      public PhiAccrualFailureDetector(double phiThreshold)
      Create a phi-accrual failure detector with custom threshold.
      Parameters:
      phiThreshold - the phi value above which a node is considered suspicious
    • PhiAccrualFailureDetector

      public PhiAccrualFailureDetector(String peerId, double threshold, io.micrometer.core.instrument.MeterRegistry registry)
      Create a phi-accrual failure detector with metrics support.
      Parameters:
      peerId - identifier of the peer being monitored
      threshold - the phi threshold
      registry - the MeterRegistry for exporting metrics
  • Method Details

    • heartbeat

      public void heartbeat()
      Record the arrival of a heartbeat from the monitored node.

      Updates the inter-arrival time since the previous heartbeat and maintains the sliding window. This method is thread-safe and virtual-thread friendly (uses ReentrantReadWriteLock, not synchronized).

    • phi

      public double phi()
      Calculate the current phi value.

      Phi represents the suspicion level that the node has crashed. Higher values indicate greater suspicion.

      The formula is: phi = -log10(P_later(timeSinceLastHeartbeat)) where P_later is the probability that a heartbeat would arrive later than the given time, assuming exponential inter-arrival distribution.

      Returns 0.0 if no inter-arrival times have been recorded yet.

      Returns:
      the current phi value (typically 0-15)
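For an exponential distribution with mean inter-arrival time m, P_later(t) = exp(-t / m), so the formula above reduces to phi = t / (m × ln 10). The sketch below works through that arithmetic. The class PhiSketch and its method signature are illustrative assumptions and not part of this API.

```java
/** Hypothetical sketch of the phi formula under an exponential model. */
public class PhiSketch {
    /**
     * phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10).
     * Returns 0.0 when no mean is available yet (no samples recorded).
     */
    public static double phi(double elapsedMs, double meanIntervalMs) {
        if (meanIntervalMs <= 0.0) {
            return 0.0;
        }
        return elapsedMs / (meanIntervalMs * Math.log(10.0));
    }

    public static void main(String[] args) {
        double meanMs = 1_000.0; // assume heartbeats arrive about once per second
        System.out.println("phi after 1 s of silence:  " + phi(1_000.0, meanMs));
        System.out.println("phi after 20 s of silence: " + phi(20_000.0, meanMs));
    }
}
```

Under this model phi grows linearly with silence: with a 1-second mean interval, a threshold of 8.0 is crossed after roughly 18.4 seconds (8 × ln 10 ≈ 18.42 mean intervals).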
    • isAvailable

      public boolean isAvailable()
      Determine if the monitored node is currently available.

      A node is considered available if phi < effective threshold. The effective threshold is either the base threshold or the adapted threshold (if adaptive thresholding is enabled).

      This is the primary interface for failure detection. Tracks suspicion transitions for metrics.

      Returns:
      true if the node is available (alive), false if suspected dead
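Tracking suspicion transitions means counting only the moments when the node changes from available to suspected, not every check that finds it suspected. A minimal sketch of that idea follows; the SuspicionTracker class and its check method are hypothetical, and the strict comparison phi < threshold is assumed from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: count available -> suspected transitions only. */
public class SuspicionTracker {
    private final AtomicBoolean suspected = new AtomicBoolean(false);
    private final AtomicLong transitions = new AtomicLong();

    /** Returns true if the node is available; records a transition on first suspicion. */
    public boolean check(double phi, double effectiveThreshold) {
        boolean available = phi < effectiveThreshold;
        if (available) {
            suspected.set(false); // node recovered; the next suspicion counts again
        } else if (suspected.compareAndSet(false, true)) {
            transitions.incrementAndGet(); // only the first failing check increments
        }
        return available;
    }

    public long transitions() {
        return transitions.get();
    }
}
```

A counter built this way stays stable while a node remains suspected, which keeps a metric such as failure_detector_suspected_total meaningful regardless of how often availability is polled.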
    • getPhiThreshold

      public double getPhiThreshold()
      Get the base (non-adaptive) phi threshold.
      Returns:
      the base phi threshold
    • getEffectiveThreshold

      public double getEffectiveThreshold()
      Get the effective phi threshold (may be adapted if adaptive thresholding is enabled).
      Returns:
      the effective threshold used for availability determination
    • isAdaptiveThresholdEnabled

      public boolean isAdaptiveThresholdEnabled()
      Check if adaptive thresholding is enabled.
      Returns:
      true if adaptive threshold adjustment is active
    • getSampleCount

      public int getSampleCount()
      Get the current number of inter-arrival time samples.
      Returns:
      number of samples currently in the window
    • getMeanInterArrivalMs

      public double getMeanInterArrivalMs()
      Get the mean inter-arrival time (in milliseconds). Returns 0.0 if no samples have been recorded.
      Returns:
      mean inter-arrival time in milliseconds
    • getStdDevInterArrivalMs

      public double getStdDevInterArrivalMs()
      Get the standard deviation of inter-arrival times (in milliseconds). Returns 0.0 if no samples have been recorded.
      Returns:
      standard deviation of inter-arrival times in milliseconds
    • getTimeSinceLastHeartbeatMs

      public double getTimeSinceLastHeartbeatMs()
      Get the elapsed time since the last heartbeat (in milliseconds).
      Returns:
      time in milliseconds since the last recorded heartbeat
    • reset

      public void reset()
      Reset the detector state. Useful for reinitializing the detector for a node that previously failed. Clears all inter-arrival time samples and resets the adaptive threshold to the base value.