Class PhiAccrualFailureDetector
Unlike a fixed timeout, this detector adapts to network conditions automatically. Apache Cassandra uses the same technique for gossip-based failure detection.
Key Concepts
- phi = -log10(P_later(timeSinceLastHeartbeat))
- phi > threshold → node suspected dead
- Threshold: 8.0 for LAN, 10-12 for WAN/cloud
- Adaptive threshold: automatically adjusted based on observed network latency variance
Implementation Details
Based on "The Phi Accrual Failure Detector" paper (Hayashibara et al., 2004). The detector maintains a sliding window of inter-arrival times and models them using an exponential distribution. For each heartbeat received, the time since the previous one is recorded. Phi is calculated as the negative logarithm (base 10) of the probability that a heartbeat would have arrived later than the current elapsed time, had the node been alive.
Adaptive Threshold
When enabled, the threshold is adjusted dynamically based on the coefficient of variation (stddev/mean) of inter-arrival times. High-variance environments (WAN, cloud) automatically get higher thresholds to reduce false positives. The adaptation is bounded to prevent extreme values.
Metrics
Exports the following Micrometer metrics (when a registry is provided):
- failure_detector_phi{peer="node-X"}: current phi value (gauge)
- failure_detector_suspected_total{peer="node-X"}: suspicion transitions (counter)
Thread Safety
Uses ReentrantReadWriteLock for concurrent access. All operations are virtual-thread friendly and avoid blocking in hot paths. Read locks are preferred for phi() calculations.
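The read/write split described above can be illustrated with a minimal sketch. The class and method names here (LockedMeanSketch, record, mean) are hypothetical and are not part of this API; a running sum stands in for the sample window, and the sketch shows only the locking pattern, not the detector's actual internals.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of the locking pattern; not the library's class. */
final class LockedMeanSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private double sum;   // guarded by lock
    private int count;    // guarded by lock

    /** Mutating path: takes the write lock, as a heartbeat update would. */
    void record(double intervalMs) {
        lock.writeLock().lock();
        try {
            sum += intervalMs;
            count++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Read-only path: takes the read lock, so concurrent readers do not block each other. */
    double mean() {
        lock.readLock().lock();
        try {
            return count == 0 ? 0.0 : sum / count;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Using explicit locks instead of synchronized blocks is one way to keep such code friendly to virtual threads, which is the rationale the text gives.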
Constructor Summary
Constructors
- PhiAccrualFailureDetector(): Create a phi-accrual failure detector with default settings (phi threshold: 8.0, suitable for LAN; sample size: 100; adaptive threshold: disabled; no metrics).
- PhiAccrualFailureDetector(double phiThreshold): Create a phi-accrual failure detector with custom threshold.
- PhiAccrualFailureDetector(double phiThreshold, int sampleSize): Create a phi-accrual failure detector with custom configuration (no metrics).
- PhiAccrualFailureDetector(String peerId, double phiThreshold, int sampleSize, boolean enableAdaptiveThreshold, @Nullable io.micrometer.core.instrument.MeterRegistry meterRegistry): Create a phi-accrual failure detector with full custom configuration.
- PhiAccrualFailureDetector(String peerId, double threshold, io.micrometer.core.instrument.MeterRegistry registry): Create a phi-accrual failure detector with metrics support.
Method Summary
- double getEffectiveThreshold(): Get the effective phi threshold (may be adapted if adaptive thresholding is enabled).
- double getMeanInterArrivalMs(): Get the mean inter-arrival time (in milliseconds).
- double getPhiThreshold(): Get the base (non-adaptive) phi threshold.
- int getSampleCount(): Get the current number of inter-arrival time samples.
- double getStdDevInterArrivalMs(): Get the standard deviation of inter-arrival times (in milliseconds).
- double getTimeSinceLastHeartbeatMs(): Get the elapsed time since the last heartbeat (in milliseconds).
- void heartbeat(): Record the arrival of a heartbeat from the monitored node.
- boolean isAdaptiveThresholdEnabled(): Check if adaptive thresholding is enabled.
- boolean isAvailable(): Determine if the monitored node is currently available.
- double phi(): Calculate the current phi value.
- void reset(): Reset the detector state.
Constructor Details
PhiAccrualFailureDetector
public PhiAccrualFailureDetector(String peerId, double phiThreshold, int sampleSize, boolean enableAdaptiveThreshold, @Nullable io.micrometer.core.instrument.MeterRegistry meterRegistry)
Create a phi-accrual failure detector with full custom configuration.
Parameters:
- peerId: identifier of the peer being monitored (for logging and metrics)
- phiThreshold: the phi value above which a node is considered suspicious
- sampleSize: the window size for inter-arrival times
- enableAdaptiveThreshold: whether to adaptively adjust the threshold based on latency variance
- meterRegistry: optional MeterRegistry for metrics (can be null to disable metrics)
PhiAccrualFailureDetector
public PhiAccrualFailureDetector(double phiThreshold, int sampleSize)
Create a phi-accrual failure detector with custom configuration (no metrics).
Parameters:
- phiThreshold: the phi value above which a node is considered suspicious
- sampleSize: the window size for inter-arrival times
PhiAccrualFailureDetector
public PhiAccrualFailureDetector()
Create a phi-accrual failure detector with default settings:
- phi threshold: 8.0 (suitable for LAN)
- sample size: 100
- adaptive threshold: disabled
- no metrics
PhiAccrualFailureDetector
public PhiAccrualFailureDetector(double phiThreshold)
Create a phi-accrual failure detector with custom threshold.
Parameters:
- phiThreshold: the phi value above which a node is considered suspicious
PhiAccrualFailureDetector
public PhiAccrualFailureDetector(String peerId, double threshold, io.micrometer.core.instrument.MeterRegistry registry)
Create a phi-accrual failure detector with metrics support.
Parameters:
- peerId: identifier of the peer being monitored
- threshold: the phi threshold
- registry: the MeterRegistry for exporting metrics
Method Details
heartbeat
public void heartbeat()
Record the arrival of a heartbeat from the monitored node.
Updates the inter-arrival time since the previous heartbeat and maintains the sliding window. This method is thread-safe and virtual-thread friendly (uses ReentrantReadWriteLock, not synchronized).
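As a rough illustration of the sliding-window bookkeeping described for heartbeat(), the sketch below keeps a bounded queue of inter-arrival times. InterArrivalWindow and its members are hypothetical names, the timestamp is passed in explicitly for clarity, and locking is omitted; none of this is taken from the library's source.

```java
import java.util.ArrayDeque;

/** Hypothetical helper, not the library's class: a bounded window of inter-arrival times. */
final class InterArrivalWindow {
    private final int capacity;
    private final ArrayDeque<Double> samples = new ArrayDeque<>();
    private double sum;               // running sum so the mean is O(1)
    private long lastArrivalMs = -1;  // -1 means no heartbeat has been seen yet

    InterArrivalWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Record a heartbeat observed at the given timestamp (milliseconds). */
    void heartbeat(long nowMs) {
        if (lastArrivalMs >= 0) {
            double interval = nowMs - lastArrivalMs;
            if (samples.size() == capacity) {
                sum -= samples.removeFirst(); // evict the oldest sample
            }
            samples.addLast(interval);
            sum += interval;
        }
        lastArrivalMs = nowMs; // the first heartbeat only sets the baseline
    }

    int sampleCount() {
        return samples.size();
    }

    double meanMs() {
        return samples.isEmpty() ? 0.0 : sum / samples.size();
    }
}
```

Note that the first heartbeat produces no sample, since there is no previous arrival to measure against.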
phi
public double phi()
Calculate the current phi value.
Phi represents the suspicion level that the node has crashed. Higher values indicate greater suspicion.
The formula is: phi = -log10(P_later(timeSinceLastHeartbeat)), where P_later is the probability that a heartbeat would arrive later than the given time, assuming an exponential inter-arrival distribution.
Returns 0.0 if no inter-arrival times have been recorded yet.
Returns: the current phi value (typically 0-15)
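To make the formula concrete: under an exponential model with mean inter-arrival time m, P_later(t) = exp(-t / m), so phi = -log10(exp(-t / m)) = t / (m * ln 10). The sketch below computes that closed form. PhiSketch is a hypothetical name, and this is an illustration of the stated formula rather than the library's actual code.

```java
/** Hypothetical illustration of the phi formula under an exponential model. */
final class PhiSketch {
    private PhiSketch() {}

    /**
     * phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10).
     * Returns 0.0 when there is no usable mean (no samples yet).
     */
    static double phi(double elapsedMs, double meanIntervalMs) {
        if (meanIntervalMs <= 0.0) {
            return 0.0;
        }
        return elapsedMs / (meanIntervalMs * Math.log(10.0));
    }
}
```

Under this model, with a mean interval of 1000 ms, phi reaches the default threshold of 8.0 after roughly 18.4 seconds without a heartbeat.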
isAvailable
public boolean isAvailable()
Determine if the monitored node is currently available.
A node is considered available if phi < effective threshold. The effective threshold is either the base threshold or the adapted threshold (if adaptive thresholding is enabled).
This is the primary interface for failure detection. Tracks suspicion transitions for metrics.
Returns: true if the node is available (alive), false if suspected dead
getPhiThreshold
public double getPhiThreshold()
Get the base (non-adaptive) phi threshold.
Returns: the base phi threshold
getEffectiveThreshold
public double getEffectiveThreshold()
Get the effective phi threshold (may be adapted if adaptive thresholding is enabled).
Returns: the effective threshold used for availability determination
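The documentation says only that the effective threshold grows with the coefficient of variation (stddev/mean) and is bounded. One possible shape for such an adjustment is sketched below. The scaling rule (1 + cv), the MAX_SCALE cap of 1.5, and the class name AdaptiveThresholdSketch are all assumptions made for illustration; the library's actual rule and bounds may differ.

```java
/** Hypothetical adaptive-threshold rule; the scaling and the cap are assumed values. */
final class AdaptiveThresholdSketch {
    /** Assumed upper bound on the scaling factor, not taken from the library. */
    static final double MAX_SCALE = 1.5;

    private AdaptiveThresholdSketch() {}

    static double effectiveThreshold(double baseThreshold, double meanMs, double stdDevMs) {
        if (meanMs <= 0.0) {
            return baseThreshold; // no samples: fall back to the base threshold
        }
        double cv = stdDevMs / meanMs;                 // coefficient of variation
        double scale = Math.min(1.0 + cv, MAX_SCALE);  // bounded so jitter cannot inflate it without limit
        return baseThreshold * scale;
    }
}
```

With these assumed constants, a base threshold of 8.0 stays at 8.0 on a jitter-free link and is capped at 12.0 however noisy the link becomes, which lands in the 10-12 range suggested for WAN and cloud deployments.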
isAdaptiveThresholdEnabled
public boolean isAdaptiveThresholdEnabled()
Check if adaptive thresholding is enabled.
Returns: true if adaptive threshold adjustment is active
getSampleCount
public int getSampleCount()
Get the current number of inter-arrival time samples.
Returns: number of samples currently in the window
getMeanInterArrivalMs
public double getMeanInterArrivalMs()
Get the mean inter-arrival time (in milliseconds). Returns 0.0 if no samples have been recorded.
Returns: mean inter-arrival time in milliseconds
getStdDevInterArrivalMs
public double getStdDevInterArrivalMs()
Get the standard deviation of inter-arrival times (in milliseconds). Returns 0.0 if no samples have been recorded.
Returns: standard deviation of inter-arrival times in milliseconds
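For reference, the mean and standard deviation of a window of samples can be computed as in the sketch below. WindowStats is a hypothetical helper, and it uses the population form (dividing by n); whether the library divides by n or n - 1 is not stated in this documentation.

```java
/** Hypothetical helper showing mean and population standard deviation of a sample window. */
final class WindowStats {
    private WindowStats() {}

    static double mean(double[] samplesMs) {
        if (samplesMs.length == 0) {
            return 0.0; // mirrors the documented 0.0 for an empty window
        }
        double sum = 0.0;
        for (double s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.length;
    }

    static double stdDev(double[] samplesMs) {
        if (samplesMs.length == 0) {
            return 0.0;
        }
        double m = mean(samplesMs);
        double squaredDiffs = 0.0;
        for (double s : samplesMs) {
            squaredDiffs += (s - m) * (s - m);
        }
        return Math.sqrt(squaredDiffs / samplesMs.length); // population form (assumed)
    }
}
```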
getTimeSinceLastHeartbeatMs
public double getTimeSinceLastHeartbeatMs()
Get the elapsed time since the last heartbeat (in milliseconds).
Returns: time in milliseconds since the last recorded heartbeat
reset
public void reset()
Reset the detector state. Useful for reinitializing a previously failed node. Clears all inter-arrival time samples and resets the adaptive threshold to the base value.