Health Checking
RpcNet uses the Phi Accrual Failure Detector algorithm for accurate and adaptive health checking. This chapter explains how RpcNet determines which nodes are healthy and when to mark them as failed.
The Problem with Binary Health Checks
Traditional health checks use binary logic:
    if ping_timeout:
        node_is_failed = True
    else:
        node_is_healthy = True
Problems:
- Fixed threshold: 500ms timeout doesn't adapt to network conditions
- False positives: Temporary slowdown triggers failure
- False negatives: Slow node stays "healthy" until timeout
- No confidence: Can't express "probably failed" vs "definitely failed"
Phi Accrual Solution
The Phi Accrual algorithm provides a continuous suspicion level instead of binary alive/dead:
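Phi Value (Φ) = Suspicion Level
Φ = 0 → Node is responding normally
Φ = 5 → Moderate suspicion (about a 1 in 10^5 chance the node is only late)
Φ = 8 → High suspicion (about 1 in 10^8) ← Typical threshold
Φ = 10 → Very high suspicion (about 1 in 10^10)
Φ = 15+ → Almost certainly failed
These odds assume the original definition, Φ = −log10(P_later), where P_later is the probability that a live node's heartbeat would arrive even later than the delay observed so far. Each additional unit of Φ makes a wrong suspicion ten times less likely.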
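To make the scale concrete, the sketch below converts a Φ value back into the probability it encodes, assuming the original definition Φ = −log10(P_later). The function names are illustrative and not part of RpcNet's API.

```rust
/// Probability that a live node's heartbeat would arrive even later than
/// the delay observed so far, assuming phi = -log10(P_later).
fn p_later_from_phi(phi: f64) -> f64 {
    10f64.powf(-phi)
}

/// Expected number of wrong suspicions across `checks` independent checks
/// evaluated exactly at threshold `phi` (an idealised estimate only).
fn expected_false_suspicions(phi: f64, checks: f64) -> f64 {
    checks * p_later_from_phi(phi)
}
```

For instance, 100 nodes checked once per second produce 8,640,000 checks per day, so under this idealised model a threshold of 8 corresponds to roughly 0.09 expected wrong suspicions per day.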
How It Works
1. Track Heartbeat History
    struct HeartbeatHistory {
        intervals: Vec<Duration>, // Last N intervals between heartbeats
        last_heartbeat: Instant,  // When we last heard from node
    }
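One way to keep such a history bounded is a ring-style window that evicts the oldest interval once it is full. The following is a minimal sketch; `HeartbeatWindow` and its methods are hypothetical names, not RpcNet types, and timestamps are passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Bounded window of heartbeat inter-arrival times (hypothetical type).
struct HeartbeatWindow {
    intervals: VecDeque<Duration>,
    last_heartbeat: Option<Instant>,
    capacity: usize,
}

impl HeartbeatWindow {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one interval");
        Self {
            intervals: VecDeque::with_capacity(capacity),
            last_heartbeat: None,
            capacity,
        }
    }

    /// Record a heartbeat observed at `now`. The first heartbeat only sets
    /// the reference point; later ones append the interval since the
    /// previous heartbeat, evicting the oldest entry when the window is full.
    fn record(&mut self, now: Instant) {
        if let Some(prev) = self.last_heartbeat {
            if self.intervals.len() == self.capacity {
                self.intervals.pop_front();
            }
            self.intervals.push_back(now.duration_since(prev));
        }
        self.last_heartbeat = Some(now);
    }
}
```

A `VecDeque` is used here instead of a `Vec` so that dropping the oldest interval does not shift every remaining element.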
2. Calculate Expected Interval
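    /// Mean interval between heartbeats (zero while the history is empty).
    fn mean_interval(&self) -> Duration {
        if self.intervals.is_empty() {
            return Duration::ZERO;
        }
        // Duration divides by u32, so the length must be cast from usize.
        self.intervals.iter().sum::<Duration>() / self.intervals.len() as u32
    }

    /// Population standard deviation of the recorded intervals.
    fn std_deviation(&self) -> Duration {
        if self.intervals.is_empty() {
            return Duration::ZERO;
        }
        let mean = self.mean_interval().as_secs_f64();
        let variance = self.intervals
            .iter()
            .map(|interval| {
                let diff = interval.as_secs_f64() - mean;
                diff * diff
            })
            .sum::<f64>() / self.intervals.len() as f64;
        Duration::from_secs_f64(variance.sqrt())
    }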
3. Compute Phi
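    fn phi(&self) -> f64 {
        let elapsed = self.last_heartbeat.elapsed().as_secs_f64();
        let mean = self.mean_interval().as_secs_f64();
        // Floor of 50 ms (illustrative) so a perfectly regular history cannot divide by zero.
        let std_dev = self.std_deviation().as_secs_f64().max(0.05);

        // How many standard deviations away is the current delay?
        let y = (elapsed - mean) / std_dev;

        // Convert to phi = -log10(P_later), where P_later is the probability that
        // a live node's heartbeat arrives even later than `elapsed`. The normal
        // tail is approximated with a logistic curve, as Akka's detector does.
        let e = (-y * (1.5976 + 0.070566 * y * y)).exp();
        if elapsed > mean {
            -(e / (1.0 + e)).log10()
        } else {
            -(1.0 - 1.0 / (1.0 + e)).log10()
        }
    }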
4. Determine Failure
    const PHI_THRESHOLD: f64 = 8.0; // Configurable

    if phi() > PHI_THRESHOLD {
        mark_node_as_failed();
    }
Visualization
Example 1: Healthy Node
Heartbeats arrive regularly every ~1 second:
    Time (s):   0  1  2  3  4  5  6  7  8
    Heartbeat:  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓
    Phi:        0  0  0  0  0  0  0  0  0
Status: Healthy (Φ = 0)
Example 2: Temporary Network Glitch
Heartbeats delayed but node recovers:
    Time (s):   0  1  2  3  4  5  6  7  8
    Heartbeat:  ✓  ✓  ✓  .  .  ✓  ✓  ✓  ✓
    Phi:        0  0  0  2  5  2  0  0  0
                            ▲
                            Elevated but below threshold
Status: Suspect briefly, but recovers (no failure declared)
Example 3: Actual Failure
Heartbeats stop after node crashes:
    Time (s):   0  1  2  3  4  5  6  7  8
    Heartbeat:  ✓  ✓  ✓  X  .  .  .  .  .
    Phi:        0  0  0  2  5  8  11 14 17
                               ▲
                               Exceeds threshold → FAILED
Status: Failed (Φ = 8+)
Adaptive Behavior
Phi Accrual adapts to network conditions automatically:
Stable Network
History: [1.0s, 1.0s, 1.0s, 1.0s, 1.0s]
Mean: 1.0s
Std Dev: 0.0s (very predictable; clamped to the configured minimum, e.g. 50ms)
Current delay: 1.5s
Phi: far above 8.0 → FAILURE (unusual for this stable network)
Variable Network
History: [0.8s, 1.2s, 0.9s, 1.4s, 1.0s]
Mean: 1.06s
Std Dev: 0.24s (more variable)
Current delay: 1.5s
Phi: ≈ 1.5 → HEALTHY (normal variation)
Key insight: Same 1.5s delay is interpreted differently based on historical patterns.
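The two scenarios can be reproduced with a short calculation. The sketch below assumes Φ = −log10(P_later) with a normal model of heartbeat intervals, approximates the normal tail with the logistic curve used by Akka's detector, and applies a 50 ms floor on the standard deviation. The function names are illustrative and not RpcNet's API.

```rust
/// Population mean and standard deviation of intervals given in seconds.
fn mean_and_std(intervals: &[f64]) -> (f64, f64) {
    let n = intervals.len() as f64;
    let mean = intervals.iter().sum::<f64>() / n;
    let variance = intervals.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, variance.sqrt())
}

/// Phi for `elapsed` seconds since the last heartbeat. The standard
/// deviation is clamped to `min_std_dev` so a perfectly regular history
/// cannot cause a division by zero.
fn phi(elapsed: f64, mean: f64, std_dev: f64, min_std_dev: f64) -> f64 {
    let y = (elapsed - mean) / std_dev.max(min_std_dev);
    // Logistic approximation of the normal tail probability.
    let e = (-y * (1.5976 + 0.070566 * y * y)).exp();
    if elapsed > mean {
        -(e / (1.0 + e)).log10()
    } else {
        -(1.0 - 1.0 / (1.0 + e)).log10()
    }
}
```

Calling `phi(1.5, mean, std, 0.05)` with the statistics of `[1.0, 1.0, 1.0, 1.0, 1.0]` yields a value far above 8, while the statistics of `[0.8, 1.2, 0.9, 1.4, 1.0]` yield a value below 2, so the same 1.5s delay is a failure on one network and ordinary jitter on the other.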
RpcNet Implementation
Configuration
    use rpcnet::cluster::{ClusterConfig, HealthCheckConfig};
    use std::time::Duration;

    let health_config = HealthCheckConfig::default()
        .with_interval(Duration::from_secs(1))              // Check every 1 second
        .with_phi_threshold(8.0)                            // Suspicion threshold
        .with_history_size(100)                             // Track last 100 intervals
        .with_min_std_deviation(Duration::from_millis(50)); // Min variation

    let cluster_config = ClusterConfig::default()
        .with_health_check(health_config);

    let cluster = ClusterMembership::new(cluster_config).await?;
Monitoring Health
    // Subscribe to health events
    let mut events = cluster.subscribe();

    while let Some(event) = events.recv().await {
        match event {
            ClusterEvent::NodeSuspect(node, phi) => {
                println!("Node {} suspect (Φ = {:.2})", node.id, phi);
            }
            ClusterEvent::NodeFailed(node) => {
                println!("Node {} failed (Φ exceeded threshold)", node.id);
            }
            ClusterEvent::NodeRecovered(node) => {
                println!("Node {} recovered (Φ back to normal)", node.id);
            }
            _ => {}
        }
    }
Custom Phi Threshold
Different thresholds for different applications:
    // Odds below assume Φ = -log10(P_later).

    // Conservative (fewer false positives, slower detection)
    .with_phi_threshold(10.0) // ~1 in 10^10 chance of a wrong suspicion

    // Aggressive (faster detection, more false positives)
    .with_phi_threshold(5.0)  // ~1 in 10^5

    // Recommended default
    .with_phi_threshold(8.0)  // ~1 in 10^8
Choosing Phi Threshold
Threshold | Confidence (≈ 1 − 10^-Φ) | False Positive Rate | Detection Time | Use Case |
---|---|---|---|---|
3.0 | 99.9% | Very High | Very Fast | Testing only |
5.0 | 99.999% | High | Fast | Aggressive failover |
8.0 | 99.999999% | Low | Moderate | Recommended |
10.0 | 1 − 10^-10 | Very Low | Slower | Critical systems |
12.0 | 1 − 10^-12 | Extremely Low | Slow | High-latency networks |
Threshold Selection Guide
Low threshold (3-5) if:
- Fast failover is critical
- False positives are acceptable
- Network is very stable
Medium threshold (6-9) if:
- You need a balance between speed and accuracy
- You run a typical production environment
- Recommended for most use cases
High threshold (10+) if:
- False positives are very costly
- Network has high variance
- Graceful degradation preferred over fast failover
Integration with SWIM
Phi Accrual works alongside SWIM's failure detection:
┌─────────────────────────────────────────────────────┐
│ SWIM Protocol │
│ │
│ 1. Gossip → Heartbeats to Phi Accrual │
│ 2. Phi Accrual → Computes suspicion level │
│ 3. Φ > threshold → Mark node as Suspect │
│ 4. Indirect probes → Verify with other nodes │
│ 5. Multiple confirmations → Mark node as Failed │
│ 6. Gossip spreads failure → All nodes updated │
└─────────────────────────────────────────────────────┘
Process:
- Regular operation: Nodes exchange gossip messages (heartbeats)
- Phi calculation: Each heartbeat updates Phi Accrual history
- Suspicion: When Φ exceeds threshold, node marked Suspect
- Verification: SWIM performs indirect probes to confirm
- Failure declaration: Multiple nodes agree → Node marked Failed
- Recovery: If heartbeats resume, Φ drops and node marked Alive again
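The process above can be read as a small state machine. The sketch below is one possible encoding, not RpcNet's implementation: `NodeState`, `next_state`, and the `required` quorum parameter are hypothetical names introduced for illustration.

```rust
/// Membership states used in the process above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Alive,
    Suspect,
    Failed,
}

/// One evaluation step for a single node. `confirmations` is the number of
/// peers whose indirect probes also failed; `required` is a hypothetical
/// quorum setting.
fn next_state(
    state: NodeState,
    phi: f64,
    threshold: f64,
    confirmations: u32,
    required: u32,
) -> NodeState {
    match state {
        // Heartbeats resumed, so phi is back at or below the threshold.
        NodeState::Suspect | NodeState::Failed if phi <= threshold => NodeState::Alive,
        // Phi crossed the threshold: suspect first, do not fail outright.
        NodeState::Alive if phi > threshold => NodeState::Suspect,
        // Enough peers confirmed the node is unreachable.
        NodeState::Suspect if confirmations >= required => NodeState::Failed,
        other => other,
    }
}
```

Keeping Suspect as a separate state means a single observer's elevated Φ never removes a node by itself; the transition to Failed needs agreement from other nodes.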
Performance Characteristics
Computational Overhead
    // Phi calculation per node per check:
    // - Mean: O(1) with running average
    // - Std dev: O(1) with running variance
    // - Phi: O(1) math operations
    // Total overhead: ~500ns per node per health check
For 100 nodes checked once per second, that is about 0.05 ms of CPU time per second (negligible).
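A constant-time update is possible by keeping a running sum and sum of squares alongside the window. The sketch below shows the idea with intervals held as `f64` seconds; `RunningStats` is a hypothetical helper, and a production version might prefer a numerically sturdier scheme such as Welford's algorithm.

```rust
use std::collections::VecDeque;

/// Sliding-window mean and variance maintained in O(1) per heartbeat.
struct RunningStats {
    window: VecDeque<f64>, // intervals in seconds
    capacity: usize,
    sum: f64,
    sum_sq: f64,
}

impl RunningStats {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one interval");
        Self { window: VecDeque::with_capacity(capacity), capacity, sum: 0.0, sum_sq: 0.0 }
    }

    /// Add an interval, subtracting the evicted one from the running sums.
    fn push(&mut self, interval: f64) {
        if self.window.len() == self.capacity {
            if let Some(old) = self.window.pop_front() {
                self.sum -= old;
                self.sum_sq -= old * old;
            }
        }
        self.window.push_back(interval);
        self.sum += interval;
        self.sum_sq += interval * interval;
    }

    fn mean(&self) -> Option<f64> {
        if self.window.is_empty() {
            None
        } else {
            Some(self.sum / self.window.len() as f64)
        }
    }

    /// Population variance, clamped at zero to absorb rounding error.
    fn variance(&self) -> Option<f64> {
        let mean = self.mean()?;
        Some((self.sum_sq / self.window.len() as f64 - mean * mean).max(0.0))
    }
}
```

Neither `mean` nor `variance` iterates over the window, so the cost per health check stays flat regardless of the history size.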
Memory Overhead
    struct NodeHealth {
        intervals: VecDeque<Duration>, // 100 entries × 16 bytes = 1.6 KB
        last_heartbeat: Instant,       // 16 bytes
        running_mean: Duration,        // 16 bytes
        running_variance: f64,         // 8 bytes
    }
    // Total per node: ~1.7 KB
For 100 nodes: ~170 KB memory (negligible)
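The buffer estimate can be checked against the compiler's own numbers. This sketch counts only the interval buffer (not the `VecDeque` header or the other fields), and the helper names are illustrative.

```rust
use std::mem::size_of;
use std::time::Duration;

/// Bytes occupied by the interval buffer alone for a given history size.
fn history_bytes(history_size: usize) -> usize {
    history_size * size_of::<Duration>()
}

/// Buffer bytes for a whole cluster: per-node buffer times node count.
fn cluster_history_bytes(nodes: usize, history_size: usize) -> usize {
    nodes * history_bytes(history_size)
}
```

On typical 64-bit targets `Duration` occupies 16 bytes, so a 100-entry history is 1,600 bytes per node and about 160 KB across 100 nodes, consistent with the ~170 KB figure once the remaining fields are included.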
Detection Time
Measured time from actual failure to detection:
Network Stability | Heartbeat Interval | Phi Threshold | Detection Time |
---|---|---|---|
Stable (σ=10ms) | 1s | 8.0 | 2-3s |
Variable (σ=200ms) | 1s | 8.0 | 4-6s |
Unstable (σ=500ms) | 1s | 8.0 | 8-12s |
Tuning for faster detection: Reduce heartbeat interval (e.g., 500ms)
Comparison to Alternatives
vs Fixed Timeout
Fixed Timeout:
✗ Doesn't adapt to network conditions
✗ Binary alive/dead (no confidence)
✓ Simple implementation
Phi Accrual:
✓ Adapts automatically
✓ Continuous suspicion level
✓ Fewer false positives
✗ More complex
vs Heartbeat Count
Heartbeat Count (miss N in a row):
✗ Slow detection (N × interval)
✗ Doesn't account for network variance
✓ Simple logic
Phi Accrual:
✓ Faster detection
✓ Accounts for network patterns
✓ Adaptive threshold
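For contrast, a "miss N in a row" rule reduces to counting whole intervals since the last heartbeat. The sketch below uses hypothetical function names and shows why detection cannot happen before N × interval has elapsed, whatever the network's history looks like.

```rust
use std::time::Duration;

/// Number of whole heartbeat intervals that have passed without a heartbeat.
fn missed_heartbeats(since_last: Duration, interval: Duration) -> u32 {
    (since_last.as_secs_f64() / interval.as_secs_f64()).floor() as u32
}

/// "Miss N in a row" rule: failed once at least `n` heartbeats were missed.
fn failed_by_count(since_last: Duration, interval: Duration, n: u32) -> bool {
    missed_heartbeats(since_last, interval) >= n
}
```

With a 1s interval and N = 3, a node that has been silent for 2.9s is still considered healthy, even on a network where heartbeats normally arrive within a few milliseconds of schedule.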
vs Gossip Only
Gossip Only (no Phi):
✗ Hard threshold (suspect → failed)
✗ Doesn't adapt to network
✓ Simpler protocol
Gossip + Phi:
✓ Smooth suspicion curve
✓ Adapts to network conditions
✓ More accurate detection
Best Practices
1. Tune for Your Network
    // Measure your network characteristics first
    async fn measure_network_latency() -> (Duration, Duration) {
        let mut latencies = Vec::new();

        for _ in 0..100 {
            let start = Instant::now();
            ping_peer().await.unwrap();
            latencies.push(start.elapsed());
        }

        // Duration divides by u32, so cast the sample count from usize.
        let mean = latencies.iter().sum::<Duration>() / latencies.len() as u32;
        let variance = latencies.iter()
            .map(|&d| (d.as_secs_f64() - mean.as_secs_f64()).powi(2))
            .sum::<f64>() / latencies.len() as f64;
        let std_dev = Duration::from_secs_f64(variance.sqrt());

        println!("Network latency: {:.2?} ± {:.2?}", mean, std_dev);
        (mean, std_dev)
    }

    // Then configure accordingly
    let (mean, std_dev) = measure_network_latency().await;

    let health_config = HealthCheckConfig::default()
        .with_interval(mean * 2) // Check at 2× mean latency
        .with_phi_threshold(8.0)
        .with_min_std_deviation(std_dev);
2. Monitor Phi Values
    // Log phi values to understand patterns
    async fn monitor_phi_values(cluster: Arc<ClusterMembership>) {
        loop {
            tokio::time::sleep(Duration::from_secs(10)).await;

            for node in cluster.nodes().await {
                let phi = cluster.phi(node.id).await.unwrap_or(0.0);

                if phi > 5.0 {
                    log::warn!("Node {} phi elevated: {:.2}", node.id, phi);
                }

                metrics::gauge!("cluster.node.phi", phi, "node" => node.id.to_string());
            }
        }
    }
3. Handle Suspicion State
    // Don't immediately fail on suspicion - investigate first
    let mut events = cluster.subscribe();

    while let Some(event) = events.recv().await {
        match event {
            ClusterEvent::NodeSuspect(node, phi) => {
                log::warn!("Node {} suspect (Φ = {:.2}), investigating...", node.id, phi);

                // Trigger additional checks
                tokio::spawn(async move {
                    if let Err(e) = verify_node_health(&node).await {
                        log::error!("Node {} verification failed: {}", node.id, e);
                    }
                });
            }
            ClusterEvent::NodeFailed(node) => {
                log::error!("Node {} failed, removing from pool", node.id);
                remove_from_worker_pool(node.id).await;
            }
            _ => {}
        }
    }
4. Adjust History Size
    // Larger history = more stable, slower adaptation
    .with_history_size(200) // For very stable networks

    // Smaller history = faster adaptation to changes
    .with_history_size(50)  // For dynamic networks

    // Default (recommended)
    .with_history_size(100)
5. Set Minimum Standard Deviation
    // Prevent division by zero and overly sensitive detection
    .with_min_std_deviation(Duration::from_millis(50))

    // Higher min = less sensitive to small variations
    .with_min_std_deviation(Duration::from_millis(100))
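The effect of the floor is easiest to see on the z-score that feeds Φ. The sketch below is illustrative only (the function name and signature are assumptions): it clamps the measured standard deviation before dividing, and expects `min_std_dev` to be non-zero.

```rust
use std::time::Duration;

/// Z-score of the current delay, with the standard deviation clamped to a
/// floor. `min_std_dev` must be non-zero for the result to be finite.
fn z_score(
    since_last: Duration,
    mean: Duration,
    std_dev: Duration,
    min_std_dev: Duration,
) -> f64 {
    let sigma = std_dev.max(min_std_dev).as_secs_f64();
    (since_last.as_secs_f64() - mean.as_secs_f64()) / sigma
}
```

For a heartbeat that is 100ms late on a perfectly regular history, a 50ms floor gives a z-score of 2, while a 100ms floor gives 1, so the higher floor reacts less strongly to the same small delay.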
Troubleshooting
False Positives (Node marked failed but is alive)
Symptoms:
- Nodes frequently marked failed and recovered
- Phi threshold exceeded during normal operation
Debug:
    // Log phi values and intervals
    for node in cluster.nodes().await {
        let phi = cluster.phi(node.id).await.unwrap_or(0.0);
        let history = cluster.heartbeat_history(node.id).await;
        println!("Node {}: Φ = {:.2}, intervals = {:?}", node.id, phi, history);
    }
Solutions:
- Increase phi threshold (8.0 → 10.0)
- Increase heartbeat interval to match network latency
- Increase min_std_deviation for variable networks
Slow Detection (Failures take too long to detect)
Symptoms:
- Nodes crash but stay marked alive for minutes
- Requests keep routing to failed nodes
Debug:
    // Measure actual detection time
    let failure_time = Instant::now();
    // ... node fails ...
    let detection_time = cluster.wait_for_failure(node_id).await;
    println!("Detection took: {:?}", detection_time.duration_since(failure_time));
Solutions:
- Decrease phi threshold (8.0 → 6.0)
- Decrease heartbeat interval (1s → 500ms)
- Decrease suspicion timeout
Memory Growth
Symptoms:
- Memory usage grows over time
- History buffers not bounded
Debug:
    // Check history sizes
    for node in cluster.nodes().await {
        let history = cluster.heartbeat_history(node.id).await;
        println!("Node {}: {} intervals tracked", node.id, history.len());
    }
Solutions:
- Ensure history_size is set (default: 100)
- Verify old entries are removed
- Check for node ID leaks
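A node ID leak usually means per-node histories outlive the nodes themselves. If you maintain your own per-node state alongside the cluster, one way to guard against that is to prune it against the current membership. In this sketch `u64` stands in for the node ID type and `prune_histories` is a hypothetical helper.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::time::Duration;

/// Drop heartbeat histories for nodes that are no longer cluster members.
/// Returns how many entries were removed.
fn prune_histories(
    histories: &mut HashMap<u64, VecDeque<Duration>>,
    live_nodes: &HashSet<u64>,
) -> usize {
    let before = histories.len();
    histories.retain(|id, _| live_nodes.contains(id));
    before - histories.len()
}
```

Logging the returned count on each pass gives a quick signal: a steadily non-zero value suggests that departed nodes are not being cleaned up at the point where they leave.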
Advanced Topics
Combining Multiple Detectors
Use Phi Accrual for heartbeats AND application-level health:
    struct CompositeHealthCheck {
        phi_detector: PhiAccrualDetector,
        app_health: Arc<Mutex<HashMap<Uuid, bool>>>,
    }

    impl CompositeHealthCheck {
        async fn is_healthy(&self, node_id: Uuid) -> bool {
            // Both phi and application health must be good
            let phi = self.phi_detector.phi(node_id);
            let app_healthy = self.app_health.lock().await.get(&node_id).copied().unwrap_or(false);

            phi < PHI_THRESHOLD && app_healthy
        }
    }
Weighted Phi Thresholds
Different thresholds for different node types:
    // Assumes `node.tags` is a HashMap<String, String>.
    fn get_phi_threshold(node: &Node) -> f64 {
        // `get` returns Option<&String>; convert to Option<&str> to match on literals.
        match node.tags.get("criticality").map(String::as_str) {
            Some("high") => 10.0, // Very conservative for critical nodes
            Some("low") => 6.0,   // Aggressive for non-critical
            _ => 8.0,             // Default
        }
    }