# Cluster Overview
RpcNet provides built-in support for building distributed RPC clusters with automatic service discovery, intelligent load balancing, and robust failure detection. This chapter introduces the core concepts and components of RpcNet's cluster architecture.
## What is a Cluster?
A cluster in RpcNet is a group of interconnected nodes that work together to provide distributed RPC services. Nodes automatically discover each other, share information about their state, and coordinate to handle client requests efficiently.
## Key Benefits
### Automatic Discovery
- No manual node registration required
- Nodes join and leave seamlessly
- Gossip protocol spreads information automatically
### Intelligent Load Balancing
- Multiple strategies (Round Robin, Random, Least Connections)
- Tracks active connections per node
- Prevents overload on individual nodes
### Robust Failure Detection
- Phi Accrual failure detection algorithm
- Adapts to network conditions
- Distinguishes between slow and failed nodes
### Tag-Based Routing
- Route requests by node capabilities
- Filter by zone, hardware type, role, etc.
- Enables heterogeneous worker pools
## Architecture Components
RpcNet's cluster architecture consists of several key components that work together:
```text
┌─────────────────────────────────────────────────────────┐
│                    Application Layer                    │
│           (Your RPC handlers, business logic)           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                      ClusterClient                      │
│  - High-level API for cluster operations                │
│  - Load-balanced request routing                        │
│  - Efficient connection management                      │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                   ┌──────────────────┐
                   │  WorkerRegistry  │
                   │  - Tracks nodes  │
                   │  - Load balance  │
                   │  - Filter tags   │
                   └─────────┬────────┘
                             │
                             ▼
                   ┌──────────────────┐
                   │   NodeRegistry   │
                   │  - All nodes     │
                   │  - Health state  │
                   │  - Metadata      │
                   └─────────┬────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                ClusterMembership (SWIM)                 │
│  - Gossip protocol for node discovery                   │
│  - Phi Accrual failure detection                        │
│  - Event notifications (NodeJoined/Left/Failed)         │
└─────────────────────────────────────────────────────────┘
```
### 1. ClusterMembership (SWIM)
The foundation of RpcNet's cluster is the SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) protocol. This provides:
- Gossip-based communication: Nodes periodically exchange information
- Failure detection: Phi Accrual algorithm detects node failures accurately
- Partition detection: Identifies network splits and handles them gracefully
- Event system: Notifies about node state changes
Key characteristics:
- Eventually consistent membership information
- Scales to thousands of nodes
- Low network overhead (UDP-based gossip)
- Handles network partitions and node churn
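To ground this, the sketch below shows roughly how a node joins the gossip layer before the registries are built on top of it. The configuration shape (`ClusterConfig`, `bind_addr`, `seeds`) is an illustrative assumption rather than the exact rpcnet API; the point is that each node only needs a gossip socket and a seed address to discover the rest of the cluster.

```rust
use std::sync::Arc;

use rpcnet::cluster::ClusterMembership;

// Hypothetical configuration -- the struct and field names here are
// assumptions for illustration, not the actual rpcnet::cluster types.
let config = ClusterConfig {
    bind_addr: "0.0.0.0:7946".parse()?,    // UDP socket used for gossip
    seeds: vec!["10.0.0.1:7946".parse()?], // any existing member works as a seed
    ..Default::default()
};

// Start gossiping. Discovery, failure detection, and membership events
// all hang off this handle; NodeRegistry and WorkerRegistry wrap it.
let cluster = Arc::new(ClusterMembership::new(config).await?);
```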
### 2. NodeRegistry
The NodeRegistry maintains a comprehensive view of all nodes in the cluster:
```rust
use std::sync::Arc;

use rpcnet::cluster::{ClusterEvent, ClusterMembership, NodeRegistry};

let registry = Arc::new(NodeRegistry::new(cluster));
registry.start().await;

// Get all nodes
let nodes = registry.nodes().await;

// Subscribe to cluster events
let mut events = registry.subscribe();
while let Some(event) = events.recv().await {
    match event {
        ClusterEvent::NodeJoined(node) => println!("Node joined: {}", node.id),
        ClusterEvent::NodeLeft(node) => println!("Node left: {}", node.id),
        ClusterEvent::NodeFailed(node) => println!("Node failed: {}", node.id),
    }
}
```
Features:
- Real-time node tracking
- Metadata storage per node
- Event subscription for state changes
- Thread-safe access via `Arc`
### 3. WorkerRegistry
The WorkerRegistry extends NodeRegistry to track worker nodes specifically:
```rust
use std::sync::Arc;

use rpcnet::cluster::{LoadBalancingStrategy, WorkerRegistry};

let registry = Arc::new(WorkerRegistry::new(
    cluster,
    LoadBalancingStrategy::LeastConnections,
));
registry.start().await;

// Select a worker (with optional tag filter)
let worker = registry.select_worker(Some("role=worker")).await?;
println!("Selected worker: {} at {}", worker.label, worker.addr);
```
Features:
- Filters nodes by tags (e.g., `role=worker`)
- Applies load balancing strategy
- Tracks active connections per worker
- Automatic removal of failed workers
### 4. ClusterClient
The ClusterClient provides a high-level API that combines all components:
```rust
use std::sync::Arc;

use rpcnet::cluster::{ClusterClient, ClusterClientConfig};

let client = Arc::new(ClusterClient::new(registry, config));

// Call any worker matching the filter
let result = client.call_worker("compute", request, Some("role=worker")).await?;
```
Features:
- Automatic worker selection
- Load-balanced request routing
- Efficient connection management
- Retry logic for failed requests
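Applications that want explicit control over retries can also layer their own policy on top of `call_worker`. The sketch below is illustrative only: the three-attempt limit, the backoff, and the assumption that the request is cheap to clone are choices made for the example, not RpcNet defaults.

```rust
use std::time::Duration;

// Hand-rolled retry around the call_worker API shown above.
let mut attempts = 0u64;
let response = loop {
    match client.call_worker("compute", request.clone(), Some("role=worker")).await {
        Ok(response) => break response,
        Err(e) if attempts < 3 => {
            attempts += 1;
            eprintln!("call failed (attempt {attempts}): {e}");
            // Back off briefly before asking the registry for another worker.
            tokio::time::sleep(Duration::from_millis(100 * attempts)).await;
        }
        // Give up once the retries are exhausted.
        Err(e) => return Err(e.into()),
    }
};
```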
## When to Use Clusters
RpcNet clusters are ideal for scenarios where you need:
### ✅ Good Use Cases
**Distributed Workload Processing**
- Multiple workers processing tasks in parallel
- Automatic load distribution across workers
- Example: Video transcoding farm, data processing pipeline
**High Availability Services**
- Services that must tolerate node failures
- Automatic failover to healthy nodes
- Example: API gateway, microservices mesh
**Dynamic Scaling**
- Add/remove nodes based on load
- Automatic discovery of new capacity
- Example: Auto-scaling worker pools, elastic compute clusters
**Heterogeneous Worker Pools**
- Different node types (GPU vs CPU, different zones)
- Tag-based routing to appropriate nodes
- Example: ML inference with GPU/CPU workers, multi-region deployments
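As a rough sketch of what tag-based routing looks like at the call site, the same `call_worker` API shown earlier simply takes a different filter string. The `gpu=true` and `role=cpu-worker` tags below are made-up labels for illustration; tags are whatever your nodes advertise.

```rust
// Prefer a GPU-tagged worker for inference; if that call fails (for
// example, no GPU worker is currently available), fall back to a CPU pool.
let result = match client.call_worker("infer", request.clone(), Some("gpu=true")).await {
    Ok(response) => response,
    Err(_) => client.call_worker("infer", request, Some("role=cpu-worker")).await?,
};
```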
### ❌ When NOT to Use Clusters
**Single Node Deployments**
- If you only have one server, use direct RPC instead
- Cluster overhead isn't justified
**Strict Consistency Requirements**
- SWIM provides eventual consistency
- Not suitable for strong consistency needs (use consensus protocols like Raft)
**Low-Latency Single-Hop**
- Direct RPC is faster for single client-server communication
- Cluster adds minimal overhead, but every bit counts for ultra-low latency
## Cluster Modes
RpcNet supports different cluster deployment patterns:
### 1. Coordinator-Worker Pattern
One or more coordinator nodes route requests to worker nodes:
```text
        ┌──────────────┐
        │ Coordinator  │
        │  (Director)  │
        └──────┬───────┘
               │
   ┌───────────┼───────────┐
   │           │           │
┌──▼────┐  ┌───▼───┐   ┌───▼───┐
│Worker │  │Worker │   │Worker │
└───────┘  └───────┘   └───────┘
```
Use when:
- Clients don't need to track worker pool
- Centralized routing and monitoring
- Example: Load balancer + worker pool
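In practice the coordinator is usually just an RPC handler that owns a `ClusterClient` and forwards work. The sketch below illustrates the shape; the handler signature and the byte-vector request/response types are placeholders, and only `call_worker` comes from the API shown earlier.

```rust
use std::sync::Arc;

use rpcnet::cluster::ClusterClient;

// Hypothetical coordinator handler -- types and error handling are
// placeholders, not an RpcNet-defined interface.
async fn handle_compute(
    client: Arc<ClusterClient>,
    request: Vec<u8>,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // The WorkerRegistry behind the client picks a healthy worker using the
    // configured load-balancing strategy; the coordinator never needs to
    // track individual worker addresses.
    let response = client.call_worker("compute", request, Some("role=worker")).await?;
    Ok(response)
}
```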
### 2. Peer-to-Peer Pattern
All nodes are equal and can route to each other:
```text
┌──────┐      ┌──────┐
│ Node │──────│ Node │
└───┬──┘      └──┬───┘
    │            │
    └─────┬──────┘
          │
      ┌───▼───┐
      │ Node  │
      └───────┘
```
Use when:
- No single point of coordination needed
- Nodes serve both as clients and servers
- Example: Distributed cache, gossip-based database
### 3. Hierarchical Pattern
Multiple layers with different roles:
```text
            ┌────────┐
            │ Master │
            └───┬────┘
                │
        ┌───────┴───────┐
        │               │
    ┌───▼───┐       ┌───▼───┐
    │Region │       │Region │
    │Leader │       │Leader │
    └───┬───┘       └───┬───┘
        │               │
    ┌───▼───┐       ┌───▼───┐
    │Worker │       │Worker │
    └───────┘       └───────┘
```
Use when:
- Multi-region deployments
- Different node tiers (leaders, workers, storage)
- Example: Global CDN, multi-tenant systems
## Performance Characteristics
RpcNet clusters maintain high performance while providing distributed coordination:
### Throughput
- 172K+ requests/second in benchmarks
- Minimal overhead compared to direct RPC
- Scales linearly with number of workers
### Latency
- < 0.1ms additional latency for load balancing
- Efficient connection handling reduces overhead
- QUIC's 0-RTT mode for warm connections
### Scalability
- Tested with 1000+ nodes in gossip cluster
- Sub-linear gossip overhead (O(log N) per node)
- Configurable gossip intervals for tuning
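As an example of that tuning, the trade-off usually comes down to a couple of interval knobs. The field names below are assumptions for illustration, not the actual configuration struct; shorter intervals detect failures faster at the cost of more gossip traffic per node.

```rust
use std::time::Duration;

// Hypothetical gossip tuning -- field names are assumptions, not the real
// rpcnet::cluster config fields.
let config = ClusterConfig {
    gossip_interval: Duration::from_millis(500), // how often each node gossips
    probe_timeout: Duration::from_millis(200),   // how long a direct probe may take
    ..Default::default()
};
```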
### Resource Usage
- Low memory: ~10KB per tracked node
- Low CPU: < 1% for gossip maintenance
- Low network: ~1KB/s per node for gossip
## Next Steps
Now that you understand the cluster architecture, you can:
- Follow the Tutorial - Build your first cluster step-by-step
- Learn About Discovery - Deep dive into SWIM gossip protocol
- Explore Load Balancing - Choose the right strategy
- Understand Health Checking - How Phi Accrual works
- Handle Failures - Partition detection and recovery
Or jump directly to the Cluster Example to see a complete working system.