Performance Tuning

RpcNet achieves 172,000+ requests/second with proper configuration. This chapter provides concrete tips and techniques to maximize performance in production deployments.

Baseline Performance

Out-of-the-box performance with default settings:

| Metric | Value | Notes |
|---|---|---|
| Throughput | 130K-150K RPS | Single director + 3 workers |
| Latency (P50) | 0.5-0.8ms | With efficient connection handling |
| Latency (P99) | 2-5ms | Under moderate load |
| CPU (per node) | 40-60% | At peak throughput |
| Memory | 50-100MB | Per worker node |

Target after tuning: 172K+ RPS, < 0.5ms P50 latency, < 35% CPU

Quick Wins

1. Optimize Connection Management

Impact: Significant throughput increase, reduced latency

use rpcnet::cluster::ClusterClientConfig;

// Use the built-in connection optimization
let config = ClusterClientConfig::default();

Why it works:

  • Efficient connection reuse
  • Reduces handshake overhead
  • Minimizes connection setup time (see the usage sketch below)
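
The main thing to get right on the client side is sharing one client across tasks rather than creating one per request. A minimal sketch, assuming a ClusterClient::new constructor (check the crate docs for the exact signature):

use std::sync::Arc;
use rpcnet::cluster::{ClusterClient, ClusterClientConfig};

// Hypothetical constructor; the exact API may differ.
let client = Arc::new(ClusterClient::new(ClusterClientConfig::default()).await?);

// A single shared client lets connections be reused across tasks.
for _ in 0..4 {
    let client = client.clone();
    tokio::spawn(async move {
        let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
    });
}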

2. Use Least Connections Load Balancing

Impact: 15-20% throughput increase under variable load

use rpcnet::cluster::{WorkerRegistry, LoadBalancingStrategy};

// Before (Round Robin): uneven load distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::RoundRobin);

// After (Least Connections): optimal distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::LeastConnections);

Why it works:

  • Prevents overloading individual workers
  • Adapts to actual load in real-time
  • Handles heterogeneous workers better (see the selection sketch below)
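
Whichever strategy is configured, call sites stay the same; the registry applies the policy internally. A minimal selection sketch using the same API that appears later in this chapter:

// The configured strategy is applied inside select_worker; callers don't change.
let worker = registry.select_worker(Some("role=worker")).await?;
println!("routing to {}", worker.addr);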

3. Tune Gossip Interval

Impact: 10-15% CPU reduction, minimal latency impact

use std::time::Duration;
use rpcnet::cluster::ClusterConfig;

// Before (default 1s): higher CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(1));

// After (2s for stable networks): lower CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(2));

Why it works:

  • Gossip overhead scales with frequency
  • Stable networks don't need aggressive gossip
  • Failure detection still fast enough (4-8s)

4. Increase Worker Pool Size

Impact: Linear throughput scaling

// Before: 3 workers → 150K RPS
// After:  5 workers → 250K+ RPS
//
// Each worker adds roughly 50K RPS of capacity.

Guidelines:

  • Add workers until you hit a network or director bottleneck
  • Monitor director CPU; scale the director if it exceeds 80%
  • Ensure network bandwidth is sufficient (see the sizing sketch below)
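
For rough sizing, the ~50K RPS per-worker figure above can be folded into a back-of-envelope helper. A sketch only; measure your own per-worker capacity, since it is workload-dependent:

// Estimate worker count from target RPS, per-worker capacity, and headroom.
fn workers_needed(target_rps: f64, per_worker_rps: f64, headroom: f64) -> usize {
    ((target_rps * (1.0 + headroom)) / per_worker_rps).ceil() as usize
}

// Example: 250K RPS target at ~50K RPS per worker with 20% headroom → 6 workers.
let n = workers_needed(250_000.0, 50_000.0, 0.20);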

Detailed Tuning

Connection Management Optimization

RpcNet handles connection management automatically, but you can optimize for your specific use case:

use rpcnet::cluster::ClusterClientConfig;

// The default configuration is tuned for most workloads; start here
// and adjust only after profiling.
let config = ClusterClientConfig::default();

QUIC Tuning

Stream Limits

use rpcnet::ServerConfig;

let config = ServerConfig::builder()
    .with_max_concurrent_streams(100)             // more streams = higher throughput
    .with_max_stream_bandwidth(10 * 1024 * 1024)  // 10 MB/s per stream
    .build();

Guidelines:

  • max_concurrent_streams: set to expected concurrent requests + 20% (see the sizing sketch below)
  • max_stream_bandwidth: set based on your largest message size
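
The "+ 20%" rule is easy to encode. A minimal sketch, assuming peak concurrency is estimated from your load tests:

// Stream limit = expected peak concurrency plus 20% headroom.
fn stream_limit(expected_concurrent: u32) -> u32 {
    (expected_concurrent as f64 * 1.2).ceil() as u32
}

// Example: 80 expected concurrent requests → a limit of 96.
let limit = stream_limit(80);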

Congestion Control

// Aggressive (high-bandwidth networks)
.with_congestion_control(CongestionControl::Cubic)

// Conservative (variable networks)
.with_congestion_control(CongestionControl::NewReno)

// Recommended default: best overall
.with_congestion_control(CongestionControl::Bbr)

TLS Optimization

Session Resumption

// Enable TLS session tickets for 0-RTT resumption
let config = ServerConfig::builder()
    .with_cert_and_key(cert, key)?
    .with_session_tickets_enabled(true)  // enables 0-RTT
    .build();

Impact: First request after reconnect goes from 2-3 RTT to 0 RTT

Cipher Suite Selection

// Prefer fast ciphers (AES-GCM with hardware acceleration)
.with_cipher_suites(&[
    CipherSuite::TLS13_AES_128_GCM_SHA256,        // fast with AES-NI
    CipherSuite::TLS13_CHACHA20_POLY1305_SHA256,  // good for ARM
])

Message Serialization

Use Efficient Formats

// Fastest: bincode (compact binary)
let bytes = bincode::serialize(&data)?;

// Fast: rmp-serde (MessagePack)
let bytes = rmp_serde::to_vec(&data)?;

// Slowest of the three: serde_json (human-readable)
let bytes = serde_json::to_vec(&data)?;

Benchmark (10KB struct):

| Format | Serialize | Deserialize | Size |
|---|---|---|---|
| bincode | 12 μs | 18 μs | 10240 bytes |
| MessagePack | 28 μs | 35 μs | 9800 bytes |
| JSON | 85 μs | 120 μs | 15300 bytes |
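
Figures like these vary with hardware and struct layout, so reproduce them on your own types. A minimal timing sketch for the serialize column (wall-clock averaging, not a rigorous benchmark harness):

use std::time::{Duration, Instant};
use serde::Serialize;

// Average serialization time over `iters` runs.
fn time_serialize<T: Serialize>(data: &T, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let _ = bincode::serialize(data).expect("serialize failed");
    }
    start.elapsed() / iters
}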

Minimize Allocations

// ❌ Bad: multiple allocations per request
fn build_request(id: u64, data: Vec<u8>) -> Request {
    Request {
        id: id.to_string(),  // allocates a String
        timestamp: SystemTime::now(),
        payload: format!("data-{}", String::from_utf8_lossy(&data)).into_bytes(),  // several allocations
    }
}

// ✅ Good: store the numeric id and reuse a caller-owned buffer
fn build_request(id: u64, data: &[u8], buffer: &mut Vec<u8>) -> Request {
    buffer.clear();
    buffer.extend_from_slice(b"data-");
    buffer.extend_from_slice(data);

    Request {
        id,  // numeric id kept as-is; no allocation
        timestamp: SystemTime::now(),
        payload: buffer.clone(),  // single allocation
    }
}
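
Usage follows the usual buffer-reuse pattern: allocate once, reuse per iteration. A short hypothetical loop, assuming items yields (u64, Vec<u8>) pairs:

// One buffer amortized across many requests.
let mut buf = Vec::with_capacity(1024);
for (id, data) in items {
    let req = build_request(id, &data, &mut buf);
    // send req ...
}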

Platform-Specific Optimizations

Linux

UDP/QUIC Tuning

# Increase network buffer sizes
sudo sysctl -w net.core.rmem_max=536870912
sudo sysctl -w net.core.wmem_max=536870912
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 536870912'
sudo sysctl -w net.ipv4.tcp_wmem='4096 87380 536870912'

# Increase the packet backlog so bursty UDP receive paths don't drop (QUIC uses UDP)
sudo sysctl -w net.core.netdev_max_backlog=5000

# Increase connection tracking
sudo sysctl -w net.netfilter.nf_conntrack_max=1000000

# Make permanent: add to /etc/sysctl.conf

CPU Affinity

// Pin the current thread to a specific CPU core.
fn pin_to_core(core_id: usize) {
    let core_ids = core_affinity::get_core_ids().unwrap();
    core_affinity::set_for_current(core_ids[core_id]);
}

// Usage in worker startup
tokio::task::spawn_blocking(|| {
    pin_to_core(0);  // pin to CPU 0
    // worker processing logic
});

macOS

Increase File Descriptors

# Check current limits
ulimit -n

# Increase (temporary)
ulimit -n 65536

# Make permanent: add to ~/.zshrc or ~/.bash_profile
echo "ulimit -n 65536" >> ~/.zshrc

Profiling and Monitoring

CPU Profiling

# Install perf (Linux)
sudo apt install linux-tools-common linux-tools-generic

# Profile RpcNet application
sudo perf record -F 99 -a -g -- cargo run --release --bin worker
sudo perf report

# Identify hot paths and optimize

Memory Profiling

# Use valgrind for memory analysis
cargo build --release
valgrind --tool=massif --massif-out-file=massif.out ./target/release/worker

# Inspect the profile: ms_print for text output, massif-visualizer for a GUI
ms_print massif.out

Tokio Console

# Add to Cargo.toml
[dependencies]
console-subscriber = "0.2"

// In main.rs
console_subscriber::init();

// Run the application, then attach with tokio-console:
//   cargo install tokio-console
//   tokio-console

Benchmarking

Throughput Test

use std::sync::Arc;
use std::time::Instant;
use rpcnet::cluster::ClusterClient;

// Sequential calls measure single-caller throughput; use the concurrent
// load test below for aggregate cluster RPS.
async fn benchmark_throughput(client: Arc<ClusterClient>, duration_secs: u64) {
    let start = Instant::now();
    let mut count = 0u64;

    while start.elapsed().as_secs() < duration_secs {
        match client.call_worker("compute", vec![], Some("role=worker")).await {
            Ok(_) => count += 1,
            Err(e) => eprintln!("Request failed: {}", e),
        }
    }

    let elapsed = start.elapsed().as_secs_f64();
    let rps = count as f64 / elapsed;

    println!("Throughput: {:.0} requests/second", rps);
    println!("Total requests: {}", count);
    println!("Duration: {:.2}s", elapsed);
}

Latency Test

use std::sync::Arc;
use std::time::Instant;
use hdrhistogram::Histogram;
use rpcnet::cluster::ClusterClient;

async fn benchmark_latency(client: Arc<ClusterClient>, num_requests: usize) {
    // 3 significant digits of precision is plenty for microsecond latencies.
    let mut histogram = Histogram::<u64>::new(3).unwrap();

    for _ in 0..num_requests {
        let start = Instant::now();
        let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
        let latency_us = start.elapsed().as_micros() as u64;
        histogram.record(latency_us).unwrap();
    }

    println!("Latency percentiles (μs):");
    println!("  P50:   {}", histogram.value_at_quantile(0.50));
    println!("  P90:   {}", histogram.value_at_quantile(0.90));
    println!("  P99:   {}", histogram.value_at_quantile(0.99));
    println!("  P99.9: {}", histogram.value_at_quantile(0.999));
    println!("  Max:   {}", histogram.max());
}

Load Test Script

use std::sync::Arc;
use std::time::Instant;
use rpcnet::cluster::ClusterClient;

// Concurrent load test: num_concurrent tasks each issue requests_per_task calls.
async fn load_test(
    client: Arc<ClusterClient>,
    num_concurrent: usize,
    requests_per_task: usize,
) {
    let start = Instant::now();

    let tasks: Vec<_> = (0..num_concurrent)
        .map(|_| {
            let client = client.clone();
            tokio::spawn(async move {
                for _ in 0..requests_per_task {
                    let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
                }
            })
        })
        .collect();

    for task in tasks {
        task.await.unwrap();
    }

    let elapsed = start.elapsed().as_secs_f64();
    let total_requests = num_concurrent * requests_per_task;
    let rps = total_requests as f64 / elapsed;

    println!("Load test results:");
    println!("  Concurrency: {}", num_concurrent);
    println!("  Total requests: {}", total_requests);
    println!("  Duration: {:.2}s", elapsed);
    println!("  Throughput: {:.0} RPS", rps);
}
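
A hypothetical driver can run the three harnesses in sequence from an async main; the parameter values below are illustrative only:

// Example invocation.
benchmark_throughput(client.clone(), 30).await;   // 30-second throughput run
benchmark_latency(client.clone(), 10_000).await;  // 10K-sample latency histogram
load_test(client, 100, 1_000).await;              // 100 tasks × 1,000 requests each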

Performance Checklist

Before Production

  • Use default connection management (already optimized)
  • Use Least Connections load balancing
  • Tune gossip interval for your network
  • Configure QUIC stream limits
  • Enable TLS session resumption
  • Profile with release build (--release)
  • Test under expected peak load
  • Monitor CPU, memory, network utilization
  • Set up latency tracking (P50, P99, P99.9)
  • Configure OS-level network tuning

Monitoring in Production

// Essential metrics to track
metrics::gauge!("rpc.throughput_rps", current_rps);
metrics::gauge!("rpc.latency_p50_us", latency_p50);
metrics::gauge!("rpc.latency_p99_us", latency_p99);
metrics::gauge!("rpc.cpu_usage_pct", cpu_usage);
metrics::gauge!("rpc.memory_mb", memory_mb);
metrics::gauge!("pool.hit_rate", pool_hit_rate);
metrics::gauge!("cluster.healthy_workers", healthy_count);
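
Where a value like current_rps comes from is up to your application; one simple approach is an atomic counter sampled once per second. A sketch with an assumed REQUESTS counter (the wiring into your request path is hypothetical):

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical global counter, incremented once per completed request.
static REQUESTS: AtomicU64 = AtomicU64::new(0);

tokio::spawn(async {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        // Swap-and-reset yields the requests completed in the last second.
        let rps = REQUESTS.swap(0, Ordering::Relaxed);
        metrics::gauge!("rpc.throughput_rps", rps as f64);
    }
});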

Troubleshooting Performance Issues

High Latency

Symptoms: P99 latency > 10ms

Debug:

use std::time::Instant;

// Add timing to identify the bottleneck
let start = Instant::now();

let select_time = Instant::now();
let worker = registry.select_worker(Some("role=worker")).await?;
println!("Worker selection: {:?}", select_time.elapsed());

let connect_time = Instant::now();
let conn = pool.get_or_connect(worker.addr).await?;
println!("Connection: {:?}", connect_time.elapsed());

let call_time = Instant::now();
let result = conn.call("compute", data).await?;
println!("RPC call: {:?}", call_time.elapsed());

println!("Total: {:?}", start.elapsed());

Common causes:

  • Connection management issues (check network configuration)
  • Slow workers (check worker CPU/memory)
  • Network latency (move closer or add local workers)

Low Throughput

Symptoms: < 100K RPS with multiple workers

Debug:

// Check for bottlenecks
println!("Pool metrics: {:?}", pool.metrics());
println!("Worker count: {}", registry.worker_count().await);
println!("Active connections: {}", pool.active_connections());

Common causes:

  • Too few workers (add more)
  • Network connectivity issues (check network configuration)
  • Director CPU saturated (scale director)
  • Network bandwidth limit (upgrade network)

High CPU Usage

Symptoms: > 80% CPU at low load

Debug:

# Profile with perf
sudo perf record -F 99 -a -g -- cargo run --release
sudo perf report

# Look for hot functions

Common causes:

  • Too frequent gossip (increase interval)
  • Excessive serialization (optimize message format)
  • Inefficient connection handling (use latest RpcNet version)
  • Debug build instead of release

Real-World Results

Case Study: Video Transcoding Cluster

Setup:

  • 1 director
  • 10 GPU workers
  • 1000 concurrent clients

Before tuning: 45K RPS, 15ms P99 latency
After tuning: 180K RPS, 2ms P99 latency

Changes:

  1. Used optimized connection management
  2. Tuned gossip interval (1s → 2s)
  3. Used Least Connections strategy
  4. Optimized message serialization (JSON → bincode)
