Performance Tuning

RpcNet achieves 172,000+ requests/second with proper configuration. This chapter provides concrete tips and techniques to maximize performance in production deployments.

Baseline Performance

Out-of-the-box performance with default settings:

| Metric | Value | Notes |
|---|---|---|
| Throughput | 130K-150K RPS | Single director + 3 workers |
| Latency (P50) | 0.5-0.8ms | With efficient connection handling |
| Latency (P99) | 2-5ms | Under moderate load |
| CPU (per node) | 40-60% | At peak throughput |
| Memory | 50-100MB | Per worker node |

Target after tuning: 172K+ RPS, < 0.5ms P50 latency, < 35% CPU

Quick Wins

1. Optimize Connection Management

Impact: Significant throughput increase, reduced latency

use rpcnet::cluster::ClusterClientConfig;

// Use the built-in connection optimization
let config = ClusterClientConfig::default();

Why it works:

  • Efficient connection reuse
  • Reduces handshake overhead
  • Minimizes connection setup time (see the usage sketch below)
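
The main thing to get right on the client side is sharing one client across tasks rather than creating one per request. A minimal sketch, assuming a ClusterClient::new constructor (check the crate docs for the exact signature):

use std::sync::Arc;
use rpcnet::cluster::{ClusterClient, ClusterClientConfig};

// Hypothetical constructor; the exact API may differ.
let client = Arc::new(ClusterClient::new(ClusterClientConfig::default()).await?);

// A single shared client lets connections be reused across tasks.
for _ in 0..4 {
    let client = client.clone();
    tokio::spawn(async move {
        let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
    });
}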

2. Use Least Connections Load Balancing

Impact: 15-20% throughput increase under variable load

use rpcnet::cluster::{WorkerRegistry, LoadBalancingStrategy};

// Before (Round Robin): uneven load distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::RoundRobin);

// After (Least Connections): optimal distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::LeastConnections);

Why it works:

  • Prevents overloading individual workers
  • Adapts to actual load in real-time
  • Handles heterogeneous workers better (see the selection sketch below)
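
Whichever strategy is configured, call sites stay the same; the registry applies the policy internally. A minimal selection sketch using the same API that appears later in this chapter:

// The configured strategy is applied inside select_worker; callers don't change.
let worker = registry.select_worker(Some("role=worker")).await?;
println!("routing to {}", worker.addr);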

3. Tune Gossip Interval

Impact: 10-15% CPU reduction, minimal latency impact

use std::time::Duration;
use rpcnet::cluster::ClusterConfig;

// Before (default 1s): higher CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(1));

// After (2s for stable networks): lower CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(2));

Why it works:

  • Gossip overhead scales with frequency
  • Stable networks don't need aggressive gossip
  • Failure detection still fast enough (4-8s)

4. Increase Worker Pool Size

Impact: Linear throughput scaling

// Before: 3 workers → 150K RPS
// After:  5 workers → 250K+ RPS
//
// Each worker adds roughly 50K RPS of capacity.

Guidelines:

  • Add workers until you hit a network or director bottleneck
  • Monitor director CPU; scale the director if it exceeds 80%
  • Ensure network bandwidth is sufficient (see the sizing sketch below)
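
For rough sizing, the ~50K RPS per-worker figure above can be folded into a back-of-envelope helper. A sketch only; measure your own per-worker capacity, since it is workload-dependent:

// Estimate worker count from target RPS, per-worker capacity, and headroom.
fn workers_needed(target_rps: f64, per_worker_rps: f64, headroom: f64) -> usize {
    ((target_rps * (1.0 + headroom)) / per_worker_rps).ceil() as usize
}

// Example: 250K RPS target at ~50K RPS per worker with 20% headroom → 6 workers.
let n = workers_needed(250_000.0, 50_000.0, 0.20);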

Detailed Tuning

Connection Management Optimization

RpcNet handles connection management automatically, but you can optimize for your specific use case:

use rpcnet::cluster::ClusterClientConfig;

// The default configuration is tuned for most workloads; start here
// and adjust only after profiling.
let config = ClusterClientConfig::default();

QUIC Tuning

Stream Limits

use rpcnet::ServerConfig;

let config = ServerConfig::builder()
    .with_max_concurrent_streams(100)             // more streams = higher throughput
    .with_max_stream_bandwidth(10 * 1024 * 1024)  // 10 MB/s per stream
    .build();

Guidelines:

  • max_concurrent_streams: set to expected concurrent requests + 20% (see the sizing sketch below)
  • max_stream_bandwidth: set based on your largest message size
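
The "+ 20%" rule is easy to encode. A minimal sketch, assuming peak concurrency is estimated from your load tests:

// Stream limit = expected peak concurrency plus 20% headroom.
fn stream_limit(expected_concurrent: u32) -> u32 {
    (expected_concurrent as f64 * 1.2).ceil() as u32
}

// Example: 80 expected concurrent requests → a limit of 96.
let limit = stream_limit(80);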

Congestion Control

// Aggressive (high-bandwidth networks)
.with_congestion_control(CongestionControl::Cubic)

// Conservative (variable networks)
.with_congestion_control(CongestionControl::NewReno)

// Recommended default: best overall
.with_congestion_control(CongestionControl::Bbr)

TLS Optimization

Session Resumption

// Enable TLS session tickets for 0-RTT resumption
let config = ServerConfig::builder()
    .with_cert_and_key(cert, key)?
    .with_session_tickets_enabled(true)  // enables 0-RTT
    .build();

Impact: First request after reconnect goes from 2-3 RTT to 0 RTT

Cipher Suite Selection

// Prefer fast ciphers (AES-GCM with hardware acceleration)
.with_cipher_suites(&[
    CipherSuite::TLS13_AES_128_GCM_SHA256,        // fast with AES-NI
    CipherSuite::TLS13_CHACHA20_POLY1305_SHA256,  // good for ARM
])

Message Serialization

Use Efficient Formats

// Fastest: bincode (compact binary)
let bytes = bincode::serialize(&data)?;

// Fast: rmp-serde (MessagePack)
let bytes = rmp_serde::to_vec(&data)?;

// Slowest of the three: serde_json (human-readable)
let bytes = serde_json::to_vec(&data)?;

Benchmark (10KB struct):

| Format | Serialize | Deserialize | Size |
|---|---|---|---|
| bincode | 12 μs | 18 μs | 10240 bytes |
| MessagePack | 28 μs | 35 μs | 9800 bytes |
| JSON | 85 μs | 120 μs | 15300 bytes |
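
Figures like these vary with hardware and struct layout, so reproduce them on your own types. A minimal timing sketch for the serialize column (wall-clock averaging, not a rigorous benchmark harness):

use std::time::{Duration, Instant};
use serde::Serialize;

// Average serialization time over `iters` runs.
fn time_serialize<T: Serialize>(data: &T, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let _ = bincode::serialize(data).expect("serialize failed");
    }
    start.elapsed() / iters
}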

Minimize Allocations

// ❌ Bad: multiple allocations per request
fn build_request(id: u64, data: Vec<u8>) -> Request {
    Request {
        id: id.to_string(),  // allocates a String
        timestamp: SystemTime::now(),
        payload: format!("data-{}", String::from_utf8_lossy(&data)).into_bytes(),  // several allocations
    }
}

// ✅ Good: store the numeric id and reuse a caller-owned buffer
fn build_request(id: u64, data: &[u8], buffer: &mut Vec<u8>) -> Request {
    buffer.clear();
    buffer.extend_from_slice(b"data-");
    buffer.extend_from_slice(data);

    Request {
        id,  // numeric id kept as-is; no allocation
        timestamp: SystemTime::now(),
        payload: buffer.clone(),  // single allocation
    }
}
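
Usage follows the usual buffer-reuse pattern: allocate once, reuse per iteration. A short hypothetical loop, assuming items yields (u64, Vec<u8>) pairs:

// One buffer amortized across many requests.
let mut buf = Vec::with_capacity(1024);
for (id, data) in items {
    let req = build_request(id, &data, &mut buf);
    // send req ...
}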

Platform-Specific Optimizations

Linux

UDP/QUIC Tuning

# Increase network buffer sizes
sudo sysctl -w net.core.rmem_max=536870912
sudo sysctl -w net.core.wmem_max=536870912
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 536870912'
sudo sysctl -w net.ipv4.tcp_wmem='4096 87380 536870912'

# Increase the packet backlog so bursty UDP receive paths don't drop (QUIC uses UDP)
sudo sysctl -w net.core.netdev_max_backlog=5000

# Increase connection tracking
sudo sysctl -w net.netfilter.nf_conntrack_max=1000000

# Make permanent: add to /etc/sysctl.conf

CPU Affinity

// Pin the current thread to a specific CPU core.
fn pin_to_core(core_id: usize) {
    let core_ids = core_affinity::get_core_ids().unwrap();
    core_affinity::set_for_current(core_ids[core_id]);
}

// Usage in worker startup
tokio::task::spawn_blocking(|| {
    pin_to_core(0);  // pin to CPU 0
    // worker processing logic
});

macOS

Increase File Descriptors

# Check current limits
ulimit -n

# Increase (temporary)
ulimit -n 65536

# Make permanent: add to ~/.zshrc or ~/.bash_profile
echo "ulimit -n 65536" >> ~/.zshrc

Profiling and Monitoring

CPU Profiling

# Install perf (Linux)
sudo apt install linux-tools-common linux-tools-generic

# Profile RpcNet application
sudo perf record -F 99 -a -g -- cargo run --release --bin worker
sudo perf report

# Identify hot paths and optimize

Memory Profiling

# Use valgrind for memory analysis
cargo build --release
valgrind --tool=massif --massif-out-file=massif.out ./target/release/worker

# Inspect the profile: ms_print for text output, massif-visualizer for a GUI
ms_print massif.out

Tokio Console

# Add to Cargo.toml
[dependencies]
console-subscriber = "0.2"

// In main.rs
console_subscriber::init();

// Run the application, then attach with tokio-console:
//   cargo install tokio-console
//   tokio-console

Benchmarking

Throughput Test

use std::sync::Arc;
use std::time::Instant;
use rpcnet::cluster::ClusterClient;

// Sequential calls measure single-caller throughput; use the concurrent
// load test below for aggregate cluster RPS.
async fn benchmark_throughput(client: Arc<ClusterClient>, duration_secs: u64) {
    let start = Instant::now();
    let mut count = 0u64;

    while start.elapsed().as_secs() < duration_secs {
        match client.call_worker("compute", vec![], Some("role=worker")).await {
            Ok(_) => count += 1,
            Err(e) => eprintln!("Request failed: {}", e),
        }
    }

    let elapsed = start.elapsed().as_secs_f64();
    let rps = count as f64 / elapsed;

    println!("Throughput: {:.0} requests/second", rps);
    println!("Total requests: {}", count);
    println!("Duration: {:.2}s", elapsed);
}

Latency Test

use std::sync::Arc;
use std::time::Instant;
use hdrhistogram::Histogram;
use rpcnet::cluster::ClusterClient;

async fn benchmark_latency(client: Arc<ClusterClient>, num_requests: usize) {
    // 3 significant digits of precision is plenty for microsecond latencies.
    let mut histogram = Histogram::<u64>::new(3).unwrap();

    for _ in 0..num_requests {
        let start = Instant::now();
        let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
        let latency_us = start.elapsed().as_micros() as u64;
        histogram.record(latency_us).unwrap();
    }

    println!("Latency percentiles (μs):");
    println!("  P50:   {}", histogram.value_at_quantile(0.50));
    println!("  P90:   {}", histogram.value_at_quantile(0.90));
    println!("  P99:   {}", histogram.value_at_quantile(0.99));
    println!("  P99.9: {}", histogram.value_at_quantile(0.999));
    println!("  Max:   {}", histogram.max());
}

Load Test Script

use std::sync::Arc;
use std::time::Instant;
use rpcnet::cluster::ClusterClient;

// Concurrent load test: num_concurrent tasks each issue requests_per_task calls.
async fn load_test(
    client: Arc<ClusterClient>,
    num_concurrent: usize,
    requests_per_task: usize,
) {
    let start = Instant::now();

    let tasks: Vec<_> = (0..num_concurrent)
        .map(|_| {
            let client = client.clone();
            tokio::spawn(async move {
                for _ in 0..requests_per_task {
                    let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
                }
            })
        })
        .collect();

    for task in tasks {
        task.await.unwrap();
    }

    let elapsed = start.elapsed().as_secs_f64();
    let total_requests = num_concurrent * requests_per_task;
    let rps = total_requests as f64 / elapsed;

    println!("Load test results:");
    println!("  Concurrency: {}", num_concurrent);
    println!("  Total requests: {}", total_requests);
    println!("  Duration: {:.2}s", elapsed);
    println!("  Throughput: {:.0} RPS", rps);
}
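
A hypothetical driver can run the three harnesses in sequence from an async main; the parameter values below are illustrative only:

// Example invocation.
benchmark_throughput(client.clone(), 30).await;   // 30-second throughput run
benchmark_latency(client.clone(), 10_000).await;  // 10K-sample latency histogram
load_test(client, 100, 1_000).await;              // 100 tasks × 1,000 requests each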

Performance Checklist

Before Production

  • Use default connection management (already optimized)
  • Use Least Connections load balancing
  • Tune gossip interval for your network
  • Configure QUIC stream limits
  • Enable TLS session resumption
  • Profile with release build (--release)
  • Test under expected peak load
  • Monitor CPU, memory, network utilization
  • Set up latency tracking (P50, P99, P99.9)
  • Configure OS-level network tuning

Monitoring in Production

// Essential metrics to track
metrics::gauge!("rpc.throughput_rps", current_rps);
metrics::gauge!("rpc.latency_p50_us", latency_p50);
metrics::gauge!("rpc.latency_p99_us", latency_p99);
metrics::gauge!("rpc.cpu_usage_pct", cpu_usage);
metrics::gauge!("rpc.memory_mb", memory_mb);
metrics::gauge!("pool.hit_rate", pool_hit_rate);
metrics::gauge!("cluster.healthy_workers", healthy_count);
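
Where a value like current_rps comes from is up to your application; one simple approach is an atomic counter sampled once per second. A sketch with an assumed REQUESTS counter (the wiring into your request path is hypothetical):

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical global counter, incremented once per completed request.
static REQUESTS: AtomicU64 = AtomicU64::new(0);

tokio::spawn(async {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        // Swap-and-reset yields the requests completed in the last second.
        let rps = REQUESTS.swap(0, Ordering::Relaxed);
        metrics::gauge!("rpc.throughput_rps", rps as f64);
    }
});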

Troubleshooting Performance Issues

High Latency

Symptoms: P99 latency > 10ms

Debug:

use std::time::Instant;

// Add timing to identify the bottleneck
let start = Instant::now();

let select_time = Instant::now();
let worker = registry.select_worker(Some("role=worker")).await?;
println!("Worker selection: {:?}", select_time.elapsed());

let connect_time = Instant::now();
let conn = pool.get_or_connect(worker.addr).await?;
println!("Connection: {:?}", connect_time.elapsed());

let call_time = Instant::now();
let result = conn.call("compute", data).await?;
println!("RPC call: {:?}", call_time.elapsed());

println!("Total: {:?}", start.elapsed());

Common causes:

  • Connection management issues (check network configuration)
  • Slow workers (check worker CPU/memory)
  • Network latency (move closer or add local workers)

Low Throughput

Symptoms: < 100K RPS with multiple workers

Debug:

// Check for bottlenecks
println!("Pool metrics: {:?}", pool.metrics());
println!("Worker count: {}", registry.worker_count().await);
println!("Active connections: {}", pool.active_connections());

Common causes:

  • Too few workers (add more)
  • Network connectivity issues (check network configuration)
  • Director CPU saturated (scale director)
  • Network bandwidth limit (upgrade network)

High CPU Usage

Symptoms: > 80% CPU at low load

Debug:

# Profile with perf
sudo perf record -F 99 -a -g -- cargo run --release
sudo perf report

# Look for hot functions

Common causes:

  • Too frequent gossip (increase interval)
  • Excessive serialization (optimize message format)
  • Inefficient connection handling (use latest RpcNet version)
  • Debug build instead of release

Real-World Results

Case Study: Video Transcoding Cluster

Setup:

  • 1 director
  • 10 GPU workers
  • 1000 concurrent clients

Before tuning: 45K RPS, 15ms P99 latency
After tuning: 180K RPS, 2ms P99 latency

Changes:

  1. Used optimized connection management
  2. Tuned gossip interval (1s → 2s)
  3. Used Least Connections strategy
  4. Optimized message serialization (JSON → bincode)
