Rust-Native LLM Inference Server Matches vLLM Performance
rvLLM, a from-scratch Rust rewrite of vLLM released March 27, 2026, achieves near-parity performance with the popular Python-based inference server while launching about 5 seconds faster. The project, which has gained 328 GitHub stars, eliminates Python from the serving hot path and provides explicit control over kernels, memory allocation, and startup behavior.
Performance Benchmarks Show 0.99x Parity at High Concurrency
Benchmarks on H100 SXM 80GB with Qwen2.5-7B (f16) demonstrate competitive performance:
- Direct Engine (256 output tokens, N=32 concurrency): rvLLM achieves 3,170 tok/s vs vLLM's 3,197 tok/s (0.99x parity)
- HTTP Serving (200 requests, concurrency 32, max_tokens=256): rvLLM reaches 3,419.4 tok/s with 2,685.2 ms average latency
- Launch-to-finished lifecycle: rvLLM completes in 30.33s vs vLLM's 35.51s (5.18s faster)
Performance ratios improve at higher concurrency levels, ranging from 0.79-0.91x at lower loads to near-parity at 32+ concurrent requests.
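As a quick sanity check, the headline figures above can be recomputed from the reported throughputs and lifecycle times. This is pure arithmetic on the benchmark numbers, not rvLLM code:

```rust
// Recompute the parity ratio and lifecycle gap from the benchmark numbers.
fn parity(candidate_tok_s: f64, baseline_tok_s: f64) -> f64 {
    candidate_tok_s / baseline_tok_s
}

fn main() {
    // Direct engine, N=32 concurrency, 256 output tokens.
    let ratio = parity(3170.0, 3197.0);
    assert!((ratio - 0.99).abs() < 0.01); // ~0.99x parity

    // Launch-to-finished lifecycle: 35.51s (vLLM) vs 30.33s (rvLLM).
    let gap_s = 35.51 - 30.33;
    println!("parity {:.2}x, lifecycle gap {:.2}s", ratio, gap_s);
}
```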
Rust Stack Enables Tighter Resource Control
By building the entire stack in Rust—server, worker, scheduler, and kernels—rvLLM avoids serialization bottlenecks and Global Interpreter Lock (GIL) overhead. Key technical features include:
- Memory Control: Reserve-based VRAM sizing with explicit --gpu-memory-reserve-gb, --num-gpu-blocks, and --num-cpu-blocks flags
- Kernel Discipline: 54 validated CUDA kernels with multiple decode paths (FusedDecode, cuBLAS GEMV, megakernel, persistent, FP8)
- JIT Fused Kernels: A Rust PTX emitter generates shape-specialized kernels that run 2-7.5x faster than hand-written CUDA on measured decode operations
- CUDA Graphs: Continuous CUDA graph replay with 35 pre-captured batch sizes and paged KV cache
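The reserve-based sizing flags lend themselves to a back-of-the-envelope block count. In the sketch below, the model shapes (28 layers, 4 KV heads, head dim 128) match Qwen2.5-7B's published config, but the weight footprint, 16-token block size, and the formula itself are illustrative assumptions, not rvLLM's actual allocator:

```rust
// Back-of-the-envelope paged KV-cache sizing, mirroring the idea behind
// --gpu-memory-reserve-gb / --num-gpu-blocks. Illustrative only.
const GIB: u64 = 1 << 30;

/// Bytes for one KV-cache block: K and V planes across all layers.
fn kv_block_bytes(layers: u64, kv_heads: u64, head_dim: u64,
                  block_tokens: u64, dtype_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * block_tokens * dtype_bytes
}

fn main() {
    // Qwen2.5-7B-style shapes: 28 layers, 4 KV heads, head dim 128;
    // f16 KV cache (2 bytes) in 16-token blocks.
    let block = kv_block_bytes(28, 4, 128, 16, 2);
    let total = 80 * GIB;   // H100 SXM 80GB
    let weights = 15 * GIB; // rough f16 footprint of a 7B model (assumed)
    let reserve = 4 * GIB;  // e.g. --gpu-memory-reserve-gb 4
    let num_gpu_blocks = (total - weights - reserve) / block;
    println!("{} bytes/block -> {} GPU blocks", block, num_gpu_blocks);
}
```

The point of an explicit reserve is that the remaining VRAM budget, and therefore the block count, is deterministic at startup rather than discovered by trial allocation.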
rTriton and CPU Optimizations Expand Platform Support
rvLLM includes experimental features expanding beyond GPU inference:
- rTriton: A Rust reimplementation of Triton with 30+ ops, 7 optimization passes, and cuBLAS integration—no Python dependency
- Zig SIMD backend: Accelerates sampling primitives with 6.31x argmax speedup and 1.44x softmax gains on Apple M5
- Quantization: FP8 weights and KV cache supported, with INT4 decode kernels ready
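The sampling primitives the Zig backend accelerates are simple reductions over the logit vector. The scalar Rust reference below shows only their semantics; the reported 6.31x and 1.44x gains come from vectorized Zig implementations, which this sketch does not reproduce:

```rust
/// Index of the largest logit (greedy sampling). `total_cmp` gives a
/// total order over f32, so NaNs cannot panic the comparison.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits.iter().enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Numerically stable softmax: subtract the max before exponentiating.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let m = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [1.0_f32, 3.5, 0.2, 3.4];
    assert_eq!(argmax(&logits), Some(1));
    let p = softmax(&logits);
    assert!((p.iter().sum::<f32>() - 1.0).abs() < 1e-6);
}
```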
Trade-offs vs Reference vLLM Implementation
rvLLM currently lacks some optimizations present in vLLM:
- GEMM tuning: vLLM uses Triton autotuned GEMMs while rvLLM relies on stock cuBLAS heuristics
- Attention optimization: vLLM's FlashAttention-3 implementation is more optimized
- Quantization breadth: vLLM supports GPTQ, AWQ, and Marlin formats beyond rvLLM's FP8 and not-yet-integrated INT4
Despite these gaps, rvLLM demonstrates that Rust-native serving can match vLLM's latency profile while offering safer memory semantics and faster lifecycle times for single-card, high-throughput deployments.
Key Takeaways
- rvLLM achieves 0.99x performance parity with vLLM at high concurrency (3,170 vs 3,197 tok/s) while launching 5 seconds faster
- Eliminates Python from serving hot path, building entire stack in Rust with 54 validated CUDA kernels and explicit memory control
- JIT fused kernels deliver 2-7.5x speedups over hand-written CUDA on measured decode operations
- Includes experimental rTriton (Rust Triton reimplementation) and Zig SIMD backend for CPU acceleration
- Gaps remain in GEMM tuning and quantization breadth compared to reference vLLM implementation