Researchers at Doubleword AI have introduced a method for compressing the KV cache in large language models that preserves exact values while reducing the memory footprint by up to roughly 4×. Published May 8, 2026 by Fergus Finn and featured on Hacker News with 103 points and 15 comments, the technique uses arithmetic coding with predictive models to achieve lossless compression.
Predictor Model Runs in Parallel to Enable Efficient Encoding
The method uses a cheaper predictor model running in parallel on both the encoding and decoding sides. The predictor generates per-scalar predictions (μ, σ) representing the expected KV cache values and their uncertainty. An arithmetic coder then encodes the true cache at a bitrate determined by prediction accuracy: the expected cost is the cross-entropy H(p, q) = H(p) + KL(p ∥ q) bits per symbol, where p is the true distribution of cache values, q is the predictor's distribution, and the KL divergence term is the overhead from imperfect predictions.
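The cross-entropy identity can be checked numerically. The following is a minimal sketch using two small discrete distributions chosen purely for illustration (they are not data from the research): p stands in for the true symbol distribution and q for a predictor that is less confident than it should be.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length in bits when symbols drawn from p are coded using q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits: the extra cost of coding with q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over four symbols (assumed, not from the source).
p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # predictor's distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(f"H(p)      = {h_p:.3f} bits")
print(f"KL(p||q)  = {kl:.3f} bits")
print(f"H(p, q)   = {h_pq:.3f} bits")
```

In this toy case the ideal coder would need 1.75 bits per symbol, and the imperfect predictor adds 0.25 bits of overhead for a total of 2 bits. A better predictor shrinks only the KL term; H(p) is the floor.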
The simplest implementation uses an FP8-quantized version of the target model itself as predictor, avoiding additional training overhead. Distribution modeling uses a three-component mixture that outperforms single Gaussians, combining standard Gaussians and empirical distributions to handle outliers effectively.
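To see why a mixture can beat a single Gaussian, consider the ideal code length, -log2 of the probability the model assigns to the value that actually occurred. The sketch below is an illustration under stated assumptions, not the authors' implementation: it treats cache values as integer symbols, uses a narrow Gaussian, a wider Gaussian, and a uniform floor as a stand-in for the empirical component, and picks the weights and widening factor arbitrarily.

```python
import math

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_prob_gaussian(k, mu, sigma):
    """Probability a Gaussian assigns to the integer bin [k - 0.5, k + 0.5)."""
    return gaussian_cdf(k + 0.5, mu, sigma) - gaussian_cdf(k - 0.5, mu, sigma)

def bin_prob_mixture(k, mu, sigma, n_symbols=256,
                     weights=(0.90, 0.09, 0.01), widen=8.0):
    """Hypothetical three-component mixture: narrow Gaussian, wide Gaussian,
    and a uniform floor over the symbol alphabet (all parameters assumed)."""
    w_narrow, w_wide, w_uniform = weights
    narrow = bin_prob_gaussian(k, mu, sigma)
    wide = bin_prob_gaussian(k, mu, sigma * widen)
    uniform = 1.0 / n_symbols
    return w_narrow * narrow + w_wide * wide + w_uniform * uniform

def code_length_bits(prob):
    """Ideal arithmetic-coding cost of a symbol with the given probability."""
    return -math.log2(prob)

mu, sigma = 100.0, 1.0          # predictor output for one scalar (illustrative)
typical, outlier = 100, 105     # true value on target vs. 5 sigma away

gauss_typical = code_length_bits(bin_prob_gaussian(typical, mu, sigma))
gauss_outlier = code_length_bits(bin_prob_gaussian(outlier, mu, sigma))
mix_typical = code_length_bits(bin_prob_mixture(typical, mu, sigma))
mix_outlier = code_length_bits(bin_prob_mixture(outlier, mu, sigma))

print(f"single Gaussian: typical {gauss_typical:.2f} bits, outlier {gauss_outlier:.2f} bits")
print(f"mixture:         typical {mix_typical:.2f} bits, outlier {mix_outlier:.2f} bits")
```

The mixture pays a small premium on well-predicted values in exchange for a much cheaper encoding of outliers. The uniform floor also keeps every symbol's probability strictly positive, which a single Gaussian cannot guarantee in floating point for values far from μ.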
Compression Ratios Range from 2.37× to 3.90× Depending on Precision
On bf16 caches, the method achieves 2.37× to 2.70× compression across model sizes from 0.6B to 32B parameters. On FP8 caches, compression improves to 3.08× to 3.90× across the same range. Because quantizing bf16 to FP8 already halves the cache, the combination yields 6× to 8× total compression relative to the original bf16 cache. Notably, bitrate decreases monotonically with model size: larger targets compress better, by approximately 0.9 bits per scalar from 0.6B to 32B parameters.
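The arithmetic behind these figures is straightforward to reproduce. The sketch below converts the reported range endpoints into bits per scalar and into total compression relative to bf16, assuming only that bf16 stores 16 bits per scalar and FP8 stores 8. The function names are illustrative.

```python
BF16_BITS = 16
FP8_BITS = 8

def bits_per_scalar(raw_bits, compression_ratio):
    """Average coded size of one scalar after lossless compression."""
    return raw_bits / compression_ratio

def total_ratio_vs_bf16(fp8_lossless_ratio):
    """Quantization (bf16 -> FP8) followed by lossless coding of the FP8 cache."""
    return (BF16_BITS / FP8_BITS) * fp8_lossless_ratio

# Reported range endpoints from the article.
bf16_ratios = (2.37, 2.70)
fp8_ratios = (3.08, 3.90)

for r in bf16_ratios:
    print(f"bf16 cache at {r:.2f}x -> {bits_per_scalar(BF16_BITS, r):.2f} bits per scalar")
for r in fp8_ratios:
    print(f"FP8 cache at {r:.2f}x -> {bits_per_scalar(FP8_BITS, r):.2f} bits per scalar, "
          f"{total_ratio_vs_bf16(r):.2f}x total vs bf16")
```

The FP8 endpoints work out to 6.16× and 7.80× overall, which is where the rounded 6× to 8× figure comes from. Only the second factor is lossless; the bf16 to FP8 step is ordinary quantization.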
Lossless Compression Enables 4× Longer Contexts in Same Memory
Unlike quantization or pruning, which discard information, this method uses information theory to encode the exact cache values in fewer bits by exploiting their predictability. The decoded cache is bit-for-bit identical to the original. For long-context LLM inference, where KV cache memory is a major bottleneck, roughly 4× compression means roughly 4× longer contexts in the same memory, or 4× more concurrent users on the same hardware.
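A back-of-the-envelope calculation shows what this means for context length. The sketch below uses the standard per-token KV cache size for a decoder-only transformer (keys plus values, across all layers and KV heads). The model shape and the 16 GiB budget are hypothetical, and the estimate ignores the memory used by the predictor itself and any coder overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_scalar):
    """Uncompressed KV cache size for one token; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_scalar

def max_tokens(budget_bytes, bytes_per_token, compression_ratio=1.0):
    """Tokens that fit in the budget when the cache shrinks by compression_ratio."""
    return int(budget_bytes * compression_ratio // bytes_per_token)

# Hypothetical model shape and memory budget (not taken from the article).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128,
                               bytes_per_scalar=1)   # FP8 cache
budget = 16 * 2**30                                  # 16 GiB reserved for KV cache

baseline = max_tokens(budget, per_token)
compressed = max_tokens(budget, per_token, compression_ratio=3.90)

print(f"{per_token} bytes per token")
print(f"uncompressed FP8 cache: {baseline} tokens")
print(f"with 3.90x lossless coding: {compressed} tokens")
```

The same multiplier applies whether the extra capacity goes to one longer context or is split across more concurrent requests.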
The method works with any decoder-only transformer and requires no model retraining, making it immediately practical for deployment. The research addresses a critical infrastructure challenge as context windows continue to expand and inference costs remain dominated by memory bandwidth rather than compute.
Key Takeaways
- Speculative KV coding achieves 2.37× to 3.90× lossless compression of KV cache using arithmetic coding with predictive models
- The method uses an FP8-quantized version of the target model as predictor, requiring no additional training
- Combined with FP8 quantization, total compression reaches 6× to 8× relative to the original bf16 cache
- Larger models compress better, with approximately 0.9 bits per scalar improvement from 0.6B to 32B parameters
- The decoded cache is bit-for-bit identical to the original, enabling 4× longer contexts in the same memory footprint