Researchers Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis published a paper on June 4, 2026, questioning whether all three projections (Query, Key, and Value) are necessary in transformer attention mechanisms. The work drew 66 points on Hacker News. It reports that shared projections achieve comparable performance while substantially reducing KV cache memory.
Q-K=V Variant Achieves 50 Percent KV Cache Reduction
The paper systematically evaluates three projection-sharing variants: Q-K=V, where keys and values share a projection; Q=K-V, where queries and keys share a projection; and Q=K=V, with a single unified projection. Tested across synthetic problems, computer vision tasks including MNIST and CIFAR, and language models up to 1.2 billion parameters, the variants perform on par with, or occasionally better than, standard QKV transformers.
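To make the three variants concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: the helper names (`make_projections`, `attention`), the random initialization, and the tiny dimensions are all assumptions chosen for illustration. The only point it shows is which weight matrices are shared in each variant.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical initializer: small Gaussian weights."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]


def make_projections(variant, d_model, d_head, rng):
    """Return (w_q, w_k, w_v); shared projections are the same object."""
    new = lambda: random_matrix(d_model, d_head, rng)
    if variant == "QKV":      # standard: three separate projections
        return new(), new(), new()
    if variant == "Q-K=V":    # keys and values share one projection
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if variant == "Q=K-V":    # queries and keys share one projection
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if variant == "Q=K=V":    # a single unified projection
        w = new()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")


def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head, no masking."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = list(zip(*k))  # transpose of k
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


if __name__ == "__main__":
    rng = random.Random(0)
    x = random_matrix(5, 8, rng)  # 5 tokens, d_model = 8
    for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
        w_q, w_k, w_v = make_projections(variant, 8, 4, rng)
        out = attention(x, w_q, w_k, w_v)
        distinct = len({id(w_q), id(w_k), id(w_v)})
        print(variant, "distinct projections:", distinct, "output rows:", len(out))
```

In the Q-K=V case the projected keys and values are the same tensor, so an inference cache would only need to store it once. That is where the cache saving discussed in this article comes from.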
The Q-K=V configuration alone achieves 50 percent KV cache reduction with only 3.1 percent perplexity degradation. When combined with Grouped Query Attention using 4 groups, cache reduction reaches 87.5 percent. Combining Q-K=V with Multi-Query Attention enables 96.9 percent cache reduction, making billion-parameter models feasible for on-device inference.
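The percentages above follow from simple arithmetic if one assumes a model with 16 query heads. That head count is not stated in this article; it is the value that reproduces the quoted figures, so treat it as an assumption. The function names below are hypothetical.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, share_kv):
    """Fraction of the baseline multi-head KV cache that remains.

    Fewer KV heads shrink the cache by num_kv_heads / num_query_heads,
    and sharing K and V stores one tensor instead of two.
    """
    head_fraction = num_kv_heads / num_query_heads
    tensor_fraction = 0.5 if share_kv else 1.0
    return head_fraction * tensor_fraction


def reduction_percent(fraction):
    """Convert a remaining fraction into a percentage reduction."""
    return 100.0 * (1.0 - fraction)


HEADS = 16  # assumption: not stated in the article

if __name__ == "__main__":
    configs = [
        ("Q-K=V alone", HEADS),      # every query head keeps its own KV head
        ("Q-K=V + GQA (4 groups)", 4),
        ("Q-K=V + MQA", 1),
    ]
    for name, kv_heads in configs:
        frac = kv_cache_fraction(HEADS, kv_heads, share_kv=True)
        print(f"{name}: {reduction_percent(frac):.1f}% reduction")
```

Under that assumption the three cases come out to 50.0, 87.5, and 96.875 percent, which rounds to the 96.9 percent figure quoted for Multi-Query Attention.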
Low-Rank Attention Regime Enables Projection Sharing
The researchers argue that keys and values can occupy similar representational spaces because attention operates in a low-rank regime. This property suggests the separate K and V projections may be redundant for many tasks. The paper includes ablation studies showing Q-K=V outperforms both Q=K-V and Q=K=V on most evaluated tasks. In the Q=K variants, sharing the query and key projection makes the pre-softmax score matrix symmetric, and the paper reports that 2D positional encodings mitigate the resulting symmetry problems.
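The symmetry issue in the Q=K variants can be checked directly: when queries and keys come from the same projection, the score for token i attending to token j equals the score for j attending to i. The sketch below illustrates only that property, using hypothetical helper names and random data. It does not reproduce the paper's positional-encoding fix.

```python
import random


def project(x, w):
    """Project token vectors x (rows) through weight matrix w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def scores(q, k):
    """Pre-softmax attention scores: dot product of every query with every key."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def is_symmetric(m, tol=1e-9):
    """True if m[i][j] matches m[j][i] for all i, j."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]    # 4 tokens
w_a = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]  # one projection
w_b = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]  # another

shared = scores(project(x, w_a), project(x, w_a))    # Q = K
separate = scores(project(x, w_a), project(x, w_b))  # Q != K

if __name__ == "__main__":
    print("shared projection symmetric:", is_symmetric(shared))
    print("separate projections symmetric:", is_symmetric(separate))
```

With a shared projection, a token cannot score a neighbor differently from how that neighbor scores it, which is one plausible reading of why the Q=K variants need extra positional information.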
Performance holds across different model scales up to 1.2 billion parameters and generalizes across modalities including vision and language tasks. The approach requires no architectural changes beyond projection sharing, making it compatible with existing optimization techniques like Grouped Query Attention and Multi-Query Attention for multiplicative benefits.
Edge Deployment Benefits from Dramatic Memory Savings
For production LLM deployments, the 50 percent KV cache reduction from Q-K=V could halve the memory spent on cached keys and values during inference. The cache is only one part of total inference memory, alongside the model weights. The 96.9 percent reduction achieved by combining Q-K=V with Multi-Query Attention leaves about 1/32 of the original cache, enough to fit roughly 32 times more context in the same cache footprint. This is particularly valuable for edge inference, where memory is often the primary bottleneck.
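A 96.9 percent reduction leaves about 1/32 of the cache, so a fixed cache budget holds about 32 times as many tokens. The back-of-the-envelope sketch below works that through for a hypothetical model; the layer count, head count, head dimension, fp16 storage, and 1 GiB budget are all assumptions, not figures from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value, share_kv):
    """Bytes of KV cache needed per token of context."""
    tensors = 1 if share_kv else 2  # shared K=V stores one tensor, not two
    return num_layers * num_kv_heads * head_dim * bytes_per_value * tensors


def max_context_tokens(budget_bytes, bytes_per_token):
    """Whole tokens of context that fit in a cache budget."""
    return budget_bytes // bytes_per_token


# Hypothetical configuration, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM, FP16_BYTES = 24, 16, 128, 2
BUDGET = 1024 ** 3  # 1 GiB reserved for the KV cache

baseline = kv_bytes_per_token(LAYERS, HEADS, HEAD_DIM, FP16_BYTES, share_kv=False)
shared_mqa = kv_bytes_per_token(LAYERS, 1, HEAD_DIM, FP16_BYTES, share_kv=True)

if __name__ == "__main__":
    print("baseline tokens:", max_context_tokens(BUDGET, baseline))
    print("Q-K=V + MQA tokens:", max_context_tokens(BUDGET, shared_mqa))
    print("per-token cache ratio:", baseline // shared_mqa)
```

The ratio depends only on the head count and the K=V sharing, not on the layer count or head dimension, so the 32x figure holds for any model with 16 query heads under these assumptions.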
Hacker News discussion highlighted excitement about memory savings for edge deployment and questions about interaction with recent KV-cache quantization work. Commenters discussed whether the approach could be retrofitted to existing trained models and noted the mathematical elegance of the low-rank attention argument. Interest focused on practical applications for mobile and IoT devices where resource constraints limit deployment of large language models.
Open Questions Remain for Larger Scale Models
The research tested models up to 1.2 billion parameters, leaving unclear whether findings hold for models exceeding 100 billion parameters. Evaluation on standard benchmarks provides no insight into impact on specialized tasks like code generation or mathematical reasoning. The paper does not analyze training dynamics, convergence speed, or effects on model interpretability and emergent capabilities.
This work complements existing memory optimization techniques including Grouped Query Attention, Multi-Query Attention, and KV-cache quantization methods. The combination of Q-K=V with MQA and quantization could enable extremely efficient inference as LLMs move to edge devices. The research suggests transformer attention mechanisms may be over-parameterized, with simpler architectures achieving similar performance at dramatically lower resource requirements.
Key Takeaways
- Q-K=V projection sharing reduces KV cache by 50 percent with only 3.1 percent perplexity degradation, in models up to 1.2B parameters
- Combining Q-K=V with Multi-Query Attention enables 96.9 percent cache reduction, making billion-parameter models viable for edge devices
- The paper argues that keys and values can share representational spaces because attention operates in a low-rank regime
- Performance holds across computer vision and language modeling tasks, with Q-K=V outperforming other sharing variants
- The approach combines multiplicatively with existing techniques like Grouped Query Attention and Multi-Query Attention, and could complement KV-cache quantization for further optimization