Researchers have introduced Energy-Based Fine-Tuning (EBFT), a new approach to language model training that matches the accuracy of reinforcement learning methods while achieving lower validation cross-entropy. The method, detailed in a paper published on arXiv on March 12, 2026 by Samy Jelassi, Mujin Kwun, and colleagues, addresses a fundamental limitation in how language models are trained.
Cross-Entropy Training Optimizes for Token Prediction, Not Sequence Behavior
Traditional cross-entropy training optimizes next-token prediction under teacher forcing, which doesn't directly target the sequence-level behavior that occurs during actual model deployment. While this approach provides dense and scalable supervision, it creates a mismatch between training and inference: the model is trained on ground-truth prefixes but must generate from its own outputs at inference time. Reinforcement learning approaches like RLVR (Reinforcement Learning with Verifiable Rewards) address this mismatch but often require task-specific verifiers or preference models.
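The teacher-forcing mismatch can be made concrete with a toy sketch. The code below (illustrative only; `predict_next` is a stand-in for a real model's next-token distribution) shows that at every training step the model is conditioned on the ground-truth prefix, never on its own sampled output:

```python
import math

def teacher_forced_cross_entropy(target, predict_next):
    """Token-level cross-entropy under teacher forcing: each step
    conditions on the gold prefix target[:t], so training never sees
    the model's own generated prefixes."""
    loss = 0.0
    for t in range(1, len(target)):
        probs = predict_next(target[:t])   # condition on ground-truth prefix
        loss -= math.log(probs[target[t]])
    return loss / (len(target) - 1)

# Toy "model": uniform distribution over a 4-token vocabulary.
uniform = lambda prefix: {tok: 0.25 for tok in range(4)}
loss = teacher_forced_cross_entropy([0, 1, 2, 3], uniform)
# Uniform predictions give -log(1/4) = log(4) per token.
```

At deployment, by contrast, each step would condition on the model's own previous samples, which is exactly the gap sequence-level objectives try to close.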
EBFT Matches Feature Statistics Instead of Individual Tokens
The core innovation of EBFT is a feature-matching objective that targets sequence-level statistics of the completion distribution. Instead of optimizing token-by-token predictions, EBFT provides dense semantic feedback by matching feature-space representations of generated text to target completions.
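A minimal sketch of a feature-matching loss makes the idea concrete. Note this is an illustration, not the paper's implementation: the choice of statistic (a mean over sequences) and the toy `feature_fn` are assumptions standing in for the learned feature extractor:

```python
def mean_features(sequences, feature_fn):
    """Average a feature vector over a set of sequences."""
    feats = [feature_fn(s) for s in sequences]
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def feature_matching_loss(rollouts, targets, feature_fn):
    """Squared distance between sequence-level feature statistics of
    model rollouts and target completions -- matching distributions
    in feature space rather than individual tokens."""
    mu_model = mean_features(rollouts, feature_fn)
    mu_target = mean_features(targets, feature_fn)
    return sum((a - b) ** 2 for a, b in zip(mu_model, mu_target))

# Toy feature: [sequence length, count of token 0] -- a stand-in
# for an embedding produced by a feature extractor.
feat = lambda seq: [float(len(seq)), float(seq.count(0))]
loss = feature_matching_loss(
    rollouts=[[0, 1, 2], [1, 2]],    # model samples
    targets=[[0, 1], [0, 2]],        # reference completions
    feature_fn=feat,
)
```

The loss is zero exactly when the rollouts' feature statistics match the targets', which is what provides dense feedback even when no individual token matches.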
The technical implementation uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently. The method batches feature extraction over these rollouts and uses the resulting embeddings to perform on-policy policy-gradient updates. The authors present theoretical connections between EBFT and KL-regularized feature-matching and energy-based modeling.
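The sampling scheme can be sketched as follows, under the assumption that "strided" means prefixes are taken every fixed number of tokens; `sample_fn` is a hypothetical stand-in for the policy's sampler:

```python
def strided_prefixes(tokens, stride):
    """Nested prefixes taken every `stride` tokens. Each prefix becomes
    a conditioning context for rollouts, and because the prefixes are
    nested they can all be launched concurrently as one batch."""
    return [tokens[:end] for end in range(stride, len(tokens) + 1, stride)]

def batched_rollouts(prefixes, sample_fn, n_per_prefix):
    """One flat batch of (prefix, rollout) pairs, ready for batched
    feature extraction over all rollouts at once."""
    return [(p, sample_fn(p)) for p in prefixes for _ in range(n_per_prefix)]

prefixes = strided_prefixes([10, 11, 12, 13, 14, 15], stride=2)
# Three nested prefixes of lengths 2, 4, and 6.
rollouts = batched_rollouts(prefixes, lambda p: p + [99], n_per_prefix=2)
# 3 prefixes x 2 rollouts each = 6 batched samples.
```

The resulting rollout embeddings would then feed the on-policy policy-gradient update described above; that step is omitted here since it depends on the authors' exact objective.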
Performance Gains Across Multiple Benchmarks
EBFT demonstrated strong empirical results across three challenging domains:
- Q&A coding tasks: Matched RLVR accuracy while achieving lower validation loss
- Unstructured coding: Outperformed supervised fine-tuning (SFT) on downstream accuracy
- Translation: Maintained competitive performance across all metrics
Crucially, EBFT achieved lower validation cross-entropy than both RLVR and SFT methods, suggesting better generalization properties. This represents a practical middle ground between supervised fine-tuning's simplicity and reinforcement learning's sequence-level optimization.
Key Takeaways
- EBFT introduces a feature-matching objective that optimizes for sequence-level statistics rather than token-by-token predictions
- The method matches RLVR's downstream accuracy across Q&A coding, unstructured coding, and translation tasks
- EBFT achieves lower validation cross-entropy than both RLVR and supervised fine-tuning, indicating better generalization
- The approach uses strided block-parallel sampling to efficiently generate and process multiple rollouts from nested prefixes
- EBFT provides a practical middle ground between supervised learning and reinforcement learning for language model fine-tuning