Researchers have introduced Energy-Based Fine-Tuning (EBFT), a new approach to language model training that matches the accuracy of reinforcement learning methods while achieving lower validation cross-entropy. The method, detailed in a paper published on arXiv on March 12, 2026 by Samy Jelassi, Mujin Kwun, and colleagues, addresses a fundamental limitation in how language models are trained.
Cross-Entropy Training Optimizes for Token Prediction, Not Sequence Behavior
Traditional cross-entropy training optimizes next-token prediction under teacher forcing, which doesn't directly target the sequence-level behavior that occurs during actual model deployment. While this approach provides dense and scalable supervision, it creates a mismatch between training and inference: the model is trained on ground-truth prefixes but must generate from its own outputs at inference time. Reinforcement learning approaches like RLVR (Reinforcement Learning with Verifiable Rewards) address this mismatch but often require task-specific verifiers or preference models.
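The teacher-forcing mismatch can be made concrete with a toy sketch. The code below (illustrative only; `predict_next` is a stand-in for a real model's next-token distribution) shows that at every training step the model is conditioned on the ground-truth prefix, never on its own sampled output:

```python
import math

def teacher_forced_cross_entropy(target, predict_next):
    """Token-level cross-entropy under teacher forcing: each step
    conditions on the gold prefix target[:t], so training never sees
    the model's own generated prefixes."""
    loss = 0.0
    for t in range(1, len(target)):
        probs = predict_next(target[:t])   # condition on ground-truth prefix
        loss -= math.log(probs[target[t]])
    return loss / (len(target) - 1)

# Toy "model": uniform distribution over a 4-token vocabulary.
uniform = lambda prefix: {tok: 0.25 for tok in range(4)}
loss = teacher_forced_cross_entropy([0, 1, 2, 3], uniform)
# Uniform predictions give -log(1/4) = log(4) per token.
```

At deployment, by contrast, each step would condition on the model's own previous samples, which is exactly the gap sequence-level objectives try to close.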
EBFT Matches Feature Statistics Instead of Individual Tokens
The core innovation of EBFT is a feature-matching objective that targets sequence-level statistics of the completion distribution. Instead of optimizing token-by-token predictions, EBFT provides dense semantic feedback by matching feature-space representations of generated text to target completions.
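A minimal sketch of a feature-matching loss makes the idea concrete. Note this is an illustration, not the paper's implementation: the choice of statistic (a mean over sequences) and the toy `feature_fn` are assumptions standing in for the learned feature extractor:

```python
def mean_features(sequences, feature_fn):
    """Average a feature vector over a set of sequences."""
    feats = [feature_fn(s) for s in sequences]
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def feature_matching_loss(rollouts, targets, feature_fn):
    """Squared distance between sequence-level feature statistics of
    model rollouts and target completions -- matching distributions
    in feature space rather than individual tokens."""
    mu_model = mean_features(rollouts, feature_fn)
    mu_target = mean_features(targets, feature_fn)
    return sum((a - b) ** 2 for a, b in zip(mu_model, mu_target))

# Toy feature: [sequence length, count of token 0] -- a stand-in
# for an embedding produced by a feature extractor.
feat = lambda seq: [float(len(seq)), float(seq.count(0))]
loss = feature_matching_loss(
    rollouts=[[0, 1, 2], [1, 2]],    # model samples
    targets=[[0, 1], [0, 2]],        # reference completions
    feature_fn=feat,
)
```

The loss is zero exactly when the rollouts' feature statistics match the targets', which is what provides dense feedback even when no individual token matches.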
The technical implementation uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently. The method batches feature extraction over these rollouts and uses the resulting embeddings to perform on-policy policy-gradient updates. The authors present theoretical connections between EBFT and KL-regularized feature-matching and energy-based modeling.
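The sampling scheme can be sketched as follows, under the assumption that "strided" means prefixes are taken every fixed number of tokens; `sample_fn` is a hypothetical stand-in for the policy's sampler:

```python
def strided_prefixes(tokens, stride):
    """Nested prefixes taken every `stride` tokens. Each prefix becomes
    a conditioning context for rollouts, and because the prefixes are
    nested they can all be launched concurrently as one batch."""
    return [tokens[:end] for end in range(stride, len(tokens) + 1, stride)]

def batched_rollouts(prefixes, sample_fn, n_per_prefix):
    """One flat batch of (prefix, rollout) pairs, ready for batched
    feature extraction over all rollouts at once."""
    return [(p, sample_fn(p)) for p in prefixes for _ in range(n_per_prefix)]

prefixes = strided_prefixes([10, 11, 12, 13, 14, 15], stride=2)
# Three nested prefixes of lengths 2, 4, and 6.
rollouts = batched_rollouts(prefixes, lambda p: p + [99], n_per_prefix=2)
# 3 prefixes x 2 rollouts each = 6 batched samples.
```

The resulting rollout embeddings would then feed the on-policy policy-gradient update described above; that step is omitted here since it depends on the authors' exact objective.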
Performance Gains Across Multiple Benchmarks
EBFT demonstrated strong empirical results across three challenging domains:
- Q&A coding tasks: Matched RLVR accuracy while achieving lower validation loss
- Unstructured coding: Outperformed supervised fine-tuning (SFT) on downstream accuracy
- Translation: Maintained competitive performance across all metrics
Crucially, EBFT achieved lower validation cross-entropy than both RLVR and SFT methods, suggesting better generalization properties. This represents a practical middle ground between supervised fine-tuning's simplicity and reinforcement learning's sequence-level optimization.
Key Takeaways
- EBFT introduces a feature-matching objective that optimizes for sequence-level statistics rather than token-by-token predictions
- The method matches RLVR's downstream accuracy across Q&A coding, unstructured coding, and translation tasks
- EBFT achieves lower validation cross-entropy than both RLVR and supervised fine-tuning, indicating better generalization
- The approach uses strided block-parallel sampling to efficiently generate and process multiple rollouts from nested prefixes
- EBFT provides a practical middle ground between supervised learning and reinforcement learning for language model fine-tuning