Researchers Mingwei Xu and Hao Fang have introduced POPO (Positive-Only Policy Optimization), a novel reinforcement learning framework that trains language models using only positive examples—no negative rollouts required. Published on arXiv on May 7, 2026, POPO achieved 36.67% accuracy on the notoriously difficult AIME 2025 mathematics benchmark using Qwen-Math-7B, significantly outperforming the GRPO baseline's 30.00% score.
Traditional Methods Rely on Contrasting Positive and Negative Examples
Standard policy optimization methods like PPO and GRPO typically learn by contrasting successful (positive) rollouts with failed (negative) ones. The POPO framework challenges this paradigm by demonstrating that effective learning can occur exclusively through positive examples. The key insight is that negative rollouts often provide limited information in sparse reward settings—a single failed attempt may not meaningfully represent the vast space of possible failures, and failures themselves may not have gradations of severity that guide learning.
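To make the limitation concrete, here is a minimal PyTorch sketch of GRPO-style group normalization (the general recipe, not any specific codebase): with binary verifiable rewards, every failed rollout in a group collapses to the identical negative advantage.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative normalization: each rollout's reward is compared
    # against the mean and standard deviation of its group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Binary verifiable rewards for one prompt's group of 8 rollouts:
rewards = torch.tensor([[1., 0., 0., 1., 0., 0., 0., 0.]])
print(grpo_advantages(rewards))
# tensor([[ 1.6202, -0.5401, -0.5401,  1.6202, -0.5401, ...]])
# Every failure maps to the same negative scalar: the penalty carries
# no information about *how* each attempt went wrong.
```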
POPO Uses Bounded Importance Sampling and Siamese Networks
POPO's core technical innovation is bounded importance sampling over sets of positive rollouts: implicit negative gradients emerge naturally as probability mass is redistributed among successful examples. The framework employs two stabilization mechanisms: a Siamese policy network with momentum-based adaptation for controlled policy evolution, and a bounded similarity penalty in the Siamese representation space that replaces the traditional KL-divergence constraint.
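The following PyTorch sketch illustrates one plausible shape for these mechanisms; it is not the authors' implementation. The names and values `popo_loss`, `ratio_bound`, `beta`, `TinyPolicy`, and `momentum_update` are all illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

class TinyPolicy(torch.nn.Module):
    """Toy stand-in for a language model: per-token logits plus hidden states."""
    def __init__(self, vocab=32, dim=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        return self.head(h), h

def token_logps(model, tokens):
    logits, hidden = model(tokens)
    logps = F.log_softmax(logits, dim=-1)
    return torch.gather(logps, -1, tokens.unsqueeze(-1)).squeeze(-1), hidden

def popo_loss(policy, siamese, old_logps, tokens, mask, ratio_bound=1.2, beta=0.1):
    logps, hidden = token_logps(policy, tokens)
    # Bounded importance sampling over positive rollouts only: reinforce each
    # positive token, but cap how hard any single token is pushed up.
    ratio = torch.exp(logps - old_logps)
    pg_loss = -(ratio.clamp(max=ratio_bound) * mask).sum() / mask.sum()

    # Bounded similarity penalty against the momentum (Siamese) network's
    # representations, standing in for the usual KL-divergence constraint.
    with torch.no_grad():
        _, siam_hidden = siamese(tokens)
    dissim = (1.0 - F.cosine_similarity(hidden, siam_hidden, dim=-1)).clamp(max=1.0)
    return pg_loss + beta * (dissim * mask).sum() / mask.sum()

@torch.no_grad()
def momentum_update(policy, siamese, m=0.99):
    # Momentum-based adaptation: the Siamese network trails the policy as an EMA.
    for p, s in zip(policy.parameters(), siamese.parameters()):
        s.mul_(m).add_(p, alpha=1.0 - m)

policy = TinyPolicy()
siamese = copy.deepcopy(policy)
tokens = torch.randint(0, 32, (4, 10))           # four verified-positive rollouts
mask = torch.ones(tokens.shape)                   # 1 = trainable token position
with torch.no_grad():
    old_logps, _ = token_logps(policy, tokens)    # log-probs under rollout policy

loss = popo_loss(policy, siamese, old_logps, tokens, mask)
loss.backward()
momentum_update(policy, siamese)
```

Note that the loss contains no negative-rollout term at all: the clamp bounds the importance ratio, and the slowly moving EMA copy supplies the stable reference that the article describes.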
Tested across the Qwen model family on mathematical reasoning benchmarks, POPO matched or outperformed GRPO across multiple difficulty levels. Ablation studies confirmed that each component of the framework contributes meaningfully to overall performance.
POPO Represents a Paradigm Shift in Policy Optimization
The framework addresses a fundamental challenge in reinforcement learning with verifiable rewards (RLVR): in problems with sparse binary rewards and massive combinatorial spaces, penalizing a few sampled negative examples is unlikely to provide meaningful gradient signals. By focusing exclusively on reinforcing positive probability mass, POPO avoids this issue while maintaining stable training dynamics.
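A compact way to write a positive-only surrogate of this kind (a sketch consistent with the description above, not necessarily the paper's exact objective; $\epsilon$ is an assumed clipping parameter):

```latex
\[
\mathcal{J}(\theta) \;=\;
\mathbb{E}_{x}\!\left[
  \frac{1}{|\mathcal{P}(x)|}
  \sum_{y \in \mathcal{P}(x)}
  \min\!\left(
    \frac{\pi_\theta(y \mid x)}{\pi_{\text{old}}(y \mid x)},\;
    1 + \epsilon
  \right)
\right]
\]
```

Here $\mathcal{P}(x)$ is the set of verified-correct rollouts for prompt $x$. Because $\pi_\theta(\cdot \mid x)$ is a normalized distribution, any increase in probability mass on $\mathcal{P}(x)$ necessarily removes mass from everything else, which is where the implicit negative gradient comes from.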
The research has implications beyond mathematical reasoning, potentially applicable to any domain where positive examples are easier to identify and verify than negative ones, such as code generation, formal theorem proving, and other structured reasoning tasks.
Key Takeaways
- POPO achieved 36.67% on AIME 2025 with Qwen-Math-7B, a 22% relative improvement over GRPO's 30.00%
- The framework learns exclusively from positive rollouts without requiring negative examples for contrast
- Implicit negative gradients emerge through probability redistribution among successful examples
- Stabilization mechanisms include Siamese policy networks and bounded similarity penalties
- Ablation studies confirm each component of POPO contributes meaningfully to overall performance