A new study published on arXiv on April 5, 2026, reveals that leading language models including GPT-5.2, Gemini-3, and DeepSeek-V3.2 exhibit systematic weaknesses in adapting to changing environments. The research used reversal learning tasks to demonstrate that even frontier models struggle with non-stationary uncertainty, challenging assumptions about their human-like reasoning capabilities.
Asymmetric Learning Reveals Win-Stay, Lose-Persist Pattern
Researchers tested the three frontier models in a two-option probabilistic reversal-learning task that compared deterministic fixed transitions against stochastic random schedules. Across all models, win-stay behavior approached ceiling performance while lose-shift responses were markedly attenuated, revealing asymmetric processing of positive versus negative evidence.
The models learned effectively from successes but struggled to adapt when outcomes signaled that their current strategy was failing. The asymmetry persisted even when researchers increased environmental volatility with random transition schedules.
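The task and metrics described above can be illustrated with a minimal simulation. This is a sketch, not the authors' protocol: the reward probabilities, trial counts, and the `perseverative_policy` agent are illustrative assumptions chosen to show how win-stay and lose-shift rates are computed and how pure perseveration scores on them.

```python
import random

def simulate_reversal_task(policy, n_trials=200, reversal_at=100,
                           p_reward_good=0.8, p_reward_bad=0.2, seed=0):
    """Two-option probabilistic task whose better option flips once mid-run."""
    rng = random.Random(seed)
    good = 0  # index of the currently better option
    choices, rewards = [], []
    for t in range(n_trials):
        if t == reversal_at:
            good = 1 - good  # reversal: reward contingencies swap
        choice = policy(choices, rewards)
        p = p_reward_good if choice == good else p_reward_bad
        choices.append(choice)
        rewards.append(1 if rng.random() < p else 0)
    return choices, rewards

def win_stay_lose_shift_rates(choices, rewards):
    """Fraction of repeats after a win and of switches after a loss."""
    stays = [choices[t] == choices[t - 1]
             for t in range(1, len(choices)) if rewards[t - 1] == 1]
    shifts = [choices[t] != choices[t - 1]
              for t in range(1, len(choices)) if rewards[t - 1] == 0]
    win_stay = sum(stays) / len(stays) if stays else float("nan")
    lose_shift = sum(shifts) / len(shifts) if shifts else float("nan")
    return win_stay, lose_shift

# A maximally rigid agent: it always repeats its previous choice,
# so it never adapts to the reversal no matter how often it loses.
def perseverative_policy(choices, rewards):
    return choices[-1] if choices else 0

choices, rewards = simulate_reversal_task(perseverative_policy)
ws, ls = win_stay_lose_shift_rates(choices, rewards)
print(ws, ls)  # perseveration scores win-stay = 1.0, lose-shift = 0.0
```

A rigid agent like this still collects reward on roughly `p_reward_bad` of its post-reversal trials, which is one way high aggregate payoff can coexist with zero adaptation.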
DeepSeek-V3.2 Shows Extreme Perseveration While Others Adapt Slowly
Model-specific performance patterns emerged:
- DeepSeek-V3.2: Exhibited extreme perseveration after reversals with weak initial acquisition of new patterns
- Gemini-3 and GPT-5.2: Adapted more rapidly than DeepSeek but remained significantly less loss-sensitive than human baseline performance
- All models: Random transitions amplified reversal-specific persistence without uniformly reducing total wins
High Performance Can Mask Adaptation Failures
A critical finding is that high aggregate payoff can coexist with rigid adaptation: models may score well on static benchmarks while failing when conditions change, a concern for real-world deployment, where environments are rarely stationary.
Hierarchical reinforcement-learning fits indicated that the rigidity arises from dissociable mechanisms, including weak loss learning, inflated policy determinism, and value polarization via counterfactual suppression. These mechanistic insights suggest that current training approaches may not adequately prepare models for non-stationary environments.
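Two of those fitted mechanisms, weak loss learning and inflated policy determinism, can be sketched with a standard asymmetric Q-learning update and a softmax policy. This is a generic textbook formulation under assumed parameter names (`alpha_gain`, `alpha_loss`, `beta`), not the authors' hierarchical model: it only shows why a small loss learning rate leaves a previously rewarded option looking valuable long after a reversal.

```python
import math
import random

def softmax_choice(q, beta, rng):
    """Sample an option from softmax(beta * q); larger beta = more deterministic."""
    exps = [math.exp(beta * v) for v in q]
    r = rng.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(q) - 1

def asymmetric_q_update(q, choice, reward, alpha_gain, alpha_loss):
    """Update the chosen option's value with an outcome-dependent learning rate."""
    delta = reward - q[choice]              # prediction error
    alpha = alpha_gain if delta >= 0 else alpha_loss
    q[choice] += alpha * delta
    return q

# Weak loss learning (alpha_loss << alpha_gain): after a reversal, 20 straight
# unrewarded trials barely erode the stale value of the old favorite.
q = [0.9, 0.1]          # values learned before the reversal
for _ in range(20):     # post-reversal: option 0 now never pays off
    q = asymmetric_q_update(q, 0, 0, alpha_gain=0.5, alpha_loss=0.02)
print(q)  # q[0] decays only as 0.9 * 0.98**20, still far above q[1]
```

With a high `beta`, the softmax then keeps converting that inflated stale value into near-deterministic repetition of the failing choice, compounding the weak loss learning.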
Implications for Evaluation and Development
The authors, Haomiaomiao Wang, Tomás E Ward, and Lili Zhang, argue that the results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty. Current benchmarks may fail to capture critical aspects of practical intelligence by focusing on static problem-solving rather than adaptation.
The research suggests that even as models achieve impressive performance on standard benchmarks, they may lack fundamental capabilities for handling the kind of dynamic, changing environments they would encounter in real-world applications.
Key Takeaways
- Study tested GPT-5.2, Gemini-3, and DeepSeek-V3.2 in reversal learning tasks measuring adaptation to changing environments
- All models showed asymmetric learning with near-ceiling win-stay behavior but markedly attenuated lose-shift responses
- DeepSeek-V3.2 exhibited extreme perseveration after reversals while Gemini-3 and GPT-5.2 adapted more rapidly but remained less loss-sensitive than humans
- High aggregate performance can mask rigid adaptation, suggesting current benchmarks may not capture critical aspects of practical intelligence
- Findings indicate that frontier models lack fundamental capabilities for handling dynamic, non-stationary environments despite strong static benchmark performance