Researchers have released CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark that evaluates multimodal reasoning through 6,372 instances with verifiable intermediate steps. Published on arXiv on March 13, 2026, the benchmark reveals that all 20 evaluated multimodal large language models (MLLMs), including commercial frontier systems, exhibit universal cherry-picking behavior and disordered reasoning that accuracy-only metrics cannot detect.
Benchmark Tests Multimodal Reasoning Through Verifiable Steps
CRYSTAL addresses limitations in existing multimodal reasoning evaluations, which rely primarily on discriminative tasks like visual question answering that focus only on final accuracy. The benchmark instead evaluates models through verifiable intermediate steps, providing fine-grained analysis of reasoning quality. References were constructed through a Delphi-inspired pipeline where four independent MLLMs generated trajectories, which were then aggregated via semantic clustering and validated through human quality gates.
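The aggregation step can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: it uses bag-of-words cosine similarity as a stand-in for the semantic clustering the authors describe, and the similarity threshold, support count, and example trajectories are invented for demonstration.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between bag-of-words vectors (stand-in for embeddings)."""
    dot = sum(count * b[token] for token, count in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def aggregate_steps(trajectories, sim_threshold=0.6, min_support=3):
    """Greedily cluster steps proposed by independent models and keep only
    clusters supported by at least min_support models (hypothetical thresholds)."""
    clusters = []  # each cluster: {"rep": Counter, "text": str, "models": set}
    for model_id, steps in enumerate(trajectories):
        for step in steps:
            vec = Counter(step.lower().split())
            for c in clusters:
                if cosine(vec, c["rep"]) >= sim_threshold:
                    c["models"].add(model_id)
                    break
            else:
                clusters.append({"rep": vec, "text": step, "models": {model_id}})
    return [c["text"] for c in clusters if len(c["models"]) >= min_support]

# Four models propose trajectories for the same chart-reading question.
trajectories = [
    ["read the chart axes", "extract the 2021 value", "compute the difference"],
    ["read axes of the chart", "extract value for 2021", "subtract the two values"],
    ["look at the chart axes", "extract the 2021 value", "compute the difference"],
    ["identify the legend"],
]
consensus = aggregate_steps(trajectories)
print(consensus)  # only the steps agreed on by at least 3 of the 4 models
```

Paraphrases of the same step ("read the chart axes" / "read axes of the chart") fall into one cluster, while the outlier trajectory contributes no consensus step; human quality gates would then validate the surviving clusters.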
The benchmark introduces two complementary metrics: Match F1, which measures step-level precision and recall via semantic similarity matching against reference steps, and Ordered Match F1, which additionally penalizes disordered reasoning chains. This dual-metric approach decomposes tasks into localization and execution, both framed as generative problems.
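A minimal sketch of how such metrics could be computed, assuming token-overlap (Jaccard) similarity as a stand-in for the benchmark's semantic matcher; the 0.5 threshold and the greedy one-to-one matching are illustrative choices, not CRYSTAL's published procedure.

```python
import bisect

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a cheap stand-in for semantic matching."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match_steps(predicted, reference, threshold=0.5):
    """Greedy one-to-one matching of predicted steps to reference steps."""
    used, matches = set(), []  # matches: (pred_idx, ref_idx), in predicted order
    for i, p in enumerate(predicted):
        best_sim, best_j = 0.0, None
        for j, r in enumerate(reference):
            sim = jaccard(p, r)
            if j not in used and sim > best_sim:
                best_sim, best_j = sim, j
        if best_j is not None and best_sim >= threshold:
            used.add(best_j)
            matches.append((i, best_j))
    return matches

def f1(n_matched, n_pred, n_ref):
    prec = n_matched / n_pred if n_pred else 0.0
    rec = n_matched / n_ref if n_ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def match_f1(predicted, reference, threshold=0.5):
    return f1(len(match_steps(predicted, reference, threshold)),
              len(predicted), len(reference))

def ordered_match_f1(predicted, reference, threshold=0.5):
    """Credit only the longest in-order run of matches (longest increasing
    subsequence over reference indices), penalizing disordered chains."""
    ref_idx = [j for _, j in match_steps(predicted, reference, threshold)]
    tails = []  # tails[k] = smallest ending ref index of an increasing run of length k+1
    for j in ref_idx:
        k = bisect.bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return f1(len(tails), len(predicted), len(reference))

reference = ["locate the legend", "read the bar for 2020", "compare with 2021"]
shuffled = list(reversed(reference))  # every step present, wrong order
print(match_f1(shuffled, reference))          # 1.0: all steps are matched
print(ordered_match_f1(shuffled, reference))  # lower: ordering is penalized
```

The shuffled trajectory scores a perfect Match F1 but a much lower Ordered Match F1, which is exactly the gap the dual-metric design is meant to expose.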
Evaluation Reveals Three Universal Failure Modes
Testing across 20 MLLMs revealed three systematic failures. First, universal cherry-picking: precision far exceeds recall across all models, indicating they generate some correct steps but miss many required reasoning components. Second, non-monotonic scaling trade-offs: larger models don't consistently improve reasoning quality, challenging assumptions about model scaling. Third, disordered reasoning: no competitive model preserves more than 60% of matched steps in correct order.
These failures remain invisible to accuracy-only metrics, which discard execution traces and evaluate only final answers. CRYSTAL's step-level evaluation reveals that models achieving similar final accuracy can have dramatically different reasoning quality.
Causal Process Reward Training Achieves 32% Improvement
The researchers introduce Causal Process Reward (CPR), a training approach that couples answer correctness with step-level alignment through a multiplicative reward. Unlike additive reward strategies, CPR achieved a +32% Match F1 improvement when trained with GRPO (Group Relative Policy Optimization). A curriculum variant, CPR-Curriculum, progressively increases reasoning difficulty during training, improving reasoning quality without requiring manual step annotation.
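The difference between additive and multiplicative coupling can be shown with a toy sketch; the 0.5 weight, reward shapes, and group-normalization step are assumptions for demonstration, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def additive_reward(answer_correct: bool, step_alignment: float, w: float = 0.5):
    """Additive baseline: a correct answer earns reward even when the
    reasoning steps barely align with the reference."""
    return w * float(answer_correct) + (1 - w) * step_alignment

def causal_process_reward(answer_correct: bool, step_alignment: float):
    """Multiplicative coupling in the spirit of CPR: reward is nonzero only
    when the answer is correct AND the steps align with the reference."""
    return float(answer_correct) * step_alignment

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: each rollout's reward is
    normalized against the mean and std of its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma if sigma else 0.0 for r in rewards]

# Right answer, but steps that barely match the reference trajectory:
print(additive_reward(True, 0.1))        # near 0.55: lucky guesses still pay off
print(causal_process_reward(True, 0.1))  # 0.1: reward scales with alignment
print(causal_process_reward(False, 0.9)) # 0.0: plausible steps, wrong answer

# A group of four rollouts scored with CPR, normalized for the policy update:
rewards = [causal_process_reward(c, a) for c, a in
           [(True, 0.9), (True, 0.1), (False, 0.8), (False, 0.2)]]
print(grpo_advantages(rewards))
```

Under the multiplicative scheme, the policy cannot collect reward from answer luck or from plausible-looking but answer-irrelevant steps alone, which is one intuition for why step-level alignment can emerge without explicit step supervision.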
This training methodology demonstrates that step-level alignment can be learned through reward structure alone, without explicit supervision of intermediate reasoning steps. The approach moves evaluation beyond passive spatial reasoning toward "reasoning to act," with applications in robot planning and complex decision-making tasks.
Key Takeaways
- CRYSTAL evaluates 20 multimodal models across 6,372 instances with verifiable intermediate reasoning steps
- All tested models exhibit universal cherry-picking with precision far exceeding recall in reasoning tasks
- No competitive model preserves more than 60% of matched reasoning steps in correct order
- Larger models show non-monotonic scaling with inconsistent reasoning quality improvements
- Causal Process Reward training achieves +32% Match F1 improvement via GRPO without manual step annotation