Researchers have discovered that training language models on less, carefully selected data dramatically improves their ability to memorize facts. A new study from Apple and Google shows that a 110-million-parameter model trained with strategic data pruning can match the factual knowledge of a 1.3-billion-parameter model trained conventionally.
Training Data Overwhelms Model Capacity for Facts
The research team formalized fact memorization from an information-theoretic perspective and identified why LLMs struggle with factual accuracy. They found that fact accuracy falls below the theoretical capacity limit whenever the information content of the facts in the training data exceeds what the model can store. The problem worsens when fact frequency follows a skewed distribution such as a power law, which is common in real-world datasets.
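The capacity argument can be made concrete with a back-of-the-envelope sketch. All numbers below (the bits-per-parameter budget, per-fact entropy, and fact counts) are illustrative assumptions, not figures from the paper:

```python
def capacity_gap_bits(num_facts, bits_per_fact, num_params, bits_per_param=2.0):
    """Positive result => the training facts carry more information than the
    model can store, so fact accuracy falls below the capacity limit.
    The 2 bits/parameter budget is an assumed round number for illustration."""
    demand = num_facts * bits_per_fact    # information the facts ask the model to store
    budget = num_params * bits_per_param  # rough storage budget of the model
    return demand - budget

# Illustrative numbers only: 50M facts at 24 bits each vs. a 110M-param model.
gap = capacity_gap_bits(num_facts=50_000_000, bits_per_fact=24,
                        num_params=110_000_000)
print(gap > 0)  # True: demand (1.2e9 bits) exceeds budget (2.2e8 bits)
```

In this regime the fix the paper proposes is not a bigger model but a smaller, better-chosen fact set, so that demand drops below the budget.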
Current approaches using penalty-based methods fail because aggressive penalties suppress essential facts while mild penalties get lost in the variance of accuracy rewards. The team proposed a counterintuitive solution: limit the number of facts in training data and flatten their frequency distribution.
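Both parts of that solution, limiting the number of facts and flattening their frequency distribution, can be sketched as a simple preprocessing pass over a corpus of fact occurrences. The cap value, the `Counter`-based representation, and the string fact labels are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def prune_and_flatten(fact_occurrences, max_facts, cap):
    """Keep at most `max_facts` distinct facts and cap each surviving fact's
    frequency, turning a skewed (e.g. power-law) distribution into a
    roughly uniform one."""
    counts = Counter(fact_occurrences)
    kept = [f for f, _ in counts.most_common(max_facts)]   # limit the fact set
    flattened = []
    for fact in kept:
        flattened.extend([fact] * min(counts[fact], cap))  # flatten frequencies
    return flattened

# A power-law-like toy corpus: one fact dominates.
corpus = ["paris-capital"] * 90 + ["oslo-capital"] * 8 + ["quito-capital"] * 2
out = prune_and_flatten(corpus, max_facts=2, cap=5)
print(Counter(out))  # both kept facts now appear exactly 5 times
```

The point of the sketch is the shape of the output: the rarest fact is dropped entirely, and the dominant fact no longer consumes most of the training budget.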
GPT2-Small Achieves 1.3x Improvement on Wikipedia Facts
The researchers validated their approach through two sets of experiments. On semi-synthetic datasets with high-entropy facts, their data selection method successfully boosted fact accuracy to the theoretical capacity limit. More impressively, when pretraining GPT2-Small (110 million parameters) from scratch on annotated Wikipedia:
- The model memorized 1.3x more entity facts compared to standard training
- Performance matched a 10x larger model (1.3 billion parameters) pretrained on the full dataset
- Selection was based solely on training loss, making the method practical and scalable
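The loss-based selection in the last bullet can be sketched as a plain filter: score each candidate training example with the model's loss and keep those on one side of a threshold. The threshold, the precomputed loss table, and the choice to keep *low*-loss examples are all assumptions here; the direction of the filter is a design choice the paper would pin down:

```python
def select_by_loss(examples, loss_fn, threshold):
    """Keep training examples whose loss is at or below `threshold`.
    Low loss is used here as a cheap, assumed proxy for facts the
    model can actually absorb."""
    return [ex for ex in examples if loss_fn(ex) <= threshold]

# Toy scorer: pretend per-example losses were precomputed in a dict.
losses = {"fact_a": 0.8, "fact_b": 3.1, "fact_c": 1.2}
kept = select_by_loss(list(losses), losses.get, threshold=1.5)
print(kept)  # ['fact_a', 'fact_c']
```

Because the only signal required is training loss, which every training loop already computes, the selection adds almost no cost, which is what makes the method practical at pretraining scale.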
Cram Less to Fit More: Rethinking Pretraining Efficiency
The core insight, "cram less to fit more," challenges conventional wisdom about LLM training. Rather than exposing models to all available facts, strategic data pruning allows models to retain more factual knowledge with fewer parameters. This suggests current LLM pretraining may be fundamentally inefficient for factual knowledge acquisition.
The method addresses a critical weakness in language models: their tendency to hallucinate and perform poorly on knowledge-intensive tasks. By ensuring models aren't overwhelmed with more information than they can store, the approach improves factual reliability without requiring larger architectures.
The findings have significant implications for LLM development, suggesting that smarter data curation could achieve the same factual performance with substantially smaller, more efficient models. Related research finds that pretraining on more data yields no significant improvement in models' ability to acquire and retain factual knowledge, further supporting the case for quality over quantity.
Key Takeaways
- Strategic data pruning enables a 110M-parameter model to memorize 1.3x more facts than standard training methods
- The smaller model matches factual performance of a 1.3B-parameter model (10x larger) trained on full datasets
- Fact accuracy drops below capacity when training data information exceeds model capacity, especially with skewed frequency distributions
- Selection method based solely on training loss makes the approach practical and scalable
- Findings suggest current LLM pretraining is inefficient for factual knowledge, with smarter data curation offering a path to smaller, more capable models