Researchers have discovered that training language models on less, carefully selected data dramatically improves their ability to memorize facts. A new study from Apple and Google shows that a 110-million-parameter model trained with strategic data pruning can match the factual knowledge of a 1.3-billion-parameter model trained conventionally.
Training Data Overwhelms Model Capacity for Facts
The research team formalized fact memorization from an information-theoretic perspective and identified why LLMs struggle with factual accuracy. They found that fact accuracy falls below the theoretical capacity limit whenever the information content of the facts in the training data exceeds what the model can store. The problem worsens when fact frequency follows a skewed distribution such as a power law, which is common in real-world datasets.
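The capacity argument can be made concrete with a back-of-the-envelope sketch. All numbers below (the bits-per-parameter budget, per-fact entropy, and fact counts) are illustrative assumptions, not figures from the paper:

```python
def capacity_gap_bits(num_facts, bits_per_fact, num_params, bits_per_param=2.0):
    """Positive result => the training facts carry more information than the
    model can store, so fact accuracy falls below the capacity limit.
    The 2 bits/parameter budget is an assumed round number for illustration."""
    demand = num_facts * bits_per_fact    # information the facts ask the model to store
    budget = num_params * bits_per_param  # rough storage budget of the model
    return demand - budget

# Illustrative numbers only: 50M facts at 24 bits each vs. a 110M-param model.
gap = capacity_gap_bits(num_facts=50_000_000, bits_per_fact=24,
                        num_params=110_000_000)
print(gap > 0)  # True: demand (1.2e9 bits) exceeds budget (2.2e8 bits)
```

In this regime the fix the paper proposes is not a bigger model but a smaller, better-chosen fact set, so that demand drops below the budget.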
Current approaches using penalty-based methods fail because aggressive penalties suppress essential facts while mild penalties get lost in the variance of accuracy rewards. The team proposed a counterintuitive solution: limit the number of facts in training data and flatten their frequency distribution.
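Both parts of that solution, limiting the number of facts and flattening their frequency distribution, can be sketched as a simple preprocessing pass over a corpus of fact occurrences. The cap value, the `Counter`-based representation, and the string fact labels are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def prune_and_flatten(fact_occurrences, max_facts, cap):
    """Keep at most `max_facts` distinct facts and cap each surviving fact's
    frequency, turning a skewed (e.g. power-law) distribution into a
    roughly uniform one."""
    counts = Counter(fact_occurrences)
    kept = [f for f, _ in counts.most_common(max_facts)]   # limit the fact set
    flattened = []
    for fact in kept:
        flattened.extend([fact] * min(counts[fact], cap))  # flatten frequencies
    return flattened

# A power-law-like toy corpus: one fact dominates.
corpus = ["paris-capital"] * 90 + ["oslo-capital"] * 8 + ["quito-capital"] * 2
out = prune_and_flatten(corpus, max_facts=2, cap=5)
print(Counter(out))  # both kept facts now appear exactly 5 times
```

The point of the sketch is the shape of the output: the rarest fact is dropped entirely, and the dominant fact no longer consumes most of the training budget.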
GPT2-Small Achieves 1.3x Improvement on Wikipedia Facts
The researchers validated their approach through two sets of experiments. On semi-synthetic datasets with high-entropy facts, their data selection method successfully boosted fact accuracy to the theoretical capacity limit. More impressively, when pretraining GPT2-Small (110 million parameters) from scratch on annotated Wikipedia:
- The model memorized 1.3x more entity facts compared to standard training
- Performance matched a 10x larger model (1.3 billion parameters) pretrained on the full dataset
- Selection was based solely on training loss, making the method practical and scalable
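The loss-based selection in the last bullet can be sketched as a plain filter: score each candidate training example with the model's loss and keep those on one side of a threshold. The threshold, the precomputed loss table, and the choice to keep *low*-loss examples are all assumptions here; the direction of the filter is a design choice the paper would pin down:

```python
def select_by_loss(examples, loss_fn, threshold):
    """Keep training examples whose loss is at or below `threshold`.
    Low loss is used here as a cheap, assumed proxy for facts the
    model can actually absorb."""
    return [ex for ex in examples if loss_fn(ex) <= threshold]

# Toy scorer: pretend per-example losses were precomputed in a dict.
losses = {"fact_a": 0.8, "fact_b": 3.1, "fact_c": 1.2}
kept = select_by_loss(list(losses), losses.get, threshold=1.5)
print(kept)  # ['fact_a', 'fact_c']
```

Because the only signal required is training loss, which every training loop already computes, the selection adds almost no cost, which is what makes the method practical at pretraining scale.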
Cram Less to Fit More: Rethinking Pretraining Efficiency
The core insight, "cram less to fit more," challenges conventional wisdom about LLM training. Rather than exposing models to all available facts, strategic data pruning allows models to retain more factual knowledge with fewer parameters. This suggests current LLM pretraining may be fundamentally inefficient for factual knowledge acquisition.
The method addresses a critical weakness in language models: their tendency to hallucinate and perform poorly on knowledge-intensive tasks. By ensuring models aren't overwhelmed with more information than they can store, the approach improves factual reliability without requiring larger architectures.
The findings have significant implications for LLM development, suggesting that smarter data curation could achieve the same factual performance with substantially smaller, more efficient models. Related research finds that pretraining on more data yields no significant improvement in models' ability to acquire and retain factual knowledge, further supporting the case for quality over quantity.
Key Takeaways
- Strategic data pruning enables a 110M-parameter model to memorize 1.3x more facts than standard training methods
- The smaller model matches factual performance of a 1.3B-parameter model (10x larger) trained on full datasets
- Fact accuracy drops below capacity when training data information exceeds model capacity, especially with skewed frequency distributions
- Selection method based solely on training loss makes the approach practical and scalable
- Findings suggest current LLM pretraining is inefficient for factual knowledge, with smarter data curation offering a path to smaller, more capable models