Researchers from UC Berkeley and Meta have developed VisionFoundry, an automated system that uses synthetic data to address persistent visual perception weaknesses in vision-language models. The approach achieves a 7% improvement on visual perception benchmarks and a 10% gain on 3D visual reasoning tasks without requiring human annotation or reference images.
Despite rapid advances in multimodal AI, vision-language models continue to struggle with fundamental visual perception tasks including spatial understanding, depth ordering, and viewpoint recognition. The VisionFoundry research demonstrates that the primary bottleneck is limited task-targeted supervision rather than inherent model limitations.
Fully Automated Synthetic Data Pipeline
VisionFoundry requires only a task keyword as input and handles the entire data generation process automatically:
- Uses large language models to generate questions, answers, and text-to-image prompts
- Synthesizes images using text-to-image models
- Verifies consistency with a proprietary vision-language model
- Produces task-aware training data tailored to specific visual perception weaknesses
- Operates without any reference images or human annotation
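The steps above can be sketched as a simple generate-synthesize-verify loop. This is a minimal illustration, not the authors' implementation: the function names, the stub model calls, and the example triple below are all hypothetical stand-ins for the LLM, text-to-image model, and verifier VLM the pipeline actually uses.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    question: str
    answer: str
    image_prompt: str

def llm_generate(task_keyword: str) -> Triple:
    # Stand-in for the LLM step: turn a task keyword into a QA pair
    # plus a text-to-image prompt. Returns a fixed example here.
    return Triple(
        question="Which object is closer to the camera?",
        answer="the red cube",
        image_prompt="a red cube in front of a blue sphere, photograph",
    )

def synthesize_image(prompt: str) -> bytes:
    # Stand-in for the text-to-image model that renders the prompt.
    return b"<image bytes>"

def vlm_is_consistent(image: bytes, triple: Triple) -> bool:
    # Stand-in for the verifier VLM, which checks that the synthesized
    # image actually supports the generated answer. Inconsistent
    # samples would be discarded; this stub accepts everything.
    return True

def generate_dataset(task_keyword: str, n: int) -> list:
    """Collect n verified (image, triple) pairs for one task keyword."""
    dataset = []
    while len(dataset) < n:
        triple = llm_generate(task_keyword)
        image = synthesize_image(triple.image_prompt)
        if vlm_is_consistent(image, triple):
            dataset.append((image, triple))
    return dataset
```

The key design point is the verification gate: because generation is fully automatic, the verifier model is the only quality control, filtering out image-answer mismatches before they enter the training set.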
The system generated the VisionFoundry-10K dataset, containing 10,000 synthetic image-question-answer triples spanning 10 distinct visual perception tasks.
Significant Performance Gains
Models trained on the VisionFoundry-10K dataset demonstrated substantial improvements on visual perception benchmarks:
- 7% improvement on MMVP, a specialized visual perception benchmark
- 10% improvement on CV-Bench-3D for 3D visual reasoning tasks
- Performance gains achieved while preserving broader multimodal capabilities
- Favorable scaling behavior as synthetic data volume increases
The results suggest that targeted synthetic supervision can address specific capability gaps without degrading general-purpose performance, a common challenge in model fine-tuning.
Implications for Vision-Language Model Development
The research challenges the assumption that visual perception weaknesses in VLMs stem from fundamental architectural limitations. Instead, the findings point to insufficient task-specific training data as the primary constraint. This insight suggests that synthetic data generation, combined with automated verification, offers a scalable path to improving vision-language models on targeted capabilities.
The VisionFoundry approach could be extended to other visual reasoning tasks and potentially to multimodal domains beyond vision-language modeling, wherever targeted capability improvements are needed without access to large-scale human-annotated datasets.
Key Takeaways
- VisionFoundry automatically generates synthetic training data for visual perception tasks using only task keywords as input
- Models trained on 10,000 synthetic examples achieve a 7% improvement on the MMVP visual perception benchmark and a 10% improvement on CV-Bench-3D
- The approach requires no reference images or human annotation, relying entirely on automated generation and verification
- Results indicate that limited task-targeted supervision, not fundamental model limitations, is the primary bottleneck for VLM visual perception
- Synthetic data shows favorable scaling behavior, suggesting continued improvements with larger datasets