Researchers have introduced a theoretical framework questioning whether sparse autoencoders (SAEs) effectively capture the geometric structure of concepts in neural networks. Posted to arXiv on April 30, 2026, the paper argues that SAEs recover continuous structures suboptimally, mixing global and local solutions in a fragmented regime the authors call dilution, which explains why manifold structure is rarely visible at the level of individual features.
Framework Distinguishes Global and Local Manifold Capture Mechanisms
Sparse autoencoders are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, growing evidence suggests that many concepts are organized along low-dimensional manifolds encoding continuous geometric relationships. The research establishes that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing the manifold across features that each selectively tile a restricted region of the underlying geometry.
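To make the distinction concrete, below is a minimal numpy sketch (our illustration, not the paper's code) of the two regimes on a toy manifold: a circle embedded in an 8-dimensional space, reconstructed once by two atoms whose span contains the whole circle and once by sixteen atoms that each tile a short arc. The dimensions, atom counts, and top-2 activation rule are assumptions chosen for the example.

```python
# Toy illustration (not the paper's code): a unit circle embedded in R^8,
# reconstructed two ways: "globally" by two atoms whose span contains
# the whole circle, and "locally" by atoms that each tile a short arc.
import numpy as np

rng = np.random.default_rng(0)
d = 8
theta = np.linspace(0, 2 * np.pi, 512, endpoint=False)

# Embed the circle in a random 2-D subspace of R^d.
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))       # orthonormal 2-frame
X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ basis.T  # (512, d)

# Global capture: two atoms spanning the manifold's plane. Codes are
# dense along the manifold, but the reconstruction is exact.
A_global = basis.T                                      # (2, d) decoder atoms
recon_global = (X @ A_global.T) @ A_global

# Local capture: K atoms aimed at arc centers; each sample activates only
# its top-2 atoms (a crude ReLU-style tiling of the circle).
K = 16
centers = np.linspace(0, 2 * np.pi, K, endpoint=False)
A_local = np.stack([np.cos(centers), np.sin(centers)], axis=1) @ basis.T
acts = np.maximum(X @ A_local.T, 0.0)                   # rectified activations
acts = acts * (acts >= np.sort(acts, axis=1)[:, [-2]])  # keep top-2 per row
recon_local = acts @ A_local / np.maximum(
    acts.sum(axis=1, keepdims=True), 1e-9)              # weighted atom average

print("global reconstruction error:", np.linalg.norm(X - recon_global))
print("local  reconstruction error:", np.linalg.norm(X - recon_local))
```

The global solution reconstructs the circle exactly from dense two-dimensional codes, while the local solution is sparse per sample but only approximates the manifold piecewise, roughly the trade-off the framework analyzes.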
Empirical Analysis Reveals Dilution Pattern Across SAE Architectures
The key empirical finding is that existing SAE architectures mix global subspace and local tiling solutions in a dilution regime rather than cleanly implementing either approach. This fragmentation explains previously puzzling observations about why manifold structure is rarely visible when examining individual features or directions. The finding motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions, as sketched below.
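One plausible shape for such a discovery method (a hedged sketch of the general idea, not the authors' algorithm) is to cluster an SAE's decoder atoms by absolute cosine similarity and flag groups whose span is low-dimensional. The function name, group count, and effective-rank score are illustrative choices; the `metric="precomputed"` argument assumes scikit-learn 1.2 or later.

```python
# Illustrative post-hoc group discovery (a sketch, not the authors'
# method): cluster decoder atoms by absolute cosine similarity, then
# check whether each cluster spans a low-dimensional subspace.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def find_atom_groups(decoder: np.ndarray, n_groups: int = 8) -> np.ndarray:
    """decoder: (n_atoms, d) matrix of SAE decoder directions."""
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    dist = 1.0 - np.abs(unit @ unit.T)      # |cosine| turned into a distance
    labels = AgglomerativeClustering(
        n_clusters=n_groups, metric="precomputed", linkage="average"
    ).fit_predict(dist)
    for g in range(n_groups):
        atoms = unit[labels == g]
        s = np.linalg.svd(atoms, compute_uv=False)
        # Participation ratio as an effective rank: many atoms with a
        # low effective rank suggest a shared low-dimensional subspace.
        eff_rank = (s.sum() ** 2) / (s ** 2).sum()
        print(f"group {g}: {atoms.shape[0]} atoms, effective rank {eff_rank:.1f}")
    return labels
```

A group of many atoms with low effective rank is consistent with local tiling of a shared manifold, whereas isolated atoms behave like classic one-direction features.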
Implications Challenge Feature-Centric Interpretability Paradigm
Current mechanistic interpretability approaches typically search for individual features or directions under the assumption that one feature corresponds to one concept. The framework suggests this paradigm may systematically miss continuous concept structure that is distributed across multiple features forming geometric objects. The authors argue that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability, fundamentally shifting how researchers approach feature extraction and concept identification.
Research Connects to Broader Geometric Concept Structure Literature
The work contributes to an active 2025-2026 research wave examining SAE limitations and geometric concept structure. Recent research has identified concept spaces containing "crystal" structures whose faces form parallelograms or trapezoids, while systems like SPARC learn unified latent spaces across architectures and modalities. The paper provides theoretical grounding for observations that individual SAE features may not cleanly correspond to semantic concepts, instead requiring analysis of feature groups and their geometric relationships.
Key Takeaways
- SAEs can capture manifolds globally via compact atom groups or locally via selective tiling, but empirically mix both in a dilution regime
- The dilution pattern explains why manifold structure is rarely visible when examining individual SAE features or directions
- Current feature-centric interpretability approaches may systematically miss continuous concept structure distributed across multiple features
- The research motivates shifting from isolated direction search toward identifying coherent geometric structures as basic interpretability units
- Findings contribute to growing evidence that the "one feature, one concept" assumption underlying much interpretability research may be fundamentally flawed