Bolek, a compact multimodal language model for molecular reasoning, was published on arXiv on May 4, 2026. The model grounds natural-language explanations in verifiable molecular features, and at 4 billion parameters it outperforms TxGemma-9B-Chat on 13 of 15 binary classification tasks despite being less than half its size.
Bolek Injects Chemical Fingerprints Into Language Model Architecture
Bolek addresses a critical gap in molecular property modeling: existing systems either return scores without rationale or produce fluent explanations weakly grounded in input molecules. The model injects Morgan fingerprint embeddings—a standard chemical representation—into an instruction-tuned text decoder built on Qwen3-4B-Instruct. Researchers fine-tuned Bolek on molecular alignment tasks including molecule description, RDKit descriptor prediction, and substructure detection, then trained it on downstream reasoning over 15 Therapeutics Data Commons (TDC) binary classification tasks using synthetic chains-of-thought anchored in concrete molecular features.
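The injection idea can be sketched in miniature: a fixed-length fingerprint bit vector is linearly projected into the decoder's embedding space and prepended to the token embeddings as a soft prefix. The function names, dimensions, and toy values below are illustrative assumptions, not the paper's implementation; a real system would use RDKit-generated Morgan fingerprints and learned projection weights.

```python
def project_fingerprint(bits, weight):
    """Linear projection of a 0/1 fingerprint into an embedding vector."""
    dim = len(weight[0])
    emb = [0.0] * dim
    for i, b in enumerate(bits):
        if b:  # only set fingerprint bits contribute
            for j in range(dim):
                emb[j] += weight[i][j]
    return emb

def prepend_prefix(token_embeddings, fp_embedding):
    """Prepend the fingerprint embedding as an extra soft 'token'."""
    return [fp_embedding] + token_embeddings

# Toy example: 8-bit fingerprint, 4-dimensional embedding space.
bits = [1, 0, 0, 1, 0, 1, 0, 0]
weight = [[0.1 * (i + j) for j in range(4)] for i in range(8)]
tokens = [[0.0, 0.0, 0.0, 0.0]]  # one placeholder token embedding
seq = prepend_prefix(tokens, project_fingerprint(bits, weight))
print(len(seq))  # the sequence is one embedding longer than the text alone
```

The decoder then attends over the fingerprint prefix exactly as it would over ordinary tokens, which is what lets the generated explanation condition on concrete molecular structure.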
Model Achieves 38% Performance Improvement While Citing Verifiable Chemical Properties
Bolek raises mean ROC/PR AUC from 0.55 to 0.76 across TDC endpoints, outperforming its Qwen3-4B-Instruct base on all endpoints in yes/no mode and on 13 of 15 tasks in chain-of-thought mode. The model's explanations are also substantially better grounded than those of baseline LLMs, citing numerical descriptors 10-100x more often per chain of thought. Cited values show strong agreement with RDKit calculations for key descriptors:
- Topological Polar Surface Area (TPSA), Molecular LogP (MolLogP), and Molecular Weight (MolWt): Spearman ρ = 0.87-0.91
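The grounding check described above compares descriptor values cited in the model's explanations against ground truth (in the paper, RDKit-computed values) via Spearman rank correlation. A minimal plain-Python sketch, assuming tie-free data and hypothetical cited values:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical TPSA values cited in explanations vs. ground truth:
cited = [60.3, 45.2, 110.8, 78.0, 32.5]
truth = [63.6, 40.5, 108.0, 81.0, 29.5]
print(spearman_rho(cited, truth))  # 1.0 here: the rank ordering is identical
```

Because Spearman correlation depends only on rank order, a model can score highly even when its cited values are slightly off in magnitude, which matches the intent of the check: verifying that the explanation tracks the true descriptor, not that it reproduces it to the decimal.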
On 15 unseen TDC classification endpoints, Bolek matches TxGemma performance on five tasks and produces non-trivial rank correlations on three held-out regression endpoints despite never training on downstream regression tasks.
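For reference, the headline ROC AUC metric used throughout these comparisons equals the probability that a randomly chosen positive example scores above a randomly chosen negative one. A minimal sketch with hypothetical scores (the O(n²) pairwise form, fine for illustration):

```python
def roc_auc(scores, labels):
    """ROC AUC via the rank/probabilistic definition:
    P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for one binary endpoint:
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_auc(scores, labels))  # 1.0: every positive outranks every negative
```

On this scale, the reported jump from 0.55 (barely above the 0.5 of random ranking) to 0.76 is a substantial gain in how reliably the model ranks actives above inactives.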
Targeted Modality Injection Enables Auditable Drug Discovery Models
The research demonstrates that grounding LLMs in domain-specific representations like chemical fingerprints, combined with reasoning supervision tied to verifiable features, can produce compact, auditable models suitable for high-stakes drug discovery applications. This approach offers an alternative to scaling model size, instead focusing on targeted architectural modifications and training objectives that enhance both performance and interpretability.
Key Takeaways
- Bolek, a 4-billion parameter multimodal LLM, outperforms TxGemma-9B-Chat (9 billion parameters) on 13 of 15 molecular classification tasks
- The model cites numerical chemical descriptors 10-100x more often than baseline LLMs, with Spearman correlations of 0.87-0.91 against RDKit ground truth
- Bolek raises mean ROC/PR AUC from 0.55 to 0.76 across Therapeutics Data Commons endpoints
- The approach injects Morgan fingerprint embeddings directly into the language model decoder, grounding explanations in verifiable molecular features
- Results suggest targeted modality injection can yield compact, auditable models suitable for high-stakes drug discovery without requiring massive scale