Bolek, a compact multimodal language model for molecular reasoning, was published on arXiv on May 4, 2026. The model grounds natural-language explanations in verifiable molecular features, and at 4 billion parameters it outperforms TxGemma-9B-Chat on 13 of 15 binary classification tasks despite being less than half its size.
Bolek Injects Chemical Fingerprints Into Language Model Architecture
Bolek addresses a critical gap in molecular property modeling: existing systems either return scores without rationale or produce fluent explanations weakly grounded in input molecules. The model injects Morgan fingerprint embeddings—a standard chemical representation—into an instruction-tuned text decoder built on Qwen3-4B-Instruct. Researchers fine-tuned Bolek on molecular alignment tasks including molecule description, RDKit descriptor prediction, and substructure detection, then trained it on downstream reasoning over 15 Therapeutics Data Commons (TDC) binary classification tasks using synthetic chains-of-thought anchored in concrete molecular features.
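The injection idea can be sketched in miniature: a fixed-length fingerprint bit vector is linearly projected into the decoder's embedding space and prepended to the token embeddings as a soft prefix. The function names, dimensions, and toy values below are illustrative assumptions, not the paper's implementation; a real system would use RDKit-generated Morgan fingerprints and learned projection weights.

```python
def project_fingerprint(bits, weight):
    """Linear projection of a 0/1 fingerprint into an embedding vector."""
    dim = len(weight[0])
    emb = [0.0] * dim
    for i, b in enumerate(bits):
        if b:  # only set fingerprint bits contribute
            for j in range(dim):
                emb[j] += weight[i][j]
    return emb

def prepend_prefix(token_embeddings, fp_embedding):
    """Prepend the fingerprint embedding as an extra soft 'token'."""
    return [fp_embedding] + token_embeddings

# Toy example: 8-bit fingerprint, 4-dimensional embedding space.
bits = [1, 0, 0, 1, 0, 1, 0, 0]
weight = [[0.1 * (i + j) for j in range(4)] for i in range(8)]
tokens = [[0.0, 0.0, 0.0, 0.0]]  # one placeholder token embedding
seq = prepend_prefix(tokens, project_fingerprint(bits, weight))
print(len(seq))  # the sequence is one embedding longer than the text alone
```

The decoder then attends over the fingerprint prefix exactly as it would over ordinary tokens, which is what lets the generated explanation condition on concrete molecular structure.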
Model Achieves 38% Performance Improvement While Citing Verifiable Chemical Properties
Bolek raises mean ROC/PR AUC from 0.55 to 0.76 across TDC endpoints, outperforming its Qwen3-4B-Instruct base on all endpoints in yes/no mode and on 13 of 15 tasks in chain-of-thought mode. The model's explanations are also substantially better grounded than those of baseline LLMs, citing numerical descriptors 10-100x more often per chain of thought. Cited values show strong agreement with RDKit calculations for key descriptors:
- Topological Polar Surface Area (TPSA), Molecular LogP (MolLogP), and Molecular Weight (MolWt): Spearman ρ = 0.87-0.91
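The grounding check described above compares descriptor values cited in the model's explanations against ground truth (in the paper, RDKit-computed values) via Spearman rank correlation. A minimal plain-Python sketch, assuming tie-free data and hypothetical cited values:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical TPSA values cited in explanations vs. ground truth:
cited = [60.3, 45.2, 110.8, 78.0, 32.5]
truth = [63.6, 40.5, 108.0, 81.0, 29.5]
print(spearman_rho(cited, truth))  # 1.0 here: the rank ordering is identical
```

Because Spearman correlation depends only on rank order, a model can score highly even when its cited values are slightly off in magnitude, which matches the intent of the check: verifying that the explanation tracks the true descriptor, not that it reproduces it to the decimal.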
On 15 unseen TDC classification endpoints, Bolek matches TxGemma performance on five tasks and produces non-trivial rank correlations on three held-out regression endpoints despite never training on downstream regression tasks.
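For reference, the headline ROC AUC metric used throughout these comparisons equals the probability that a randomly chosen positive example scores above a randomly chosen negative one. A minimal sketch with hypothetical scores (the O(n²) pairwise form, fine for illustration):

```python
def roc_auc(scores, labels):
    """ROC AUC via the rank/probabilistic definition:
    P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for one binary endpoint:
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_auc(scores, labels))  # 1.0: every positive outranks every negative
```

On this scale, the reported jump from 0.55 (barely above the 0.5 of random ranking) to 0.76 is a substantial gain in how reliably the model ranks actives above inactives.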
Targeted Modality Injection Enables Auditable Drug Discovery Models
The research demonstrates that grounding LLMs in domain-specific representations like chemical fingerprints, combined with reasoning supervision tied to verifiable features, can produce compact, auditable models suitable for high-stakes drug discovery applications. This approach offers an alternative to scaling model size, instead focusing on targeted architectural modifications and training objectives that enhance both performance and interpretability.
Key Takeaways
- Bolek, a 4-billion parameter multimodal LLM, outperforms TxGemma-9B-Chat (9 billion parameters) on 13 of 15 molecular classification tasks
- The model cites numerical chemical descriptors 10-100x more often than baseline LLMs, with Spearman correlations of 0.87-0.91 against RDKit ground truth
- Bolek raises mean ROC/PR AUC from 0.55 to 0.76 across Therapeutics Data Commons endpoints
- The approach injects Morgan fingerprint embeddings directly into the language model decoder, grounding explanations in verifiable molecular features
- Results suggest targeted modality injection can yield compact, auditable models suitable for high-stakes drug discovery without requiring massive scale