Researchers have introduced TOSSS (Two-Option Secure Snippet Selection), a new benchmark that evaluates how well large language models distinguish between secure and vulnerable code. Published on arXiv, the study tested 14 widely used models on C/C++ and Java code, revealing scores ranging from 0.48 to 0.89, which indicates that some models perform no better than chance at identifying security risks. The CVE-based approach provides an extensible framework that can integrate newly disclosed vulnerabilities over time.
Wide Performance Gap Shows Security Awareness Varies Dramatically Across Models
The benchmark presents LLMs with paired code examples—one secure, one vulnerable—and evaluates their selection accuracy. The scoring system ranges from 0 to 1, where a score of 1 means the model always selects the secure snippet and 0 means it always selects the vulnerable one.
- 14 models tested (both open-source and closed-source)
- Scores ranged from 0.48 to 0.89
- Evaluation covered C/C++ and Java code
- Some models showed near-random performance at identifying secure code
CVE Database Integration Enables Continuous Updates
Researchers Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, and Roos Wensveen designed TOSSS to address a limitation of existing security benchmarks, which cover only a narrow range of vulnerabilities. Because TOSSS is built on the CVE (Common Vulnerabilities and Exposures) database, the benchmark can be extended to incorporate vulnerabilities as they are disclosed.
This CVE-based approach offers three key advantages:
- Remains current with emerging security threats
- Provides broader vulnerability coverage than fixed benchmarks
- Continuously expands as new vulnerabilities are disclosed
Potential Standard for Security-Focused Model Evaluation
The paper suggests that LLM providers could publish TOSSS scores as a complementary security-focused metric alongside existing benchmark scores. This would give organizations visibility into how security-aware different models are when used in software development workflows—a critical consideration as AI-assisted coding becomes more prevalent.
The wide score range, from 0.48 to 0.89, has major implications for organizations integrating LLMs into development pipelines. A model scoring near 0.48 does no better than a coin flip on this two-option task and so provides minimal security value, while one approaching 0.89 demonstrates strong awareness of common vulnerability patterns.
Key Takeaways
- TOSSS benchmark tests 14 LLMs on ability to distinguish secure from vulnerable code snippets
- Model scores ranged from 0.48 to 0.89, with some models performing no better than random chance
- CVE-based approach allows continuous integration of newly disclosed vulnerabilities
- Benchmark covers C/C++ and Java code with extensible framework for additional languages
- Researchers propose TOSSS scores as complementary security metric for LLM providers