Researchers have introduced TOSSS (Two-Option Secure Snippet Selection), a new benchmark that evaluates how well large language models distinguish between secure and vulnerable code. Published on arXiv, the study tested 14 widely used models on C/C++ and Java code, revealing scores ranging from 0.48 to 0.89, which indicates that some models perform no better than chance at identifying security risks. The CVE-based approach provides an extensible framework that can integrate newly disclosed vulnerabilities over time.
Wide Performance Gap Shows Security Awareness Varies Dramatically Across Models
The benchmark presents LLMs with paired code examples—one secure, one vulnerable—and evaluates their selection accuracy. The scoring system ranges from 0 to 1, where a score of 1 means the model always selects the secure snippet and 0 means it always selects the vulnerable one.
- 14 models tested (both open-source and closed-source)
- Scores ranged from 0.48 to 0.89
- Evaluation covered C/C++ and Java code
- Some models showed near-random performance at identifying secure code
CVE Database Integration Enables Continuous Updates
Researchers Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, and Roos Wensveen designed TOSSS to address a limitation of existing security benchmarks, which cover only a narrow range of vulnerabilities. Because TOSSS is built on the CVE (Common Vulnerabilities and Exposures) database, the benchmark can be extended to incorporate vulnerabilities as they are disclosed.
This CVE-based approach offers three key advantages:
- Remains current with emerging security threats
- Provides broader vulnerability coverage than fixed benchmarks
- Continuously expands as new vulnerabilities are disclosed
Potential Standard for Security-Focused Model Evaluation
The paper suggests that LLM providers could publish TOSSS scores as a complementary security-focused metric alongside existing benchmark scores. This would give organizations visibility into how security-aware different models are when used in software development workflows—a critical consideration as AI-assisted coding becomes more prevalent.
The wide score range, from 0.48 to 0.89, has major implications for organizations integrating LLMs into development pipelines. A model scoring near 0.48 does no better than a coin flip on this two-option task and so provides minimal security value, while one approaching 0.89 demonstrates strong awareness of common vulnerability patterns.
Key Takeaways
- TOSSS benchmark tests 14 LLMs on ability to distinguish secure from vulnerable code snippets
- Model scores ranged from 0.48 to 0.89, with some models performing no better than random chance
- CVE-based approach allows continuous integration of newly disclosed vulnerabilities
- Benchmark covers C/C++ and Java code with extensible framework for additional languages
- Researchers propose TOSSS scores as complementary security metric for LLM providers