CWE Bench

CWE Bench measures whether coding agents can find and fix real cybersecurity vulnerabilities in real codebases. It spans 221 audit-and-patch tasks across 150+ CWEs, in languages ranging from C and Python to Java, with repositories as large as 100K lines. Each task is graded by essential functional checks that confirm the flaw is actually closed, and the tasks target bugs where frontier models still fail.

Cybersecurity benchmark · by Collinear AI

Can frontier agents solve 150+ CWEs of in-the-wild vulnerabilities?

Early results: pass@1 vs pass@4 across frontier coding agents on CWE Bench.

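[Figure: bar chart of pass rate (%, axis 0 to 100), with pass@1 and pass@4 bars for each agent: Gemini-3.5-Flash with agy, GPT-5.5 Codex, Opus-4-8 with Claude Code, and Fable 5 with Claude Code (24 samples only). Bar values in the order given: 21.6, 29.4, 29.9, 40.3, 25.6, 34.8, 32.3, 45.8.]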

Higher is better. Fable 5 evaluated on a 24-task subset; all others on the full set.
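
The page does not say how pass@1 and pass@4 are computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over tasks. This is an assumption, not CWE Bench's documented method. The function names and the example data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Assumption: the standard unbiased estimator, not necessarily the one
    CWE Bench uses.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n attempts contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over all tasks, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: (attempts, passing attempts) for three tasks.
example = [(4, 0), (4, 1), (4, 4)]
print(f"pass@1 = {benchmark_pass_at_k(example, 1):.1f}%")
print(f"pass@4 = {benchmark_pass_at_k(example, 4):.1f}%")
```

With four attempts per task, pass@4 under this estimator is the fraction of tasks solved by at least one attempt. It can therefore never fall below pass@1, which matches the pattern in the reported results.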