Cybersecurity
CWE Bench
CWE Bench measures whether coding agents can find and fix real security vulnerabilities in real codebases. It spans 221 audit-and-patch tasks across 150+ CWEs, in codebases ranging from C and Python projects to 100K-line Java repositories. Each task is graded by essential functional checks that confirm the flaw is actually closed, and the set focuses on bugs where frontier models still fail.
Can frontier agents fix in-the-wild vulnerabilities spanning 150+ CWEs?
Early results: pass@1 vs pass@4 across frontier coding agents on CWE Bench.
Higher is better. Fable 5 was evaluated on a 24-task subset; all other agents were evaluated on the full 221-task set.
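This page does not say how pass@1 and pass@4 are computed. One common convention is the unbiased pass@k estimator, which averages over tasks the probability that at least one of k sampled attempts passes. The sketch below assumes that convention; the function names and the sample data are hypothetical and are not taken from CWE Bench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Assumes the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    CWE Bench's actual scoring procedure may differ.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing attempts: any k-sample includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n_attempts, n_passed) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three tasks, four attempts each.
sample = [(4, 0), (4, 1), (4, 4)]
p1 = benchmark_pass_at_k(sample, k=1)
p4 = benchmark_pass_at_k(sample, k=4)
```

With four attempts per task, pass@4 reduces to "did any attempt pass", while pass@1 is the mean per-attempt success rate, so pass@4 can never be lower than pass@1 on the same runs.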