CWE Bench

CWE Bench measures whether coding agents can find and fix real security vulnerabilities in real codebases. It spans 221 audit-and-patch tasks across 150+ CWEs, in languages from C and Python to Java, including 100K-line Java repositories. Each task is graded by functional checks that confirm the flaw is actually closed, and the tasks focus on bugs where frontier models still fail.

Cybersecurity benchmark · by Collinear AI

Can frontier agents fix in-the-wild vulnerabilities spanning 150+ CWEs?

Figure: Early results on CWE Bench, comparing pass@1 and pass@4 across frontier coding agents.

Higher is better. Fable 5 was evaluated on a 24-task subset; all other agents were evaluated on the full set.
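
The page does not say how pass@1 and pass@4 are computed. As a rough sketch, assuming the common convention that a task counts as solved at k if at least one of k independent attempts passes its checks, the two numbers could be derived as below. The function name `pass_at_k`, the task IDs, and the outcomes are all hypothetical and are not CWE Bench data.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    Assumed definition for illustration; the benchmark may use a
    different estimator.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("no tasks to score")

    solved = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        # A task is solved at k if any of its first k attempts passed.
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical outcomes: one list of pass/fail results per task,
# one entry per attempt.
example = {
    "task-a": [True, False, True, True],
    "task-b": [False, False, True, False],
    "task-c": [False, False, False, False],
    "task-d": [False, True, False, False],
}

p1 = pass_at_k(example, 1)  # only task-a passes on its first attempt
p4 = pass_at_k(example, 4)  # task-a, task-b, and task-d pass within four
print(f"pass@1 = {p1:.2f}, pass@4 = {p4:.2f}")
```

Under this definition pass@4 can never be lower than pass@1 for the same runs, so the gap between the two indicates how much an agent gains from extra attempts.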