Corpus: 433 cases across 10 languages, 432 evaluated (1 disabled). Per-run JSON lands in `tests/benchmark/results/` (`latest.json` plus dated snapshots). See `README.md` for what the scoring modes mean and how to run a subset.
The corpus is mostly synthetic 8-20 line fixtures, one vulnerability or one safe pattern per file. A smaller real-CVE replay set under `cve_corpus/` covers 18 published CVEs across all 10 languages. Both contribute to the headline numbers.
## Real CVE coverage
Each entry is a real, disclosed CVE reduced to a minimal reproducer, with one vulnerable and one patched fixture per CVE. The vulnerable fixture must produce a finding for the disclosed sink class. The patched fixture must produce zero findings.
| CVE | Language | Project | License | Class | Status |
Fixtures are minimal reproducers of the unsafe pattern, not verbatim upstream code.
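The pass rule for a vulnerable/patched pair can be sketched as below. This is only an illustration of the two conditions stated above, not the harness's actual code; the `Finding` record and the `cve_pair_passes` helper are hypothetical names, and the real finding schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    # Hypothetical finding record; the scanner's real output schema may differ.
    sink_class: str  # e.g. "CMDI", "XSS", "SSRF"
    line: int


def cve_pair_passes(
    vulnerable: Iterable[Finding],
    patched: List[Finding],
    disclosed_class: str,
) -> bool:
    """Return True when a vulnerable/patched fixture pair meets both rules."""
    # Rule 1: the vulnerable fixture yields at least one finding
    # for the sink class named in the CVE disclosure.
    hit = any(f.sink_class == disclosed_class for f in vulnerable)
    # Rule 2: the patched fixture yields no findings at all, of any class.
    clean = len(patched) == 0
    return hit and clean
```

Under this reading, a vulnerable fixture that only triggers an unrelated class still fails the pair, and so does a patched fixture that triggers anything.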
## CI floor
CI fails the build if rule-level precision drops below 0.861, recall below 0.944, or F1 below 0.901. Floors sit roughly 8 percentage points below the live baseline. A single-case flip is about 0.6 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a class-level regression. Floors only move up, and only when a durable improvement lands. Never relax them to paper over a regression.
The gate runs in the `benchmark-gate` job in `.github/workflows/ci.yml`. Thresholds are encoded at the bottom of `tests/benchmark_test.rs`.
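For readers who want to see the arithmetic behind the gate, here is a minimal sketch in Python. The real thresholds live in `tests/benchmark_test.rs`; the function names and the TP/FP/FN counts used in the usage note are hypothetical, and only the three floor values come from this document.

```python
# Floors quoted in this section.
FLOORS = {"precision": 0.861, "recall": 0.944, "f1": 0.901}


def rule_level_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def gate_failures(metrics: dict, floors: dict = FLOORS) -> list:
    """Return the names of all metrics that fall below their floor."""
    return [name for name, floor in floors.items() if metrics[name] < floor]
```

With made-up counts, `gate_failures(rule_level_metrics(170, 12, 4))` returns an empty list, while raising the misses to 20 (`rule_level_metrics(170, 12, 20)`) pushes recall under its floor and the gate reports `["recall"]`.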
## Recent changes
Most recent first. Metrics are rule-level on the corpus size at that point.
| Date | Change | Corpus size | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| 2026-03-21 | Constant-arg suppression at AST and CFG level | 95 | 0.654 | 0.964 | 0.779 |
| 2026-03-21 | Bare `exec`/`execSync` as JS CMDI sinks; Python `Template` as XSS sink | 95 | 0.624 | 0.964 | 0.757 |
| 2026-03-21 | First baseline after symbolic-strings work | 95 | 0.620 | 0.891 | 0.731 |
## Known limitations
These show up across multiple corpora and aren't fully fixed yet.
- **Variable-receiver method calls** (`client.send(...)` vs `HttpClient.send(...)`) miss without an inferred receiver type. Type-aware callee resolution closes most cases; some residuals remain.
- **Arbitrary import aliases** (`from flask import request as r`) aren't traced. Only explicitly listed aliases resolve.
- **URL parsing isn't credited as SSRF sanitization.** Allowlist checks in conditions are recognised; call-site sanitizers aren't.
- **Rust unguarded-sink** still fires for shell-escape sinks when a source is in scope but not flowing to the sink arg. Intentional for high-risk classes.
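
To make the import-alias limitation above concrete, the sketch below shows what tracing an arbitrary alias involves for Python sources, using the standard `ast` module. It is an illustration under simplifying assumptions (module-level, non-relative imports only), not the scanner's implementation; `import_aliases` and `resolve` are hypothetical helper names.

```python
import ast


def import_aliases(source: str) -> dict:
    """Map each local name bound by an import to its fully qualified origin."""
    aliases = {}
    for node in ast.walk(ast.parse(source)):
        # `from pkg import name as alias`; relative imports are skipped here.
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                local = alias.asname or alias.name
                aliases[local] = f"{node.module}.{alias.name}"
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    # `import pkg.mod as alias` binds alias to pkg.mod.
                    aliases[alias.asname] = alias.name
                else:
                    # `import pkg.mod` binds only the top-level name `pkg`.
                    top = alias.name.split(".")[0]
                    aliases[top] = top
    return aliases


def resolve(dotted: str, aliases: dict) -> str:
    """Rewrite the leading name of a dotted expression to its origin."""
    head, _, rest = dotted.partition(".")
    origin = aliases.get(head, head)
    return f"{origin}.{rest}" if rest else origin
```

For the example in the list, `import_aliases("from flask import request as r")` maps `r` to `flask.request`, so `resolve("r.args", ...)` yields `flask.request.args`, the name a source rule would need to match.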