Benchmark Results

Current baseline (2026-05-02):

Metric	File-level	Rule-level	CI floor
Precision	1.000	1.000	0.861
Recall	1.000	1.000	0.944
F1	1.000	1.000	0.901

Corpus: 507 cases across 10 languages, 504 evaluated (3 disabled). Per-run JSON lands in tests/benchmark/results/ (latest.json plus dated snapshots). See README.md for what the scoring modes mean and how to run a subset.

The corpus is mostly synthetic 8-20 line fixtures, one vulnerability or one safe pattern per file. A smaller real-CVE replay set under cve_corpus/ covers 30 published advisories across all 10 languages. Both contribute to the headline numbers.

Real CVE coverage

Real disclosed CVEs reduced to minimal reproducers, vulnerable + patched pair per CVE. Vulnerable fixtures must produce a finding for the disclosed sink class. Patched fixtures must produce zero findings.

CVE	Language	Project	License	Class	Status
CVE-2023-48022	Python	Ray	Apache-2.0	CMDI	detected
CVE-2017-18342	Python	PyYAML	MIT	Deserialization	detected
CVE-2025-69662	Python	geopandas	BSD-3-Clause	SQL Injection	detected
CVE-2026-33626	Python	LMDeploy	Apache-2.0	SSRF	detected
CVE-2024-23334	Python	aiohttp	Apache-2.0	path_traversal	detected
CVE-2023-6568	Python	MLflow	Apache-2.0	XSS	detected
CVE-2024-21513	Python	LangChain Experimental	MIT	code_exec	detected
CVE-2019-14939	JavaScript	mongo-express	MIT	code_exec	detected
CVE-2025-64430	JavaScript	Parse Server	Apache-2.0	SSRF	detected
CVE-2023-22621	JavaScript	Strapi	MIT	code_exec (SSTI)	detected
CVE-2023-26159	TypeScript	follow-redirects	MIT	SSRF	detected
GHSA-4x48-cgf9-q33f	TypeScript	Novu	MIT	SSRF	detected
CVE-2022-30323	Go	hashicorp/go-getter	MPL-2.0	CMDI	detected
CVE-2023-3188	Go	owncast	MIT	SSRF	detected
CVE-2024-31450	Go	owncast	MIT	path_traversal	detected
CVE-2026-41422	Go	daptin	LGPL-3.0	sql_injection	detected
CVE-2015-7501	Java	Apache Commons Collections	Apache-2.0	Deserialization	detected
CVE-2017-12629	Java	Apache Solr	Apache-2.0	CMDI	detected
CVE-2022-1471	Java	SnakeYAML	Apache-2.0	Deserialization	detected
CVE-2022-42889	Java	Apache Commons Text	Apache-2.0	code_exec	detected
GHSA-h8cj-hpmg-636v	Java	Appsmith	Apache-2.0	sql_injection	detected
CVE-2013-0156	Ruby	Ruby on Rails	MIT	Deserialization	detected
CVE-2020-8130	Ruby	Rake	MIT	CMDI	detected
CVE-2021-21288	Ruby	CarrierWave	MIT	SSRF	detected
CVE-2023-38337	Ruby	rswag	MIT	path_traversal	detected
CVE-2017-9841	PHP	PHPUnit	BSD-3-Clause	code_exec	detected
CVE-2018-15133	PHP	Laravel	MIT	Deserialization	detected
CVE-2026-33486	PHP	Roadiz CMS	MIT	SSRF	detected
CVE-2018-20997	Rust	tar-rs	MIT OR Apache-2.0	path_traversal	detected
CVE-2022-36113	Rust	cargo	MIT OR Apache-2.0	path_traversal	detected
CVE-2023-42456	Rust	sudo-rs	Apache-2.0	path_traversal	detected
CVE-2024-24576	Rust	Rust stdlib	MIT OR Apache-2.0	CMDI	detected
CVE-2024-32884	Rust	gitoxide	Apache-2.0 OR MIT	CMDI	detected
CVE-2025-53549	Rust	matrix-rust-sdk	Apache-2.0	SQL Injection	detected
CVE-2016-3714	C	ImageMagick (ImageTragick)	ImageMagick License	CMDI	detected
CVE-2017-1000117	C	git (ssh:// argv injection)	GPL-2.0	cmdi (argv-inj)	deferred
CVE-2019-18634	C	sudo (pwfeedback)	ISC	memory_safety	detected
CVE-2019-13132	C++	ZeroMQ libzmq	MPL-2.0	memory_safety	detected
CVE-2022-1941	C++	Protocol Buffers	BSD-3-Clause	memory_safety	detected
CVE-2026-25544	TypeScript	Payload (Drizzle adapter)	MIT	sql_injection	deferred
CVE-2026-42353	JavaScript	i18next-http-middleware	MIT	path_traversal	detected

Deferred entries are real bugs Nyx can't yet detect. The fixture stays committed with disabled: true in ground truth so the gap remains visible.

How CVEs get picked

Publicly disclosed with a stable advisory link.
Class Nyx already has a rule for, so the vulnerable fixture asserts on a concrete rule ID, not just a generic taint flow.
Reducible to roughly 30 lines without hiding the disclosed sink shape.
Permissive upstream license (MIT, Apache, BSD, MPL, ISC, ImageMagick).

Fixtures are minimal reproducers of the unsafe pattern, not verbatim upstream code.

CI floor

CI fails the build if rule-level precision drops below 0.861, recall below 0.944, or F1 below 0.901. Floors sit roughly 8 percentage points below the live baseline. A single-case flip is about 0.6 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a class-level regression. Floors only move up, when a durable improvement lands. Never relax them to paper over a regression.

The gate runs in the benchmark-gate job in .github/workflows/ci.yml. Thresholds are encoded at the bottom of tests/benchmark_test.rs.

Recent changes

Most recent first. Metrics are rule-level on the corpus size at that point.

Date	Change	Corpus	P	R	F1
2026-05-04	C cvehunt session-0014: CVE-2017-1000117 (git ssh:// hostname-as-argv injection) added in corpus disabled — three-layer C engine gap: (a) array-element taint propagation through `args[i] = ssh_host;` writes, (b) missing `c.cmdi.exec*` AST patterns in `src/patterns/c.rs`, (c) sanitizer recognition of the upstream `if (ssh_host[0] == '-') die(...)` dash-prefix guard	565	1.000	1.000	1.000
2026-05-04	JS/TS array-method validator-callback narrowing (`try_array_method_validator_callback_narrowing` in `src/taint/ssa_transfer/mod.rs`) — `<arr>.filter(<isSafeXxx>)` / `.find` / `.findLast` strips `Cap::all()` from the call result when the callback resolves to a `BooleanTrueIsValid` validator; CVE-2026-42353 (i18next-http-middleware path traversal) re-enabled in ground truth, deferred queue cleared	563	1.000	1.000	1.000
2026-05-04	JS/TS ternary-RHS source-classification fix in `src/cfg/conditions.rs::lower_ternary_branch` (segment-strip first_member_label on the branch AST) — `let arr = cond ? req.query.lng : "";` now propagates taint through the diamond's join phi instead of lowering both branches to labelless Assign-with-empty-uses; CVE-2026-42353 (i18next-http-middleware path traversal / SSRF) added in corpus disabled — needs Array.prototype.filter(known_validator_callback) precision bridge	561	1.000	1.000	1.000
2026-05-04	PHP class-method body taint analysis (`declaration_list` / `interface_declaration` / `trait_declaration` / `enum_declaration` mapped to `Kind::Block` in `src/labels/php.rs`); PHP `unary_op_expression` recognised as negation in `detect_negation`; camelCase normalisation in `classify_condition` so `isSafeRemoteUrl(x)` classifies as ValidationCall the same as `is_safe_remote_url(x)`; PHP `$`-sigil stripping in `extract_validation_target`; `fopen` added as PHP SSRF sink; CVE-2026-33486 (roadiz/documents `DownloadedFile::fromUrl(file://)` SSRF/LFI) added	555	1.000	1.000	1.000
2026-05-04	Python Tier B `py.xss.make_response_format` AST pattern (Flask `make_response(<f-string>)` / `make_response(<concat>)`); CVE-2023-6568 (mlflow reflected XSS) and CVE-2024-21513 (langchain VectorSQLDatabaseChain `_try_eval` over DB rows) added	550	1.000	1.000	1.000
2026-05-03	Go for-range loop binding now defined from `range_clause` child of `for_statement` (was: tree-sitter wraps the binding/iterable on a child node; only direct `left`/`right` fields were consulted, so taint never reached the loop binding). gin sources extended to `c.QueryArray` / `c.GetQueryArray` / `c.PostFormArray` / `c.GetPostFormArray`. goqu raw SQL literal builders `goqu.L` / `goqu.Lit` recognised as SQL_QUERY sinks. CVE-2026-41422 (daptin aggregate API) detected	521	1.000	1.000	1.000
2026-05-02	TS regex-allowlist `<regex>.test(value)` / `<pattern>.test(value)` recognised as ValidationCall whose target is the first arg (overrides default receiver-as-target); conservative on receiver names so non-regex `*.test()` callees stay Unknown. CVE-2026-25544 (Payload drizzle SQL injection) lands in corpus disabled — needs validated-flow propagation through SSA derivation / helper-summary returns	499	1.000	1.000	1.000
2026-05-02	JS arrow `assignment_pattern` default-param extraction + JS object-literal kwarg fallback for gated sinks + double-call (`f()(x)`) chained-inner rebinding; lodash `_.template` modeled as gated CODE_EXEC sink suppressed by `{ evaluate: false }`; CVE-2023-22621 (Strapi SSTI) detected	494	—	—	—
2026-05-02	`strings.ReplaceAll` recognised as CMDi sanitiser in chain-wrapper / call-site-replace shapes; clears `go-safe-009` (last open corpus FP); aggregate rule-level reaches P=R=F1=1.000	492	1.000	1.000	1.000
2026-05-01	PathFact opaque-prefix-lock (`canonicalise + start_with?(<expr>)` recognised across Ruby/Python/JS) + `is_path_traversal_safe` predicate + negated-form polarity flip on assertion narrowing; rswag CVE-2023-38337 detected	490	0.972	0.992	0.982
2026-05-01	Ruby `OpenURI.open_uri` SSRF sink + inner-call fallback for statement-level Ruby calls (`YAML.safe_load(File.read(x))` shape now classifies); CVE-2021-21288 (CarrierWave) detected	482	0.972	0.992	0.982
2026-04-29	Java SnakeYAML + Text4Shell patterns; CVE-2022-1471 and CVE-2022-42889 detected	449	0.996	1.000	0.998
2026-04-29	Indirect-validator branch narrowing (`const err = validate(x); if (err) throw …;`) + helper-summary all_validated propagation; Novu GHSA-4x48-cgf9-q33f detected	445	0.991	1.000	0.995
2026-04-29	Python f-string SQLi pattern + bindparams sanitizer + HttpClient SSRF rules; CVE-2025-69662 (geopandas) and CVE-2026-33626 (LMDeploy) detected	439	0.991	1.000	0.995
2026-04-29	Phantom-Param-aware field suppression: CVE-2023-3188 detected, FP guards hold	432	0.995	1.000	0.998
2026-04-28	Ruby bare `Kernel#open` CMDI sink, exact-match sigil on label matchers	428	0.995	1.000	0.998
2026-04-28	Go SSRF/FILE_IO sink expansion (`http.DefaultClient.*`, `os.Remove`/`WriteFile`) plus Decode-writeback container op	426	0.995	1.000	0.998
2026-04-27	JS chained-method inner-gate classification (`http.get(u, cb).on(...)`)	422	0.994	1.000	0.997
2026-04-23	Auth FP remediation: 10 Rust ownership-check fixtures wired to corpus	305	0.946	0.994	0.970
2026-04-23	C and C++ added as first-class CVE-corpus languages (5 new CVE pairs)	295	0.945	0.994	0.969
2026-04-23	Go, Java, Ruby, PHP, plus second Python CVE pair	285	0.944	0.994	0.968
2026-04-23	Real-CVE replay corpus seeded (Python, JS, TS, one CVE per language)	273	0.942	0.994	0.967
2026-04-22	Cross-file points-to summaries, SCC joint fixed-point, backwards taint	273	0.940	0.994	0.966
2026-04-22	Cross-file context-sensitive inline taint (k=1)	270	0.940	0.994	0.966
2026-04-20	Rust weak-spot fixes across FILE_IO, SSRF, SQL, DESERIALIZE sink families	262	0.906	0.994	0.948
2026-04-20	TypeScript weak-spot fixes, Fastify framework detection, TSX/JSX grammar	262	0.899	0.981	0.938
2026-04-20	Rust corpus expansion: honest FNs in classes lacking Rust rules	262	0.891	0.961	0.925
2026-04-20	TypeScript corpus 0 to 32 cases across 12 vuln classes	246	0.904	0.986	0.944
2026-03-24	Benchmark expansion: C, C++, Rust as first-class; +73 cases	214	0.827	0.950	0.885
2026-03-22	Cross-file SSA validation, multi-file directory cases	141	0.840	0.975	0.903
2026-03-22	Ruby corpus 1 to 21 cases across 8 vuln classes	123	0.821	0.986	0.896
2026-03-22	SSA lowering hardening (PHP closures, Python try/except, exception edges)	103	0.841	0.983	0.906
2026-03-21	SSRF semantic completion (axios, got, undici, httpx, Net::HTTP, HTTParty)	103	0.671	0.966	0.792
2026-03-21	Constant-arg suppression at AST and CFG level	95	0.654	0.964	0.779
2026-03-21	Bare `exec`/`execSync` as JS CMDI sinks; Python `Template` as XSS sink	95	0.624	0.964	0.757
2026-03-21	First baseline after symbolic-strings work	95	0.620	0.891	0.731

Known limitations

These show up across multiple corpora and aren't fully fixed yet.

Variable-receiver method calls (client.send(...) vs HttpClient.send(...)) miss without an inferred receiver type. Type-aware callee resolution closes most cases; some residuals remain.
Arbitrary import aliases (from flask import request as r) aren't traced. Only explicitly listed aliases resolve.
URL-parsing isn't credited as SSRF sanitization. Allowlist checks in conditions are recognised; call-site sanitizers aren't.
Rust unguarded-sink still fires for shell-escape sinks when a source is in scope but not flowing to the sink arg. Intentional for high-risk classes.
Rust negative-validation patterns (contains dominators, match-arm guards) aren't recognised yet.
DNS rebinding and async-callback flows are out of scope for static analysis without runtime context.

16 KiB Raw Blame History