nyx/tests/benchmark/RESULTS.md
2026-05-04 19:58:04 -04:00

16 KiB

Benchmark Results

Current baseline (2026-05-02):

Metric File-level Rule-level CI floor
Precision 1.000 1.000 0.861
Recall 1.000 1.000 0.944
F1 1.000 1.000 0.901

Corpus: 507 cases across 10 languages, 504 evaluated (3 disabled). Per-run JSON lands in tests/benchmark/results/ (latest.json plus dated snapshots). See README.md for what the scoring modes mean and how to run a subset.

The corpus is mostly synthetic 8-20 line fixtures, one vulnerability or one safe pattern per file. A smaller real-CVE replay set under cve_corpus/ covers 30 published advisories across all 10 languages. Both contribute to the headline numbers.

Real CVE coverage

Real disclosed CVEs reduced to minimal reproducers, vulnerable + patched pair per CVE. Vulnerable fixtures must produce a finding for the disclosed sink class. Patched fixtures must produce zero findings.

CVE Language Project License Class Status
CVE-2023-48022 Python Ray Apache-2.0 CMDI detected
CVE-2017-18342 Python PyYAML MIT Deserialization detected
CVE-2025-69662 Python geopandas BSD-3-Clause SQL Injection detected
CVE-2026-33626 Python LMDeploy Apache-2.0 SSRF detected
CVE-2024-23334 Python aiohttp Apache-2.0 path_traversal detected
CVE-2023-6568 Python MLflow Apache-2.0 XSS detected
CVE-2024-21513 Python LangChain Experimental MIT code_exec detected
CVE-2019-14939 JavaScript mongo-express MIT code_exec detected
CVE-2025-64430 JavaScript Parse Server Apache-2.0 SSRF detected
CVE-2023-22621 JavaScript Strapi MIT code_exec (SSTI) detected
CVE-2023-26159 TypeScript follow-redirects MIT SSRF detected
GHSA-4x48-cgf9-q33f TypeScript Novu MIT SSRF detected
CVE-2022-30323 Go hashicorp/go-getter MPL-2.0 CMDI detected
CVE-2023-3188 Go owncast MIT SSRF detected
CVE-2024-31450 Go owncast MIT path_traversal detected
CVE-2026-41422 Go daptin LGPL-3.0 sql_injection detected
CVE-2015-7501 Java Apache Commons Collections Apache-2.0 Deserialization detected
CVE-2017-12629 Java Apache Solr Apache-2.0 CMDI detected
CVE-2022-1471 Java SnakeYAML Apache-2.0 Deserialization detected
CVE-2022-42889 Java Apache Commons Text Apache-2.0 code_exec detected
GHSA-h8cj-hpmg-636v Java Appsmith Apache-2.0 sql_injection detected
CVE-2013-0156 Ruby Ruby on Rails MIT Deserialization detected
CVE-2020-8130 Ruby Rake MIT CMDI detected
CVE-2021-21288 Ruby CarrierWave MIT SSRF detected
CVE-2023-38337 Ruby rswag MIT path_traversal detected
CVE-2017-9841 PHP PHPUnit BSD-3-Clause code_exec detected
CVE-2018-15133 PHP Laravel MIT Deserialization detected
CVE-2026-33486 PHP Roadiz CMS MIT SSRF detected
CVE-2018-20997 Rust tar-rs MIT OR Apache-2.0 path_traversal detected
CVE-2022-36113 Rust cargo MIT OR Apache-2.0 path_traversal detected
CVE-2023-42456 Rust sudo-rs Apache-2.0 path_traversal detected
CVE-2024-24576 Rust Rust stdlib MIT OR Apache-2.0 CMDI detected
CVE-2024-32884 Rust gitoxide Apache-2.0 OR MIT CMDI detected
CVE-2025-53549 Rust matrix-rust-sdk Apache-2.0 SQL Injection detected
CVE-2016-3714 C ImageMagick (ImageTragick) ImageMagick License CMDI detected
CVE-2017-1000117 C git (ssh:// argv injection) GPL-2.0 cmdi (argv-inj) deferred
CVE-2019-18634 C sudo (pwfeedback) ISC memory_safety detected
CVE-2019-13132 C++ ZeroMQ libzmq MPL-2.0 memory_safety detected
CVE-2022-1941 C++ Protocol Buffers BSD-3-Clause memory_safety detected
CVE-2026-25544 TypeScript Payload (Drizzle adapter) MIT sql_injection deferred
CVE-2026-42353 JavaScript i18next-http-middleware MIT path_traversal detected

Deferred entries are real bugs Nyx can't yet detect. The fixture stays committed with disabled: true in ground truth so the gap remains visible.

How CVEs get picked

  • Publicly disclosed with a stable advisory link.
  • Class Nyx already has a rule for, so the vulnerable fixture asserts on a concrete rule ID, not just a generic taint flow.
  • Reducible to roughly 30 lines without hiding the disclosed sink shape.
  • Permissive upstream license (MIT, Apache, BSD, MPL, ISC, ImageMagick).

Fixtures are minimal reproducers of the unsafe pattern, not verbatim upstream code.

CI floor

CI fails the build if rule-level precision drops below 0.861, recall below 0.944, or F1 below 0.901. Floors sit roughly 8 percentage points below the live baseline. A single-case flip is about 0.6 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a class-level regression. Floors only move up, when a durable improvement lands. Never relax them to paper over a regression.

The gate runs in the benchmark-gate job in .github/workflows/ci.yml. Thresholds are encoded at the bottom of tests/benchmark_test.rs.

Recent changes

Most recent first. Metrics are rule-level on the corpus size at that point.

Date Change Corpus P R F1
2026-05-04 C cvehunt session-0014: CVE-2017-1000117 (git ssh:// hostname-as-argv injection) added in corpus disabled — three-layer C engine gap: (a) array-element taint propagation through args[i] = ssh_host; writes, (b) missing c.cmdi.exec* AST patterns in src/patterns/c.rs, (c) sanitizer recognition of the upstream if (ssh_host[0] == '-') die(...) dash-prefix guard 565 1.000 1.000 1.000
2026-05-04 JS/TS array-method validator-callback narrowing (try_array_method_validator_callback_narrowing in src/taint/ssa_transfer/mod.rs) — <arr>.filter(<isSafeXxx>) / .find / .findLast strips Cap::all() from the call result when the callback resolves to a BooleanTrueIsValid validator; CVE-2026-42353 (i18next-http-middleware path traversal) re-enabled in ground truth, deferred queue cleared 563 1.000 1.000 1.000
2026-05-04 JS/TS ternary-RHS source-classification fix in src/cfg/conditions.rs::lower_ternary_branch (segment-strip first_member_label on the branch AST) — let arr = cond ? req.query.lng : ""; now propagates taint through the diamond's join phi instead of lowering both branches to labelless Assign-with-empty-uses; CVE-2026-42353 (i18next-http-middleware path traversal / SSRF) added in corpus disabled — needs Array.prototype.filter(known_validator_callback) precision bridge 561 1.000 1.000 1.000
2026-05-04 PHP class-method body taint analysis (declaration_list / interface_declaration / trait_declaration / enum_declaration mapped to Kind::Block in src/labels/php.rs); PHP unary_op_expression recognised as negation in detect_negation; camelCase normalisation in classify_condition so isSafeRemoteUrl(x) classifies as ValidationCall the same as is_safe_remote_url(x); PHP $-sigil stripping in extract_validation_target; fopen added as PHP SSRF sink; CVE-2026-33486 (roadiz/documents DownloadedFile::fromUrl(file://) SSRF/LFI) added 555 1.000 1.000 1.000
2026-05-04 Python Tier B py.xss.make_response_format AST pattern (Flask make_response(<f-string>) / make_response(<concat>)); CVE-2023-6568 (mlflow reflected XSS) and CVE-2024-21513 (langchain VectorSQLDatabaseChain _try_eval over DB rows) added 550 1.000 1.000 1.000
2026-05-03 Go for-range loop binding now defined from range_clause child of for_statement (was: tree-sitter wraps the binding/iterable on a child node; only direct left/right fields were consulted, so taint never reached the loop binding). gin sources extended to c.QueryArray / c.GetQueryArray / c.PostFormArray / c.GetPostFormArray. goqu raw SQL literal builders goqu.L / goqu.Lit recognised as SQL_QUERY sinks. CVE-2026-41422 (daptin aggregate API) detected 521 1.000 1.000 1.000
2026-05-02 TS regex-allowlist <*regex*>.test(value) / <*pattern*>.test(value) recognised as ValidationCall whose target is the first arg (overrides default receiver-as-target); conservative on receiver names so non-regex *.test() callees stay Unknown. CVE-2026-25544 (Payload drizzle SQL injection) lands in corpus disabled — needs validated-flow propagation through SSA derivation / helper-summary returns 499 1.000 1.000 1.000
2026-05-02 JS arrow assignment_pattern default-param extraction + JS object-literal kwarg fallback for gated sinks + double-call (f()(x)) chained-inner rebinding; lodash _.template modeled as gated CODE_EXEC sink suppressed by { evaluate: false }; CVE-2023-22621 (Strapi SSTI) detected 494
2026-05-02 strings.ReplaceAll recognised as CMDi sanitiser in chain-wrapper / call-site-replace shapes; clears go-safe-009 (last open corpus FP); aggregate rule-level reaches P=R=F1=1.000 492 1.000 1.000 1.000
2026-05-01 PathFact opaque-prefix-lock (canonicalise + start_with?(<expr>) recognised across Ruby/Python/JS) + is_path_traversal_safe predicate + negated-form polarity flip on assertion narrowing; rswag CVE-2023-38337 detected 490 0.972 0.992 0.982
2026-05-01 Ruby OpenURI.open_uri SSRF sink + inner-call fallback for statement-level Ruby calls (YAML.safe_load(File.read(x)) shape now classifies); CVE-2021-21288 (CarrierWave) detected 482 0.972 0.992 0.982
2026-04-29 Java SnakeYAML + Text4Shell patterns; CVE-2022-1471 and CVE-2022-42889 detected 449 0.996 1.000 0.998
2026-04-29 Indirect-validator branch narrowing (const err = validate(x); if (err) throw …;) + helper-summary all_validated propagation; Novu GHSA-4x48-cgf9-q33f detected 445 0.991 1.000 0.995
2026-04-29 Python f-string SQLi pattern + bindparams sanitizer + HttpClient SSRF rules; CVE-2025-69662 (geopandas) and CVE-2026-33626 (LMDeploy) detected 439 0.991 1.000 0.995
2026-04-29 Phantom-Param-aware field suppression: CVE-2023-3188 detected, FP guards hold 432 0.995 1.000 0.998
2026-04-28 Ruby bare Kernel#open CMDI sink, exact-match sigil on label matchers 428 0.995 1.000 0.998
2026-04-28 Go SSRF/FILE_IO sink expansion (http.DefaultClient.*, os.Remove/WriteFile) plus Decode-writeback container op 426 0.995 1.000 0.998
2026-04-27 JS chained-method inner-gate classification (http.get(u, cb).on(...)) 422 0.994 1.000 0.997
2026-04-23 Auth FP remediation: 10 Rust ownership-check fixtures wired to corpus 305 0.946 0.994 0.970
2026-04-23 C and C++ added as first-class CVE-corpus languages (5 new CVE pairs) 295 0.945 0.994 0.969
2026-04-23 Go, Java, Ruby, PHP, plus second Python CVE pair 285 0.944 0.994 0.968
2026-04-23 Real-CVE replay corpus seeded (Python, JS, TS, one CVE per language) 273 0.942 0.994 0.967
2026-04-22 Cross-file points-to summaries, SCC joint fixed-point, backwards taint 273 0.940 0.994 0.966
2026-04-22 Cross-file context-sensitive inline taint (k=1) 270 0.940 0.994 0.966
2026-04-20 Rust weak-spot fixes across FILE_IO, SSRF, SQL, DESERIALIZE sink families 262 0.906 0.994 0.948
2026-04-20 TypeScript weak-spot fixes, Fastify framework detection, TSX/JSX grammar 262 0.899 0.981 0.938
2026-04-20 Rust corpus expansion: honest FNs in classes lacking Rust rules 262 0.891 0.961 0.925
2026-04-20 TypeScript corpus 0 to 32 cases across 12 vuln classes 246 0.904 0.986 0.944
2026-03-24 Benchmark expansion: C, C++, Rust as first-class; +73 cases 214 0.827 0.950 0.885
2026-03-22 Cross-file SSA validation, multi-file directory cases 141 0.840 0.975 0.903
2026-03-22 Ruby corpus 1 to 21 cases across 8 vuln classes 123 0.821 0.986 0.896
2026-03-22 SSA lowering hardening (PHP closures, Python try/except, exception edges) 103 0.841 0.983 0.906
2026-03-21 SSRF semantic completion (axios, got, undici, httpx, Net::HTTP, HTTParty) 103 0.671 0.966 0.792
2026-03-21 Constant-arg suppression at AST and CFG level 95 0.654 0.964 0.779
2026-03-21 Bare exec/execSync as JS CMDI sinks; Python Template as XSS sink 95 0.624 0.964 0.757
2026-03-21 First baseline after symbolic-strings work 95 0.620 0.891 0.731

Known limitations

These show up across multiple corpora and aren't fully fixed yet.

  • Variable-receiver method calls (client.send(...) vs HttpClient.send(...)) miss without an inferred receiver type. Type-aware callee resolution closes most cases; some residuals remain.
  • Arbitrary import aliases (from flask import request as r) aren't traced. Only explicitly listed aliases resolve.
  • URL-parsing isn't credited as SSRF sanitization. Allowlist checks in conditions are recognised; call-site sanitizers aren't.
  • Rust unguarded-sink still fires for shell-escape sinks when a source is in scope but not flowing to the sink arg. Intentional for high-risk classes.
  • Rust negative-validation patterns (contains dominators, match-arm guards) aren't recognised yet.
  • DNS rebinding and async-callback flows are out of scope for static analysis without runtime context.