nyx/tests/benchmark/RESULTS.md
Eli Peter 82f18184b1
Prerelease cleanup (#46)
2026-04-29 00:58:38 -04:00

Benchmark Results

Current baseline (2026-04-29):

| Metric | File-level | Rule-level | CI floor |
|---|---|---|---|
| Precision | 0.991 | 0.991 | 0.861 |
| Recall | 0.995 | 0.995 | 0.944 |
| F1 | 0.993 | 0.993 | 0.901 |

Corpus: 433 cases across 10 languages, 432 evaluated (1 disabled). Per-run JSON lands in tests/benchmark/results/ (latest.json plus dated snapshots). See README.md for what the scoring modes mean and how to run a subset.

The corpus is mostly synthetic 8-20 line fixtures, each holding one vulnerability or one safe pattern. A smaller real-CVE replay set under cve_corpus/ covers 18 published CVEs across 9 of the 10 languages (every language except Rust). Both contribute to the headline numbers.
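For reference, the precision, recall, and F1 columns follow the usual definitions over true positives, false positives, and false negatives. The sketch below assumes those standard formulas. The `Counts` struct and the counts themselves are illustrative, not taken from a specific run, and what counts as a true positive differs between file-level and rule-level scoring.

```rust
/// Confusion counts for one benchmark run.
/// Hypothetical struct for illustration, not a type from the Nyx codebase.
#[derive(Debug, Clone, Copy)]
struct Counts {
    tp: u32,
    fp: u32,
    fn_: u32,
}

impl Counts {
    /// Share of reported findings that were expected: TP / (TP + FP).
    fn precision(&self) -> f64 {
        let denom = self.tp + self.fp;
        if denom == 0 { 0.0 } else { f64::from(self.tp) / f64::from(denom) }
    }

    /// Share of expected findings that were reported: TP / (TP + FN).
    fn recall(&self) -> f64 {
        let denom = self.tp + self.fn_;
        if denom == 0 { 0.0 } else { f64::from(self.tp) / f64::from(denom) }
    }

    /// Harmonic mean of precision and recall.
    fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
    }
}

fn main() {
    // Illustrative counts only.
    let c = Counts { tp: 205, fp: 1, fn_: 0 };
    // prints: P=0.995 R=1.000 F1=0.998
    println!("P={:.3} R={:.3} F1={:.3}", c.precision(), c.recall(), c.f1());
}
```

With a few hundred positives, one extra false positive moves precision by roughly half a percentage point, which is why the tables report three decimals.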

Real CVE coverage

Each entry is a real disclosed CVE reduced to a minimal reproducer, with one vulnerable and one patched fixture per CVE. The vulnerable fixture must produce a finding for the disclosed sink class. The patched fixture must produce zero findings.
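The pair rule can be sketched as a small check. This is a minimal illustration, assuming findings carry a class label. `Finding`, `PairStatus`, and `check_pair` are hypothetical names, not the harness's real types.

```rust
/// One reported finding, reduced to the field this check needs.
/// Hypothetical shape; the real finding type is not shown in this document.
struct Finding {
    class: String,
}

/// Outcome of checking one vulnerable/patched fixture pair.
#[derive(Debug, PartialEq)]
enum PairStatus {
    /// Vulnerable fixture hit the expected class, patched fixture is clean.
    Detected,
    /// Vulnerable fixture produced no finding of the expected class.
    OpenFn,
    /// Patched fixture produced at least one finding.
    PatchedNotClean,
}

fn check_pair(expected_class: &str, vulnerable: &[Finding], patched: &[Finding]) -> PairStatus {
    // The patched fixture must be completely clean, whatever the class.
    if !patched.is_empty() {
        return PairStatus::PatchedNotClean;
    }
    // The vulnerable fixture needs at least one finding of the disclosed class.
    if vulnerable.iter().any(|f| f.class == expected_class) {
        PairStatus::Detected
    } else {
        PairStatus::OpenFn
    }
}

fn main() {
    let vulnerable = vec![Finding { class: "CMDI".to_string() }];
    let patched: Vec<Finding> = Vec::new();
    println!("{:?}", check_pair("CMDI", &vulnerable, &patched));
    println!("{:?}", check_pair("SSRF", &vulnerable, &patched));
    println!("{:?}", check_pair("CMDI", &vulnerable, &vulnerable));
}
```

Checking the patched side first is a design choice of this sketch: a noisy patched fixture is reported even when the vulnerable side also misses.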

| CVE | Language | Project | License | Class | Status |
|---|---|---|---|---|---|
| CVE-2023-48022 | Python | Ray | Apache-2.0 | CMDI | detected |
| CVE-2017-18342 | Python | PyYAML | MIT | Deserialization | detected |
| CVE-2019-14939 | JavaScript | mongo-express | MIT | code_exec | detected |
| CVE-2025-64430 | JavaScript | Parse Server | Apache-2.0 | SSRF | detected |
| CVE-2023-26159 | TypeScript | follow-redirects | MIT | SSRF | detected |
| CVE-2022-30323 | Go | hashicorp/go-getter | MPL-2.0 | CMDI | detected |
| CVE-2023-3188 | Go | owncast | MIT | SSRF | open FN |
| CVE-2024-31450 | Go | owncast | MIT | path_traversal | detected |
| CVE-2015-7501 | Java | Apache Commons Collections | Apache-2.0 | Deserialization | detected |
| CVE-2017-12629 | Java | Apache Solr | Apache-2.0 | CMDI | detected |
| CVE-2013-0156 | Ruby | Ruby on Rails | MIT | Deserialization | detected |
| CVE-2020-8130 | Ruby | Rake | MIT | CMDI | detected |
| CVE-2017-9841 | PHP | PHPUnit | BSD-3-Clause | code_exec | detected |
| CVE-2018-15133 | PHP | Laravel | MIT | Deserialization | detected |
| CVE-2016-3714 | C | ImageMagick (ImageTragick) | ImageMagick License | CMDI | detected |
| CVE-2019-18634 | C | sudo (pwfeedback) | ISC | memory_safety | detected |
| CVE-2019-13132 | C++ | ZeroMQ libzmq | MPL-2.0 | memory_safety | detected |
| CVE-2022-1941 | C++ | Protocol Buffers | BSD-3-Clause | memory_safety | detected |

Entries with status open FN are deferred: they are real bugs Nyx can't yet detect. The fixture stays committed with disabled: true in ground truth so the gap remains visible.
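A disabled entry stays in the corpus count but drops out of scoring. The sketch below shows that split under an assumed ground-truth shape. The `Case` struct and the fixture ids are hypothetical, and only the `disabled` flag comes from the text above.

```rust
/// One ground-truth entry.
/// Hypothetical shape; the real schema is not shown in this document.
struct Case {
    id: &'static str,
    disabled: bool,
}

/// Cases that count toward the metrics: everything not marked disabled.
fn evaluated(cases: &[Case]) -> Vec<&Case> {
    cases.iter().filter(|c| !c.disabled).collect()
}

fn main() {
    // Made-up fixture ids for illustration.
    let cases = [
        Case { id: "fixture-a", disabled: false },
        Case { id: "fixture-b", disabled: true },
        Case { id: "fixture-c", disabled: false },
    ];
    let active = evaluated(&cases);
    println!("{} committed, {} evaluated", cases.len(), active.len());
    for c in &active {
        println!("  {}", c.id);
    }
}
```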

How CVEs get picked

  • Publicly disclosed with a stable advisory link.
  • Class Nyx already has a rule for, so the vulnerable fixture asserts on a concrete rule ID, not just a generic taint flow.
  • Reducible to roughly 30 lines without hiding the disclosed sink shape.
  • Permissive upstream license (MIT, Apache, BSD, MPL, ISC, ImageMagick).

Fixtures are minimal reproducers of the unsafe pattern, not verbatim upstream code.

CI floor

CI fails the build if rule-level precision drops below 0.861, recall below 0.944, or F1 below 0.901. Against the live baseline that leaves 13.0 percentage points of headroom on precision, 5.1 on recall, and 9.2 on F1. A single-case flip is about 0.6 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a class-level regression. Floors only move up, and only when a durable improvement lands. Never relax them to paper over a regression.

The gate runs in the benchmark-gate job in .github/workflows/ci.yml. Thresholds are encoded at the bottom of tests/benchmark_test.rs.
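A floor check of this kind can be sketched as follows. This is not the code in tests/benchmark_test.rs. `Metrics`, `FLOOR`, and `floor_violations` are hypothetical names, and only the three floor values and the baseline numbers come from this document.

```rust
/// Rule-level metrics for one run (hypothetical struct for illustration).
#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// CI floors as stated in this document.
const FLOOR: Metrics = Metrics { precision: 0.861, recall: 0.944, f1: 0.901 };

/// Returns one message per metric that fell below its floor.
fn floor_violations(m: &Metrics) -> Vec<String> {
    let checks = [
        ("precision", m.precision, FLOOR.precision),
        ("recall", m.recall, FLOOR.recall),
        ("f1", m.f1, FLOOR.f1),
    ];
    checks
        .iter()
        .filter(|(_, value, floor)| value < floor)
        .map(|(name, value, floor)| format!("{name} {value:.3} is below floor {floor:.3}"))
        .collect()
}

fn main() {
    // Current baseline from the table at the top of this file.
    let baseline = Metrics { precision: 0.991, recall: 0.995, f1: 0.993 };
    assert!(floor_violations(&baseline).is_empty());

    // Headroom in percentage points between baseline and floor.
    println!("precision headroom: {:.1} pp", (baseline.precision - FLOOR.precision) * 100.0);
    println!("recall headroom:    {:.1} pp", (baseline.recall - FLOOR.recall) * 100.0);
    println!("f1 headroom:        {:.1} pp", (baseline.f1 - FLOOR.f1) * 100.0);

    // A made-up regressed run that would fail the gate on precision only.
    let regressed = Metrics { precision: 0.850, recall: 0.995, f1: 0.917 };
    for msg in floor_violations(&regressed) {
        println!("gate failure: {msg}");
    }
}
```

In a test, the same check reduces to asserting that the violation list is empty, which makes the failing metric visible in the CI log.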

Recent changes

Most recent first. Metrics are rule-level on the corpus size at that point.

| Date | Change | Corpus | P | R | F1 |
|---|---|---|---|---|---|
| 2026-04-28 | Ruby bare Kernel#open CMDI sink, exact-match sigil on label matchers | 428 | 0.995 | 1.000 | 0.998 |
| 2026-04-28 | Go SSRF/FILE_IO sink expansion (http.DefaultClient.*, os.Remove/WriteFile) plus Decode-writeback container op | 426 | 0.995 | 1.000 | 0.998 |
| 2026-04-27 | JS chained-method inner-gate classification (http.get(u, cb).on(...)) | 422 | 0.994 | 1.000 | 0.997 |
| 2026-04-23 | Auth FP remediation: 10 Rust ownership-check fixtures wired to corpus | 305 | 0.946 | 0.994 | 0.970 |
| 2026-04-23 | C and C++ added as first-class CVE-corpus languages (5 new CVE pairs) | 295 | 0.945 | 0.994 | 0.969 |
| 2026-04-23 | Go, Java, Ruby, PHP, plus second Python CVE pair | 285 | 0.944 | 0.994 | 0.968 |
| 2026-04-23 | Real-CVE replay corpus seeded (Python, JS, TS, one CVE per language) | 273 | 0.942 | 0.994 | 0.967 |
| 2026-04-22 | Cross-file points-to summaries, SCC joint fixed-point, backwards taint | 273 | 0.940 | 0.994 | 0.966 |
| 2026-04-22 | Cross-file context-sensitive inline taint (k=1) | 270 | 0.940 | 0.994 | 0.966 |
| 2026-04-20 | Rust weak-spot fixes across FILE_IO, SSRF, SQL, DESERIALIZE sink families | 262 | 0.906 | 0.994 | 0.948 |
| 2026-04-20 | TypeScript weak-spot fixes, Fastify framework detection, TSX/JSX grammar | 262 | 0.899 | 0.981 | 0.938 |
| 2026-04-20 | Rust corpus expansion: honest FNs in classes lacking Rust rules | 262 | 0.891 | 0.961 | 0.925 |
| 2026-04-20 | TypeScript corpus 0 to 32 cases across 12 vuln classes | 246 | 0.904 | 0.986 | 0.944 |
| 2026-03-24 | Benchmark expansion: C, C++, Rust as first-class; +73 cases | 214 | 0.827 | 0.950 | 0.885 |
| 2026-03-22 | Cross-file SSA validation, multi-file directory cases | 141 | 0.840 | 0.975 | 0.903 |
| 2026-03-22 | Ruby corpus 1 to 21 cases across 8 vuln classes | 123 | 0.821 | 0.986 | 0.896 |
| 2026-03-22 | SSA lowering hardening (PHP closures, Python try/except, exception edges) | 103 | 0.841 | 0.983 | 0.906 |
| 2026-03-21 | SSRF semantic completion (axios, got, undici, httpx, Net::HTTP, HTTParty) | 103 | 0.671 | 0.966 | 0.792 |
| 2026-03-21 | Constant-arg suppression at AST and CFG level | 95 | 0.654 | 0.964 | 0.779 |
| 2026-03-21 | Bare exec/execSync as JS CMDI sinks; Python Template as XSS sink | 95 | 0.624 | 0.964 | 0.757 |
| 2026-03-21 | First baseline after symbolic-strings work | 95 | 0.620 | 0.891 | 0.731 |

Known limitations

These show up across multiple corpora and aren't fully fixed yet.

  • Variable-receiver method calls (client.send(...) vs HttpClient.send(...)) miss without an inferred receiver type. Type-aware callee resolution closes most cases; some residuals remain.
  • Arbitrary import aliases (from flask import request as r) aren't traced. Only explicitly listed aliases resolve.
  • URL parsing isn't credited as SSRF sanitization. Allowlist checks in conditions are recognised; call-site sanitizers aren't.
  • Rust unguarded-sink still fires for shell-escape sinks when a source is in scope but not flowing to the sink arg. Intentional for high-risk classes.
  • Rust negative-validation patterns (contains dominators, match-arm guards) aren't recognised yet.
  • DNS rebinding and async-callback flows are out of scope for static analysis without runtime context.
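The Rust unguarded-sink item describes a shape like the sketch below: a taint source and a shell sink share a function, but the tainted value never reaches the sink argument. This is an illustrative fixture, not one from the corpus. The `USER_NOTE` variable and the function name are made up, and whether Nyx reports this exact shape is an assumption drawn from the description above.

```rust
use std::env;
use std::process::Command;

/// Builds a shell command whose arguments are all constants.
/// A tainted value is in scope, but only its length is logged.
fn build_listing_command() -> Command {
    // Taint source: an environment variable (the name USER_NOTE is made up).
    let note = env::var("USER_NOTE").unwrap_or_default();
    eprintln!("note length: {}", note.len());

    // Shell-escape sink: `sh -c` with a constant string.
    // `note` does not flow into any argument here.
    let mut cmd = Command::new("sh");
    cmd.arg("-c").arg("ls -1 /tmp");
    cmd
}

fn main() {
    // The command is only constructed and printed, never executed.
    let cmd = build_listing_command();
    println!("{:?}", cmd);
}
```

A finding on this shape is a false positive in the strict data-flow sense, and the text above describes it as an intentional trade-off for high-risk classes.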