# Benchmark Results

Current baseline (2026-04-29):

| Metric    | File-level | Rule-level | CI floor |
|-----------|------------|------------|----------|
| Precision | 0.991      | 0.991      | 0.861    |
| Recall    | 0.995      | 0.995      | 0.944    |
| F1        | 0.993      | 0.993      | 0.901    |

Corpus: 433 cases across 10 languages, 432 evaluated (1 disabled). Per-run JSON lands in `tests/benchmark/results/` (`latest.json` plus dated snapshots). See `README.md` for what the scoring modes mean and how to run a subset.

The corpus is mostly synthetic fixtures of 8 to 20 lines, each holding one vulnerability or one safe pattern. A smaller real-CVE replay set under `cve_corpus/` covers 18 published CVEs across 9 of the 10 languages (Rust has no CVE pair yet). Both contribute to the headline numbers.
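
As a reference for reading the headline table, here is a minimal sketch of how precision, recall, and F1 fall out of true-positive, false-positive, and false-negative counts. It assumes the standard definitions; the `Counts` and `Metrics` types and the counts in `main` are illustrative and are not taken from the benchmark harness.

```rust
/// Confusion counts for one benchmark run (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct Counts {
    tp: u32,
    fp: u32,
    fn_: u32,
}

#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Division that yields 0.0 instead of NaN when the denominator is zero.
fn ratio(num: f64, den: f64) -> f64 {
    if den == 0.0 {
        0.0
    } else {
        num / den
    }
}

/// Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R).
fn metrics(c: Counts) -> Metrics {
    let (tp, fp, fn_) = (c.tp as f64, c.fp as f64, c.fn_ as f64);
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    let f1 = ratio(2.0 * precision * recall, precision + recall);
    Metrics { precision, recall, f1 }
}

fn main() {
    // Illustrative counts, not the published baseline.
    let m = metrics(Counts { tp: 205, fp: 1, fn_: 0 });
    // prints: P=0.995 R=1.000 F1=0.998
    println!("P={:.3} R={:.3} F1={:.3}", m.precision, m.recall, m.f1);
}
```

Because F1 is the harmonic mean of precision and recall, a single false positive moves it by roughly half as much as it moves precision.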

## Real CVE coverage

Each entry is a real disclosed CVE reduced to a minimal reproducer, with one vulnerable and one patched fixture per CVE. Vulnerable fixtures must produce a finding for the disclosed sink class. Patched fixtures must produce zero findings.

| CVE            | Language   | Project                    | License              | Class           | Status   |
|----------------|------------|----------------------------|----------------------|-----------------|----------|
| CVE-2023-48022 | Python     | Ray                        | Apache-2.0           | CMDI            | detected |
| CVE-2017-18342 | Python     | PyYAML                     | MIT                  | Deserialization | detected |
| CVE-2019-14939 | JavaScript | mongo-express              | MIT                  | code_exec       | detected |
| CVE-2025-64430 | JavaScript | Parse Server               | Apache-2.0           | SSRF            | detected |
| CVE-2023-26159 | TypeScript | follow-redirects           | MIT                  | SSRF            | detected |
| CVE-2022-30323 | Go         | hashicorp/go-getter        | MPL-2.0              | CMDI            | detected |
| CVE-2023-3188  | Go         | owncast                    | MIT                  | SSRF            | open FN  |
| CVE-2024-31450 | Go         | owncast                    | MIT                  | path_traversal  | detected |
| CVE-2015-7501  | Java       | Apache Commons Collections | Apache-2.0           | Deserialization | detected |
| CVE-2017-12629 | Java       | Apache Solr                | Apache-2.0           | CMDI            | detected |
| CVE-2013-0156  | Ruby       | Ruby on Rails              | MIT                  | Deserialization | detected |
| CVE-2020-8130  | Ruby       | Rake                       | MIT                  | CMDI            | detected |
| CVE-2017-9841  | PHP        | PHPUnit                    | BSD-3-Clause         | code_exec       | detected |
| CVE-2018-15133 | PHP        | Laravel                    | MIT                  | Deserialization | detected |
| CVE-2016-3714  | C          | ImageMagick (ImageTragick) | ImageMagick License  | CMDI            | detected |
| CVE-2019-18634 | C          | sudo (pwfeedback)          | ISC                  | memory_safety   | detected |
| CVE-2019-13132 | C++        | ZeroMQ libzmq              | MPL-2.0              | memory_safety   | detected |
| CVE-2022-1941  | C++        | Protocol Buffers           | BSD-3-Clause         | memory_safety   | detected |

Entries marked `open FN` are real bugs Nyx can't yet detect. The fixture stays committed with `disabled: true` in ground truth so the gap remains visible.
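
The pair rule above can be sketched as a small check: the vulnerable fixture needs at least one finding of the disclosed class, the patched fixture needs none, and disabled pairs are skipped. The `Finding` and `PairOutcome` types and the `check_pair` function are hypothetical names for illustration, not the harness's actual API.

```rust
/// One finding reported for a fixture. The shape is hypothetical.
struct Finding {
    class: String,
}

/// Result of evaluating one vulnerable/patched pair.
#[derive(Debug, PartialEq)]
enum PairOutcome {
    Pass,
    /// The vulnerable fixture produced no finding of the expected class.
    MissedVulnerable,
    /// The patched fixture produced this many findings (it must produce zero).
    FlaggedPatched(usize),
    /// The pair is marked disabled in ground truth and is not evaluated.
    Skipped,
}

fn check_pair(
    expected_class: &str,
    disabled: bool,
    vulnerable: &[Finding],
    patched: &[Finding],
) -> PairOutcome {
    if disabled {
        return PairOutcome::Skipped;
    }
    if !vulnerable.iter().any(|f| f.class == expected_class) {
        return PairOutcome::MissedVulnerable;
    }
    if !patched.is_empty() {
        return PairOutcome::FlaggedPatched(patched.len());
    }
    PairOutcome::Pass
}

/// Convenience constructor for the examples below.
fn finding(class: &str) -> Finding {
    Finding { class: class.to_string() }
}

fn main() {
    let vulnerable = vec![finding("SSRF")];
    let patched: Vec<Finding> = Vec::new();
    // prints: Pass
    println!("{:?}", check_pair("SSRF", false, &vulnerable, &patched));
}
```

Note that the patched side is stricter than the vulnerable side in this sketch: any finding on a patched fixture fails the pair, regardless of class.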

### How CVEs get picked

- Publicly disclosed with a stable advisory link.
- Class Nyx already has a rule for, so the vulnerable fixture asserts on a concrete rule ID, not just a generic taint flow.
- Reducible to roughly 30 lines without hiding the disclosed sink shape.
- Permissive upstream license (MIT, Apache, BSD, MPL, ISC, ImageMagick).

Fixtures are minimal reproducers of the unsafe pattern, not verbatim upstream code.

## CI floor

CI fails the build if rule-level precision drops below 0.861, recall below 0.944, or F1 below 0.901. Against the current baseline, that leaves about 13 percentage points of headroom on precision, 5 on recall, and 9 on F1. A single-case flip is about 0.6 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a class-level regression. Floors only move up, and only when a durable improvement lands. Never relax them to paper over a regression.

The gate runs in the `benchmark-gate` job in `.github/workflows/ci.yml`. Thresholds are encoded at the bottom of `tests/benchmark_test.rs`.
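
A minimal sketch of what such a gate might look like, using the three floors quoted above. The `Metrics` struct, the `FLOOR` constant, and the `gate` function are assumptions for illustration; the real thresholds live in `tests/benchmark_test.rs` and may be structured differently.

```rust
#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Rule-level floors quoted in this section.
const FLOOR: Metrics = Metrics {
    precision: 0.861,
    recall: 0.944,
    f1: 0.901,
};

/// Returns Ok if every metric is at or above its floor,
/// otherwise one message per metric that fell below.
fn gate(m: Metrics) -> Result<(), Vec<String>> {
    let checks = [
        ("precision", m.precision, FLOOR.precision),
        ("recall", m.recall, FLOOR.recall),
        ("F1", m.f1, FLOOR.f1),
    ];
    let mut failures = Vec::new();
    for &(name, got, floor) in checks.iter() {
        // "Drops below" is read as strictly less than the floor.
        if got < floor {
            failures.push(format!("{} {:.3} is below floor {:.3}", name, got, floor));
        }
    }
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}

fn main() {
    // Current rule-level baseline from the table at the top of this file.
    let baseline = Metrics { precision: 0.991, recall: 0.995, f1: 0.993 };
    match gate(baseline) {
        Ok(()) => println!("benchmark gate passed"),
        Err(failures) => {
            for f in &failures {
                eprintln!("{}", f);
            }
            std::process::exit(1);
        }
    }
}
```

Reporting every failing metric, rather than stopping at the first, makes a class-level regression easier to diagnose from the CI log alone.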

## Recent changes

Most recent first. Metrics are rule-level on the corpus size at that point.

| Date       | Change                                                                       | Corpus | P     | R     | F1    |
|------------|------------------------------------------------------------------------------|--------|-------|-------|-------|
| 2026-04-28 | Ruby bare `Kernel#open` CMDI sink, exact-match sigil on label matchers        | 428    | 0.995 | 1.000 | 0.998 |
| 2026-04-28 | Go SSRF/FILE_IO sink expansion (`http.DefaultClient.*`, `os.Remove`/`WriteFile`) plus Decode-writeback container op | 426 | 0.995 | 1.000 | 0.998 |
| 2026-04-27 | JS chained-method inner-gate classification (`http.get(u, cb).on(...)`)      | 422    | 0.994 | 1.000 | 0.997 |
| 2026-04-23 | Auth FP remediation: 10 Rust ownership-check fixtures wired to corpus         | 305    | 0.946 | 0.994 | 0.970 |
| 2026-04-23 | C and C++ added as first-class CVE-corpus languages (5 new CVE pairs)         | 295    | 0.945 | 0.994 | 0.969 |
| 2026-04-23 | Go, Java, Ruby, PHP, plus second Python CVE pair                              | 285    | 0.944 | 0.994 | 0.968 |
| 2026-04-23 | Real-CVE replay corpus seeded (Python, JS, TS, one CVE per language)          | 273    | 0.942 | 0.994 | 0.967 |
| 2026-04-22 | Cross-file points-to summaries, SCC joint fixed-point, backwards taint        | 273    | 0.940 | 0.994 | 0.966 |
| 2026-04-22 | Cross-file context-sensitive inline taint (k=1)                               | 270    | 0.940 | 0.994 | 0.966 |
| 2026-04-20 | Rust weak-spot fixes across FILE_IO, SSRF, SQL, DESERIALIZE sink families     | 262    | 0.906 | 0.994 | 0.948 |
| 2026-04-20 | TypeScript weak-spot fixes, Fastify framework detection, TSX/JSX grammar      | 262    | 0.899 | 0.981 | 0.938 |
| 2026-04-20 | Rust corpus expansion: honest FNs in classes lacking Rust rules               | 262    | 0.891 | 0.961 | 0.925 |
| 2026-04-20 | TypeScript corpus 0 to 32 cases across 12 vuln classes                        | 246    | 0.904 | 0.986 | 0.944 |
| 2026-03-24 | Benchmark expansion: C, C++, Rust as first-class; +73 cases                   | 214    | 0.827 | 0.950 | 0.885 |
| 2026-03-22 | Cross-file SSA validation, multi-file directory cases                         | 141    | 0.840 | 0.975 | 0.903 |
| 2026-03-22 | Ruby corpus 1 to 21 cases across 8 vuln classes                               | 123    | 0.821 | 0.986 | 0.896 |
| 2026-03-22 | SSA lowering hardening (PHP closures, Python try/except, exception edges)     | 103    | 0.841 | 0.983 | 0.906 |
| 2026-03-21 | SSRF semantic completion (axios, got, undici, httpx, Net::HTTP, HTTParty)     | 103    | 0.671 | 0.966 | 0.792 |
| 2026-03-21 | Constant-arg suppression at AST and CFG level                                 | 95     | 0.654 | 0.964 | 0.779 |
| 2026-03-21 | Bare `exec`/`execSync` as JS CMDI sinks; Python `Template` as XSS sink        | 95     | 0.624 | 0.964 | 0.757 |
| 2026-03-21 | First baseline after symbolic-strings work                                    | 95     | 0.620 | 0.891 | 0.731 |

## Known limitations

These show up across multiple corpora and aren't fully fixed yet.

- **Variable-receiver method calls** (`client.send(...)` vs `HttpClient.send(...)`) are missed when the receiver type can't be inferred. Type-aware callee resolution closes most cases; some residuals remain.
- **Arbitrary import aliases** (`from flask import request as r`) aren't traced. Only explicitly listed aliases resolve.
- **URL-parsing isn't credited as SSRF sanitization.** Allowlist checks in conditions are recognised; call-site sanitizers aren't.
- **Rust unguarded-sink** still fires for shell-escape sinks when a source is in scope but not flowing to the sink arg. Intentional for high-risk classes.
- **Rust negative-validation** patterns (`contains` dominators, match-arm guards) aren't recognised yet.
- **DNS rebinding and async-callback flows** are out of scope for static analysis without runtime context.
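
To make the Rust negative-validation item concrete, here is one plausible reading of the two shapes it names: a `contains` check whose early return dominates the sink, and a match-arm guard that routes rejected input away from the sink. The function names and the `uploads/` directory are hypothetical, and the `..` check is illustrative only, not a complete path-traversal defence. Under this reading, code like the following may still be reported even though the sink is only reachable for validated input.

```rust
use std::fs;
use std::io;

/// `contains` dominator: the early return on ".." dominates the file read,
/// so the sink is only reachable for names that passed the check.
fn read_upload(name: &str) -> io::Result<String> {
    if name.contains("..") {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "path traversal rejected",
        ));
    }
    fs::read_to_string(format!("uploads/{}", name))
}

/// Match-arm guard: the first arm rejects dangerous shapes,
/// so the sink in the second arm only sees names that failed the guard.
fn read_upload_guarded(name: &str) -> io::Result<String> {
    match name {
        n if n.contains("..") || n.starts_with('/') => Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "unsafe path rejected",
        )),
        n => fs::read_to_string(format!("uploads/{}", n)),
    }
}

fn main() {
    for name in ["../etc/passwd", "/etc/passwd", "report.txt"].iter() {
        // The last name reaches the sink; whether it succeeds depends on
        // whether uploads/report.txt exists on this machine.
        println!("{} -> dominator ok: {}", name, read_upload(name).is_ok());
        println!("{} -> guard ok: {}", name, read_upload_guarded(name).is_ok());
    }
}
```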