nyx/tests/benchmark
Eli Peter 82f18184b1
Prerelease cleanup (#46)
* feat: Add const_bound_vars tracking to prevent false positives in ownership checks

* feat: Introduce field interner and typed bounded vars for enhanced type tracking

* feat: Add typed_call_receivers and typed_bounded_dto_fields for enhanced type tracking

* feat: Centralize method name extraction with bare_method_name helper

* feat: Implement Phase-6 hierarchy fan-out for runtime virtual dispatch

* feat: Enhance C++ taint tracking with additional container operations and inline method resolution

* feat: Introduce field-sensitive points-to analysis for enhanced resource tracking

* feat: Implement Pointer-Phase 6 subscript handling for enhanced container analysis

* test: Add comprehensive tests for JavaScript control flow constructs and lattice operations

* docs: Update advanced analysis documentation with field-sensitive points-to and hierarchy fan-out details

* test: Add comprehensive tests for lattice algebra laws and SSA edge cases

* feat: Add destructured session user handling and safe user ID access patterns

* feat: Implement row-population reverse-walk for enhanced authorization checks

* feat: Enhance authorization checks with local alias chain for self-actor types

* feat: Introduce ActiveRecord query safety checks and enhance snippet extraction

* feat: Implement chained method call inner-gate rebinding for SSRF prevention

* feat: Add observability and error modules, enhance debug functionality, and implement theme context

* feat: Remove Auth Analysis page and update navigation to redirect to Explorer

* feat: Optimize SSA lowering by sharing results between taint engine and artifact extractor

* feat: Reset path-safe-suppressed spans before lowering to maintain analysis integrity

* fix(ssa): ungate debug_assert_bfs_ordering for release-tests build

The helper at src/ssa/lower.rs was gated `#[cfg(debug_assertions)]` while
the unit test at the bottom of the file was gated only `#[cfg(test)]`.
Since `cfg(test)` is set in release builds with `--tests` but
`cfg(debug_assertions)` is not, `cargo build --release --tests` failed
with E0425. Removing the gate fixes the build; the body is `debug_assert!`
only, so the helper is free in release. Also drop the gate at the call
site to avoid a `dead_code` warning when the lib is built without
`--tests`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(closure-capture): flip JS/TS fixtures to required-finding

The JS and TS closure-capture fixtures pinned the old broken behaviour
via `forbidden_findings: [{ "id_prefix": "taint-" }]`. The engine now
correctly traces taint through the closure boundary (env source captured
by an arrow function, sunk via `child_process.exec` inside the body), so
the formerly-forbidden finding is a true positive.

Match the Python sibling's shape — `required_findings` with
`id_prefix` + `min_count` plus a small `noise_budget` — and rewrite the
companion READMEs and the phase8_fragility_tests doc-comments from
"known gap" to "regression guard".

Verified:
- cargo test --release --test phase8_fragility_tests → 8/8 pass
- cargo test --release --lib bfs_assertion → pass
- corpus benchmark F1 = 0.9976 (TP=205, FP=1, FN=0) — unchanged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: Add OWASP mapping and baseline mutation hooks for enhanced security analysis

* feat: Introduce health module and enhance health score computation with calibration tests

* feat: Add expectations configuration and cleanup .gitignore for log files

* feat: Implement theme selection and enhance settings panel for triage sync

* feat: Suppress false positives for strcpy calls with literal sources in AST

* feat: Update analyse_function_ssa to return body CFG for accurate analysis

* feat: Add bug report and feature request templates for improved issue tracking

* feat: removed dev scripts

* feat: update README.md for clarity and consistency in fixture descriptions

* feat: removed dev docs

* feat: clean up error handling and UI elements for improved user experience

* feat: adjust button sizes in HeaderBar for better UI consistency

* feat: enhance taint analysis with additional context for sanitizer and taint findings

* cargo fmt

* prettier

* refactor: simplify conditional checks and improve code readability in AST and screenshot capture scripts

* feat: add script to frame PNG screenshots with brand gradient

* feat: add fuzzing support with new targets and CI workflows

* refactor: streamline match expressions and improve formatting in CLI and output handling

* feat: enhance configuration display with detailed output options

* feat: stage demo configuration for improved CLI screenshot output

* feat: expose merge_configs function for user-configurable settings

* refactor: simplify code structure and improve readability in config handling

* refactor: improve descriptions for vulnerability patterns in various languages

* feat: update MIT License section with additional usage details and copyright information

* feat: update screenshots

* refactor: update build process and paths for frontend assets

* feat: add cross-file taint fuzzing target and supporting dictionary

* refactor: clean up formatting and comments in fuzz configuration and example files

* refactor: remove outdated comments and clean up CI configuration files

* chore: update changelog dates and improve formatting in documentation

* refactor: update Cargo.toml and CI configuration for improved packaging and build process

* refactor: enhance quote-stripping logic to prevent panics and add regression tests

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:58:38 -04:00
corpus Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00
cve_corpus Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00
results Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00
ground_truth.json Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00
README.md Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00
RESULTS.md Prerelease cleanup (#46) 2026-04-29 00:58:38 -04:00

Nyx Benchmark Evaluation Framework

Corpus philosophy

The benchmark corpus is a curated set of ~430 minimal synthetic files (8-20 lines each) across 10 languages: JavaScript, TypeScript, Python, Java, Go, PHP, Ruby, Rust, C, and C++. Each file contains exactly one vulnerability (positive case) or demonstrates a specific safe pattern (negative case). The corpus additionally carries a small set of real-CVE replay cases (see cve_corpus/ and the "Real CVE coverage" section in RESULTS.md).

Design principles:

  • One vuln per file: isolates the detection signal from noise.
  • Analogue cases allowed: when a language lacks a specific sink (e.g., JS has no SQL_QUERY sink), we use an equivalent sink (e.g., eval()) to test the same dataflow concept. These are tagged equivalence_tier: "analogue".
  • Semantic truth: is_vulnerable reflects whether the code is vulnerable, independent of whether the current scanner detects it. This means some FNs are expected and acceptable.
  • Not full CWE coverage: the corpus tests the vulnerability classes Nyx targets, not every possible CWE.

Scoring modes

Mode 1: File-Level Presence (coarsest)

Does the scanner produce any security finding for this file?

  • TP: vulnerable file with at least one security finding
  • FP: safe file with any security finding
  • FN: vulnerable file with no security findings
  • TN: safe file with no security findings
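
The four outcomes above reduce to a two-by-two decision on the ground-truth label and on whether any security finding was reported. A minimal Rust sketch, assuming hypothetical names (`Outcome`, `classify_file`) that are not taken from tests/benchmark_test.rs:

```rust
/// Outcome of scoring one corpus file in file-level presence mode.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Outcome {
    TruePositive,
    FalsePositive,
    FalseNegative,
    TrueNegative,
}

/// Classify a file from its ground-truth label (`is_vulnerable`) and the
/// number of security findings the scanner reported for it.
/// Hypothetical helper, not the harness's actual function.
fn classify_file(is_vulnerable: bool, security_findings: usize) -> Outcome {
    match (is_vulnerable, security_findings > 0) {
        (true, true) => Outcome::TruePositive,
        (false, true) => Outcome::FalsePositive,
        (true, false) => Outcome::FalseNegative,
        (false, false) => Outcome::TrueNegative,
    }
}
```

Because only presence is checked, a vulnerable file counts as a TP in this mode even when the wrong rule fired, which is why this mode is the coarsest.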

Mode 2: Vuln-Class Scoring

Groups cases by vuln_class and computes precision/recall/F1 per class. Shows which vulnerability categories are strong or weak.
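
As an illustration of the grouping step, the sketch below tallies outcomes into one bucket per vuln_class. The tuple layout, the `Counts` struct, and the `scanner_flagged` flag are assumptions made for the example; the real harness may decide "flagged" differently.

```rust
use std::collections::BTreeMap;

/// Confusion counts for one vulnerability class.
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
struct Counts {
    tp: u32,
    fp: u32,
    fn_: u32,
    tn: u32,
}

/// Tally outcomes per class.
/// Each case is (vuln_class, is_vulnerable, scanner_flagged); this tuple
/// shape is hypothetical and chosen only to keep the sketch short.
fn tally_by_class(cases: &[(&str, bool, bool)]) -> BTreeMap<String, Counts> {
    let mut by_class: BTreeMap<String, Counts> = BTreeMap::new();
    for &(class, is_vulnerable, flagged) in cases {
        let c = by_class.entry(class.to_string()).or_default();
        match (is_vulnerable, flagged) {
            (true, true) => c.tp += 1,
            (false, true) => c.fp += 1,
            (true, false) => c.fn_ += 1,
            (false, false) => c.tn += 1,
        }
    }
    by_class
}
```

Precision, recall, and F1 can then be derived from each bucket independently, so a weak class is not hidden by strong ones.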

Mode 3: Rule-Level Scoring

Checks whether the correct rule fired:

  • TP: a finding matches expected_rule_ids or allowed_alternative_rule_ids
  • FP: safe file with any security finding, OR forbidden_rule_ids matched on a vulnerable file
  • FN: vulnerable file where no expected/alternative rule matched

Rule matching: exact match first, then substring fallback.
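
The two-stage match can be sketched as follows. This assumes the substring fallback means "an accepted id occurs inside the reported id"; the README does not state the direction, and the rule ids in the usage note are hypothetical.

```rust
/// How a reported rule id matched an accepted one.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RuleMatch {
    Exact,
    Substring,
}

/// Try an exact match against every accepted id first; only when none
/// matches, fall back to a substring test.
/// `accepted` stands for expected_rule_ids plus allowed_alternative_rule_ids.
fn match_rule(reported: &str, accepted: &[&str]) -> Option<RuleMatch> {
    if accepted.iter().any(|id| *id == reported) {
        return Some(RuleMatch::Exact);
    }
    // Assumption: fallback direction is "accepted id contained in reported id".
    // Empty ids are skipped because "" is a substring of every string.
    if accepted
        .iter()
        .any(|id| !id.is_empty() && reported.contains(*id))
    {
        return Some(RuleMatch::Substring);
    }
    None
}
```

For example, `match_rule("taint-sqli-concat", &["taint-sqli"])` yields `Some(RuleMatch::Substring)`, while an unrelated id yields `None`. Checking exact matches across the whole list before any fallback keeps a loose substring hit from shadowing a precise one.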

Mode 4: Location-Aware Scoring

When expected_sink_lines is present, checks that a matching finding falls within ±2 lines of the expected sink location. Falls back to Mode 3 when no line info is specified.
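
A small sketch of the ±2-line window and the fallback to Mode 3. The function names and the `rule_matched` flag are illustrative assumptions, not the harness's API.

```rust
/// Tolerance in lines around an expected sink, as stated in the README.
const LINE_TOLERANCE: u32 = 2;

/// True if `finding_line` lies within the tolerance of any expected sink line.
fn within_sink_window(finding_line: u32, expected_sink_lines: &[u32]) -> bool {
    expected_sink_lines
        .iter()
        .any(|&expected| finding_line.abs_diff(expected) <= LINE_TOLERANCE)
}

/// Mode 4 decision for one finding whose rule-level result is already known.
/// With no expected lines recorded, the rule match alone decides (Mode 3).
fn location_aware_match(
    rule_matched: bool,
    finding_line: u32,
    expected_sink_lines: &[u32],
) -> bool {
    if expected_sink_lines.is_empty() {
        return rule_matched; // fall back to Mode 3
    }
    rule_matched && within_sink_window(finding_line, expected_sink_lines)
}
```

With an expected sink on line 10, findings on lines 8 through 12 are accepted and a finding on line 13 is not.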

What metrics mean and don't mean

  • Precision is the fraction of flagged files that truly contain a vulnerability, TP / (TP + FP); it falls as false positives rise.
  • Recall is the detection rate, TP / (TP + FN): the fraction of vulnerable files the scanner catches.
  • F1 is the harmonic mean of precision and recall, 2PR / (P + R), balancing the two.
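
As a worked example, the commit message at the top of this page reports TP=205, FP=1, FN=0 with F1 = 0.9976. The sketch below reproduces that figure from the raw counts. Returning 0.0 for an empty denominator is an assumption of this sketch, not necessarily what the harness does.

```rust
/// Precision = TP / (TP + FP). Returns 0.0 when nothing was flagged
/// (a convention assumed for this sketch).
fn precision(tp: u32, fp: u32) -> f64 {
    let flagged = tp + fp;
    if flagged == 0 {
        0.0
    } else {
        f64::from(tp) / f64::from(flagged)
    }
}

/// Recall = TP / (TP + FN). Returns 0.0 when there are no positive cases.
fn recall(tp: u32, fn_: u32) -> f64 {
    let positives = tp + fn_;
    if positives == 0 {
        0.0
    } else {
        f64::from(tp) / f64::from(positives)
    }
}

/// F1 = harmonic mean of precision and recall.
fn f1(p: f64, r: f64) -> f64 {
    if p + r == 0.0 {
        0.0
    } else {
        2.0 * p * r / (p + r)
    }
}
```

With those counts, precision is 205/206 ≈ 0.9951, recall is 1.0, and F1 is 410/411 ≈ 0.9976.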

Caveats:

  • Scores on synthetic micro-benchmarks don't predict real-world performance.
  • equivalence_tier: "analogue" cases may inflate or deflate metrics depending on whether the proxy sink behaves like the real one.
  • equivalence_tier: "language_specific" cases have no cross-language equivalent and are scored independently.
  • Some FNs are expected (e.g., interprocedural flows the scanner doesn't yet track).

How to run

# Full benchmark (every case in ground_truth.json)
cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by language (python, typescript, javascript, java, go, php, ruby, rust, c, cpp)
NYX_BENCH_LANG=typescript cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by vulnerability class
NYX_BENCH_CLASS=sqli cargo test benchmark_evaluation -- --ignored --nocapture

# Single case
NYX_BENCH_CASE=js-sqli-001 cargo test benchmark_evaluation -- --ignored --nocapture

# Only positive (vulnerable) cases
NYX_BENCH_POSITIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Only negative (safe) cases
NYX_BENCH_NEGATIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by tag
NYX_BENCH_TAG=express cargo test benchmark_evaluation -- --ignored --nocapture
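
The sketch below shows one way such environment filters could compose. It assumes that every filter that is set must pass (logical AND) and that the two `*_ONLY` variables are enabled by being set at all; neither is confirmed by this README. The `Case` and `Filter` types and the case id `ts-sqli-001` are hypothetical.

```rust
/// Minimal view of a benchmark case; field names are illustrative.
struct Case<'a> {
    id: &'a str,
    language: &'a str,
    vuln_class: &'a str,
    is_vulnerable: bool,
    tags: &'a [&'a str],
}

/// Filters gathered from the NYX_BENCH_* variables listed above.
#[derive(Default)]
struct Filter {
    lang: Option<String>,
    class: Option<String>,
    case_id: Option<String>,
    tag: Option<String>,
    positive_only: bool,
    negative_only: bool,
}

impl Filter {
    /// Build a filter from any name -> value lookup. Pass
    /// `|k| std::env::var(k).ok()` to read the real environment.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        Filter {
            lang: get("NYX_BENCH_LANG"),
            class: get("NYX_BENCH_CLASS"),
            case_id: get("NYX_BENCH_CASE"),
            tag: get("NYX_BENCH_TAG"),
            // Assumption: presence of the variable enables the filter.
            positive_only: get("NYX_BENCH_POSITIVE_ONLY").is_some(),
            negative_only: get("NYX_BENCH_NEGATIVE_ONLY").is_some(),
        }
    }

    /// A case is kept only if it passes every filter that is set.
    fn keeps(&self, case: &Case) -> bool {
        let eq = |want: &Option<String>, have: &str| {
            want.as_deref().map_or(true, |w| w == have)
        };
        eq(&self.lang, case.language)
            && eq(&self.class, case.vuln_class)
            && eq(&self.case_id, case.id)
            && self.tag.as_deref().map_or(true, |t| case.tags.contains(&t))
            && !(self.positive_only && !case.is_vulnerable)
            && !(self.negative_only && case.is_vulnerable)
    }
}
```

Taking the lookup as a closure keeps the filter testable without mutating the process environment.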

How to add a new case

  1. Create a corpus file in corpus/{language}/{vuln_class}/filename.ext (8-20 lines, one vulnerability or safe pattern).
  2. Add a case entry to ground_truth.json with all required fields.
  3. Run the benchmark: cargo test benchmark_evaluation -- --ignored --nocapture
  4. Verify the outcome matches your expectation.

How to fix a case

If a case outcome is unexpected:

  1. Investigate the root cause: is the scanner wrong, or is the ground truth wrong?
  2. If the scanner is wrong, fix the scanner (not the ground truth).
  3. If the ground truth is wrong (e.g., wrong expected_rule_ids), update it with justification.
  4. Never auto-normalize ground truth to match scanner output.

How to regenerate results

Run the benchmark. results/latest.json is overwritten each time:

cargo test benchmark_evaluation -- --ignored --nocapture

Regression gate (CI)

The accuracy regression floor is enforced in CI by the benchmark-gate job in .github/workflows/ci.yml. Each pull request runs:

cargo test --release --all-features --test benchmark_test -- --ignored --nocapture benchmark_evaluation

and fails if the corpus rule-level metrics fall below the thresholds encoded at the bottom of tests/benchmark_test.rs:

| Metric | Floor | Current baseline (~432 cases) |
| --- | --- | --- |
| Precision | ≥ 0.861 | 0.991 |
| Recall | ≥ 0.944 | 0.995 |
| F1 | ≥ 0.901 | 0.993 |

The floors sit between roughly 5 and 13 pp below the current baseline (13.0 pp for precision, 5.1 pp for recall, 9.2 pp for F1). A single-case flip is about 0.2 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a real regression in a whole vulnerability class. The results/latest.json artifact is uploaded from the CI job for comparison across runs. The Rust job cache is warm, so the gate typically adds only a few seconds on top of the build.
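
The gate logic can be pictured as a comparison of each metric against its floor. The sketch below uses the floors from the table above; the `Metrics` struct and function names are hypothetical and do not mirror the assertions in tests/benchmark_test.rs.

```rust
/// Rule-level metrics for one benchmark run.
#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Floors copied from the regression-gate table.
const FLOORS: Metrics = Metrics {
    precision: 0.861,
    recall: 0.944,
    f1: 0.901,
};

/// Names of the metrics that fell below their floor.
/// An empty result means the gate passes.
fn gate_failures(run: Metrics) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if run.precision < FLOORS.precision {
        failed.push("precision");
    }
    if run.recall < FLOORS.recall {
        failed.push("recall");
    }
    if run.f1 < FLOORS.f1 {
        failed.push("f1");
    }
    failed
}

/// Headroom of a metric over its floor, in percentage points.
fn headroom_pp(current: f64, floor: f64) -> f64 {
    (current - floor) * 100.0
}
```

Applied to the baseline row of the table, the headroom is 13.0 pp for precision, 5.1 pp for recall, and 9.2 pp for F1, so recall is the metric with the least slack.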

Updating the thresholds is a deliberate change. Raise them when you land a measurable, durable improvement; never relax them to paper over a regression.