apunkt/nyx

mirror of https://github.com/elicpeter/nyx.git synced 2026-07-24 21:41:02 +02:00

Eli Peter 82f18184b1 Prerelease cleanup (#46)		2026-04-29 00:58:38 -04:00

README.md

Nyx Benchmark Evaluation Framework

Corpus philosophy

The benchmark corpus is a curated set of ~430 minimal synthetic files (8-20 lines each) across 10 languages: JavaScript, TypeScript, Python, Java, Go, PHP, Ruby, Rust, C, and C++. Each file contains exactly one vulnerability (positive case) or demonstrates a specific safe pattern (negative case). The corpus additionally carries a small set of real-CVE replay cases (see cve_corpus/ and the "Real CVE coverage" section in RESULTS.md).

Design principles:

One vuln per file: isolates the detection signal from noise.
Analogue cases allowed: when a language lacks a specific sink (e.g., JS has no SQL_QUERY sink), we use an equivalent sink (e.g., eval()) to test the same dataflow concept. These are tagged equivalence_tier: "analogue".
Semantic truth: is_vulnerable reflects whether the code is vulnerable, independent of whether the current scanner detects it. This means some FNs are expected and acceptable.
Not full CWE coverage: the corpus tests the vulnerability classes Nyx targets, not every possible CWE.

Scoring modes

Mode 1: File-Level Presence (coarsest)

Does the scanner produce any security finding for this file?

TP: vulnerable file with at least one security finding
FP: safe file with any security finding
FN: vulnerable file with no security findings
TN: safe file with no security findings
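
The four outcomes above form a two-by-two decision on the ground-truth label and on whether any security finding was reported. A minimal sketch, assuming hypothetical names (`Outcome`, `file_level_outcome`) that are not taken from the Nyx harness:

```rust
/// Outcome of scoring one corpus file in file-level presence mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    TruePositive,
    FalsePositive,
    FalseNegative,
    TrueNegative,
}

/// Classify one file from its ground-truth label and the number of
/// security findings the scanner reported for it.
/// (Hypothetical helper; the real harness may be structured differently.)
pub fn file_level_outcome(is_vulnerable: bool, security_findings: usize) -> Outcome {
    match (is_vulnerable, security_findings > 0) {
        (true, true) => Outcome::TruePositive,
        (false, true) => Outcome::FalsePositive,
        (true, false) => Outcome::FalseNegative,
        (false, false) => Outcome::TrueNegative,
    }
}
```

Because only the presence of a finding matters here, a vulnerable file flagged by an unrelated rule still counts as a TP in this mode; Modes 3 and 4 tighten that.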

Mode 2: Vuln-Class Scoring

Groups cases by vuln_class and computes precision/recall/F1 per class. Shows which vulnerability categories are strong or weak.
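
The grouping step could look roughly like the sketch below. It assumes each case has already been reduced to a `(vuln_class, is_vulnerable, flagged)` triple; the `Counts` struct and `tally_by_class` function are illustrative names, not the harness's own.

```rust
use std::collections::BTreeMap;

/// Confusion counts for one vulnerability class.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Counts {
    pub tp: u32,
    pub fp: u32,
    pub fn_: u32, // `fn` is a keyword, hence the trailing underscore
    pub tn: u32,
}

/// Tally per-class counts from (vuln_class, is_vulnerable, flagged) triples.
/// (Hypothetical input shape, chosen to keep the sketch self-contained.)
pub fn tally_by_class(cases: &[(&str, bool, bool)]) -> BTreeMap<String, Counts> {
    let mut by_class: BTreeMap<String, Counts> = BTreeMap::new();
    for &(class, is_vulnerable, flagged) in cases {
        let counts = by_class.entry(class.to_string()).or_default();
        match (is_vulnerable, flagged) {
            (true, true) => counts.tp += 1,
            (false, true) => counts.fp += 1,
            (true, false) => counts.fn_ += 1,
            (false, false) => counts.tn += 1,
        }
    }
    by_class
}
```

Precision, recall, and F1 are then computed from each class's counts independently, so a weak class is not hidden by a strong one.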

Mode 3: Rule-Level Scoring

Checks whether the correct rule fired:

TP: a finding matches expected_rule_ids or allowed_alternative_rule_ids
FP: safe file with any security finding, OR forbidden_rule_ids matched on a vulnerable file
FN: vulnerable file where no expected/alternative rule matched

Rule matching: exact match first, then substring fallback.
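
A sketch of the two-stage match follows. The README does not say which direction the substring test runs, so this assumes the finding's rule ID is searched for the expected ID; `RuleMatch` and `match_rule` are hypothetical names.

```rust
/// How a finding's rule ID matched an expected rule ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuleMatch {
    Exact,
    Substring,
}

/// Try an exact match against every expected ID first; only if none
/// matches, fall back to a substring test.
/// ASSUMPTION: the fallback checks whether the finding's ID contains the
/// expected ID. The real harness may test the opposite direction.
pub fn match_rule(finding_rule_id: &str, expected_rule_ids: &[&str]) -> Option<RuleMatch> {
    if expected_rule_ids.iter().any(|id| *id == finding_rule_id) {
        return Some(RuleMatch::Exact);
    }
    // Skip empty expected IDs: every string contains "", which would
    // turn the fallback into a match-anything rule.
    if expected_rule_ids
        .iter()
        .any(|id| !id.is_empty() && finding_rule_id.contains(*id))
    {
        return Some(RuleMatch::Substring);
    }
    None
}
```

Running the exact pass over all expected IDs before any substring test keeps a loose partial match from masking a precise one.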

Mode 4: Location-Aware Scoring

When expected_sink_lines is present, checks that a matching finding falls within ±2 lines of the expected sink location. Falls back to Mode 3 when no line info is specified.
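
The tolerance window and the fallback can be sketched as below. The function names and the choice to treat an empty line list like a missing one are assumptions for illustration.

```rust
/// Tolerance, in lines, around each expected sink line (the ±2 above).
pub const LINE_TOLERANCE: u32 = 2;

/// True when `finding_line` lies within the tolerance of any expected sink line.
pub fn near_expected_sink(finding_line: u32, expected_sink_lines: &[u32]) -> bool {
    expected_sink_lines
        .iter()
        .any(|&expected| finding_line.abs_diff(expected) <= LINE_TOLERANCE)
}

/// Location-aware check. With no line info it reduces to the rule-level
/// result (Mode 3); otherwise the rule must match and the line must be near.
/// ASSUMPTION: an empty list is treated the same as an absent one.
pub fn location_match(
    rule_matched: bool,
    finding_line: u32,
    expected_sink_lines: Option<&[u32]>,
) -> bool {
    match expected_sink_lines {
        Some(lines) if !lines.is_empty() => {
            rule_matched && near_expected_sink(finding_line, lines)
        }
        _ => rule_matched,
    }
}
```

With an expected sink on line 10, findings on lines 8 through 12 are accepted and a finding on line 13 is not.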

What metrics mean and don't mean

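Precision is the fraction of flagged files that truly have a vulnerability, TP / (TP + FP). It is not the false positive rate, but it drops as false positives accumulate.
Recall is the detection rate, TP / (TP + FN): how many real vulnerabilities the scanner catches.
F1 is the harmonic mean of the two, 2PR / (P + R), balancing precision and recall.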
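
For reference, the three metrics can be computed from confusion counts as in this sketch. Returning 0.0 when a denominator is zero is a convention assumed here, not something the README specifies.

```rust
/// Precision: fraction of flagged cases that are truly vulnerable.
/// ASSUMPTION: returns 0.0 when nothing was flagged.
pub fn precision(tp: u32, fp: u32) -> f64 {
    if tp + fp == 0 {
        0.0
    } else {
        f64::from(tp) / f64::from(tp + fp)
    }
}

/// Recall: fraction of truly vulnerable cases that were flagged.
/// ASSUMPTION: returns 0.0 when there are no vulnerable cases.
pub fn recall(tp: u32, fn_: u32) -> f64 {
    if tp + fn_ == 0 {
        0.0
    } else {
        f64::from(tp) / f64::from(tp + fn_)
    }
}

/// F1: harmonic mean of precision and recall.
pub fn f1(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}
```

As a worked case, TP = 3, FP = 1, FN = 3 gives precision 0.75, recall 0.5, and F1 = 2 × 0.75 × 0.5 / 1.25 = 0.6. True negatives appear in none of the three formulas.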

Caveats:

Scores on synthetic micro-benchmarks don't predict real-world performance.
equivalence_tier: "analogue" cases may inflate or deflate metrics depending on whether the proxy sink behaves like the real one.
equivalence_tier: "language_specific" cases have no cross-language equivalent and are scored independently.
Some FNs are expected (e.g., interprocedural flows the scanner doesn't yet track).

How to run

# Full benchmark (every case in ground_truth.json)
cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by language (python, typescript, javascript, java, go, php, ruby, rust, c, cpp)
NYX_BENCH_LANG=typescript cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by vulnerability class
NYX_BENCH_CLASS=sqli cargo test benchmark_evaluation -- --ignored --nocapture

# Single case
NYX_BENCH_CASE=js-sqli-001 cargo test benchmark_evaluation -- --ignored --nocapture

# Only positive (vulnerable) cases
NYX_BENCH_POSITIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Only negative (safe) cases
NYX_BENCH_NEGATIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by tag
NYX_BENCH_TAG=express cargo test benchmark_evaluation -- --ignored --nocapture

How to add a new case

Create a corpus file in corpus/{language}/{vuln_class}/filename.ext (8-20 lines, one vulnerability or safe pattern).
Add a case entry to ground_truth.json with all required fields.
Run the benchmark: cargo test benchmark_evaluation -- --ignored --nocapture
Verify the outcome matches your expectation.
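
Before adding the ground-truth entry, the path layout and size convention from the first step can be sanity-checked. The helpers below are hypothetical and not part of the Nyx test harness; they only encode the `corpus/{language}/{vuln_class}/filename.ext` layout and the 8-20 line range stated above.

```rust
/// Build the corpus-relative path for a new case, following the
/// corpus/{language}/{vuln_class}/filename.ext layout.
/// (Hypothetical helper for illustration.)
pub fn corpus_path(language: &str, vuln_class: &str, filename: &str) -> String {
    format!("corpus/{language}/{vuln_class}/{filename}")
}

/// True when the source text has between 8 and 20 lines, inclusive.
pub fn within_size_convention(source: &str) -> bool {
    (8..=20).contains(&source.lines().count())
}
```

For example, `corpus_path("python", "sqli", "example.py")` yields `corpus/python/sqli/example.py`, where the filename is made up for the illustration.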

How to fix a case

If a case outcome is unexpected:

Investigate the root cause: is the scanner wrong, or is the ground truth wrong?
If the scanner is wrong, fix the scanner (not the ground truth).
If the ground truth is wrong (e.g., wrong expected_rule_ids), update it with justification.
Never auto-normalize ground truth to match scanner output.

How to regenerate results

Run the benchmark. results/latest.json is overwritten each time:

cargo test benchmark_evaluation -- --ignored --nocapture

Regression gate (CI)

The accuracy regression floor is enforced in CI by the benchmark-gate job in .github/workflows/ci.yml. Each pull request runs:

cargo test --release --all-features --test benchmark_test -- --ignored --nocapture benchmark_evaluation

and fails if the corpus rule-level metrics fall below the thresholds encoded at the bottom of tests/benchmark_test.rs:

Metric	Floor	Current baseline (~432 cases)
Precision	≥ 0.861	0.991
Recall	≥ 0.944	0.995
F1	≥ 0.901	0.993

The floors sit between roughly 5 and 13 pp below the current baseline (13.0 pp for precision, 5.1 pp for recall, 9.2 pp for F1). A single-case flip is about 0.2 pp on this corpus, so the headroom absorbs honest FP/TN trades while still tripping on a real regression in a whole vulnerability class. The results/latest.json artifact is uploaded from the CI job for comparison across runs. The Rust job cache is warm, so the gate typically adds only a few seconds on top of the build.
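
The gate logic amounts to three floor comparisons. The sketch below copies the floor values from the table above; the constant and function names are assumptions, and the actual assertions in tests/benchmark_test.rs may be written differently.

```rust
// Floors copied from the table above.
pub const PRECISION_FLOOR: f64 = 0.861;
pub const RECALL_FLOOR: f64 = 0.944;
pub const F1_FLOOR: f64 = 0.901;

/// Return the names of the metrics that fall below their floor.
/// An empty vector means the gate passes. A value exactly equal to its
/// floor passes, matching the "≥" in the table.
/// (Hypothetical helper; not the harness's own function.)
pub fn gate_failures(precision: f64, recall: f64, f1: f64) -> Vec<&'static str> {
    let checks = [
        ("precision", precision, PRECISION_FLOOR),
        ("recall", recall, RECALL_FLOOR),
        ("f1", f1, F1_FLOOR),
    ];
    checks
        .iter()
        .filter(|(_, value, floor)| value < floor)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Reporting every failing metric, not only the first, makes it easier to see whether a regression hit precision, recall, or both.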

Updating the thresholds is a deliberate change. Raise them when you land a measurable, durable improvement; never relax them to paper over a regression.