mirror of
https://github.com/elicpeter/nyx.git
synced 2026-06-09 19:45:13 +02:00
* feat: Add const_bound_vars tracking to prevent false positives in ownership checks
* feat: Introduce field interner and typed bounded vars for enhanced type tracking
* feat: Add typed_call_receivers and typed_bounded_dto_fields for enhanced type tracking
* feat: Centralize method name extraction with bare_method_name helper
* feat: Implement Phase-6 hierarchy fan-out for runtime virtual dispatch
* feat: Enhance C++ taint tracking with additional container operations and inline method resolution
* feat: Introduce field-sensitive points-to analysis for enhanced resource tracking
* feat: Implement Pointer-Phase 6 subscript handling for enhanced container analysis
* test: Add comprehensive tests for JavaScript control flow constructs and lattice operations
* docs: Update advanced analysis documentation with field-sensitive points-to and hierarchy fan-out details
* test: Add comprehensive tests for lattice algebra laws and SSA edge cases
* feat: Add destructured session user handling and safe user ID access patterns
* feat: Implement row-population reverse-walk for enhanced authorization checks
* feat: Enhance authorization checks with local alias chain for self-actor types
* feat: Introduce ActiveRecord query safety checks and enhance snippet extraction
* feat: Implement chained method call inner-gate rebinding for SSRF prevention
* feat: Add observability and error modules, enhance debug functionality, and implement theme context
* feat: Remove Auth Analysis page and update navigation to redirect to Explorer
* feat: Optimize SSA lowering by sharing results between taint engine and artifact extractor
* feat: Reset path-safe-suppressed spans before lowering to maintain analysis integrity
* fix(ssa): ungate debug_assert_bfs_ordering for release-tests build
The helper at src/ssa/lower.rs was gated `#[cfg(debug_assertions)]` while
the unit test at the bottom of the file was gated only `#[cfg(test)]`.
Since `cfg(test)` is set in release builds with `--tests` but
`cfg(debug_assertions)` is not, `cargo build --release --tests` failed
with E0425. Removing the gate fixes the build; the body is `debug_assert!`
only, so the helper is free in release. Also drop the gate at the call
site to avoid a `dead_code` warning when the lib is built without
`--tests`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(closure-capture): flip JS/TS fixtures to required-finding
The JS and TS closure-capture fixtures pinned the old broken behaviour
via `forbidden_findings: [{ "id_prefix": "taint-" }]`. The engine now
correctly traces taint through the closure boundary (env source captured
by an arrow function, sunk via `child_process.exec` inside the body), so
the formerly-forbidden finding is a true positive.
Match the Python sibling's shape — `required_findings` with
`id_prefix` + `min_count` plus a small `noise_budget` — and rewrite the
companion READMEs and the phase8_fragility_tests doc-comments from
"known gap" to "regression guard".
Verified:
- cargo test --release --test phase8_fragility_tests → 8/8 pass
- cargo test --release --lib bfs_assertion → pass
- corpus benchmark F1 = 0.9976 (TP=205, FP=1, FN=0) — unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: Add OWASP mapping and baseline mutation hooks for enhanced security analysis
* feat: Introduce health module and enhance health score computation with calibration tests
* feat: Add expectations configuration and cleanup .gitignore for log files
* feat: Implement theme selection and enhance settings panel for triage sync
* feat: Suppress false positives for strcpy calls with literal sources in AST
* feat: Update analyse_function_ssa to return body CFG for accurate analysis
* feat: Add bug report and feature request templates for improved issue tracking
* feat: removed dev scripts
* feat: update README.md for clarity and consistency in fixture descriptions
* feat: removed dev docs
* feat: clean up error handling and UI elements for improved user experience
* feat: adjust button sizes in HeaderBar for better UI consistency
* feat: enhance taint analysis with additional context for sanitizer and taint findings
* cargo fmt
* prettier
* refactor: simplify conditional checks and improve code readability in AST and screenshot capture scripts
* feat: add script to frame PNG screenshots with brand gradient
* feat: add fuzzing support with new targets and CI workflows
* refactor: streamline match expressions and improve formatting in CLI and output handling
* feat: enhance configuration display with detailed output options
* feat: stage demo configuration for improved CLI screenshot output
* feat: expose merge_configs function for user-configurable settings
* refactor: simplify code structure and improve readability in config handling
* refactor: improve descriptions for vulnerability patterns in various languages
* feat: update MIT License section with additional usage details and copyright information
* feat: update screenshots
* refactor: update build process and paths for frontend assets
* feat: add cross-file taint fuzzing target and supporting dictionary
* refactor: clean up formatting and comments in fuzz configuration and example files
* refactor: remove outdated comments and clean up CI configuration files
* chore: update changelog dates and improve formatting in documentation
* refactor: update Cargo.toml and CI configuration for improved packaging and build process
* refactor: enhance quote-stripping logic to prevent panics and add regression tests
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Nyx Benchmark Evaluation Framework

## Corpus philosophy

The benchmark corpus is a curated set of ~430 minimal synthetic files (8-20 lines each) across 10 languages: JavaScript, TypeScript, Python, Java, Go, PHP, Ruby, Rust, C, and C++. Each file contains exactly one vulnerability (positive case) or demonstrates a specific safe pattern (negative case). The corpus additionally carries a small set of real-CVE replay cases (see `cve_corpus/` and the "Real CVE coverage" section in `RESULTS.md`).

Design principles:
- **One vuln per file**: isolates the detection signal from noise.
- **Analogue cases allowed**: when a language lacks a specific sink (e.g., JS has no SQL_QUERY sink), we use an equivalent sink (e.g., `eval()`) to test the same dataflow concept. These are tagged `equivalence_tier: "analogue"`.
- **Semantic truth**: `is_vulnerable` reflects whether the code *is* vulnerable, independent of whether the current scanner detects it. This means some FNs are expected and acceptable.
- **Not full CWE coverage**: the corpus tests the vulnerability classes Nyx targets, not every possible CWE.

## Scoring modes

### Mode 1: File-Level Presence (coarsest)
Does the scanner produce *any* security finding for this file?
- TP: vulnerable file with at least one security finding
- FP: safe file with any security finding
- FN: vulnerable file with no security findings
- TN: safe file with no security findings

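The four Mode 1 outcomes can be sketched as a small decision function. This is a minimal illustration, not the code in `tests/benchmark_test.rs`; the names `Outcome` and `classify_file_level` are hypothetical.

```rust
/// Outcome of scoring one benchmark case at file level.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Outcome {
    TruePositive,
    FalsePositive,
    FalseNegative,
    TrueNegative,
}

/// Classify a case from its ground-truth label and the number of
/// security findings the scanner reported for the file.
fn classify_file_level(is_vulnerable: bool, security_findings: usize) -> Outcome {
    match (is_vulnerable, security_findings > 0) {
        (true, true) => Outcome::TruePositive,
        (false, true) => Outcome::FalsePositive,
        (true, false) => Outcome::FalseNegative,
        (false, false) => Outcome::TrueNegative,
    }
}
```

Only the presence of a finding matters at this level; which rule fired and where is left to Modes 3 and 4.
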
### Mode 2: Vuln-Class Scoring
Groups cases by `vuln_class` and computes precision/recall/F1 per class. Shows which vulnerability categories are strong or weak.

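The grouping step of Mode 2 might look like the sketch below, which tallies outcomes per `vuln_class` so that precision, recall, and F1 can then be computed from each class's counts. The `Counts` type, the `tally_by_class` function, and the tuple layout of a case are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Confusion-matrix counts for one vulnerability class.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
struct Counts {
    tp: u32,
    fp: u32,
    fn_: u32,
    tn: u32,
}

/// Tally outcomes per class. Each case is given as
/// `(vuln_class, is_vulnerable, flagged_by_scanner)`.
fn tally_by_class(cases: &[(&str, bool, bool)]) -> BTreeMap<String, Counts> {
    let mut per_class: BTreeMap<String, Counts> = BTreeMap::new();
    for &(class, is_vulnerable, flagged) in cases {
        let counts = per_class.entry(class.to_string()).or_default();
        match (is_vulnerable, flagged) {
            (true, true) => counts.tp += 1,
            (false, true) => counts.fp += 1,
            (true, false) => counts.fn_ += 1,
            (false, false) => counts.tn += 1,
        }
    }
    per_class
}
```

A `BTreeMap` keeps the classes in sorted order, which makes a per-class report stable from run to run.
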
### Mode 3: Rule-Level Scoring
Checks whether the *correct* rule fired:
- TP: a finding matches `expected_rule_ids` or `allowed_alternative_rule_ids`
- FP: safe file with any security finding, OR `forbidden_rule_ids` matched on a vulnerable file
- FN: vulnerable file where no expected/alternative rule matched

Rule matching: exact match first, then substring fallback.

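The "exact match first, then substring fallback" order can be sketched as follows. The direction of the substring test is an assumption: here the reported rule ID must contain the expected ID. `RuleMatch` and `match_rule` are hypothetical names, not the benchmark's own.

```rust
/// How a reported rule ID matched an expected one.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RuleMatch {
    Exact,
    Substring,
}

/// Try an exact match against every expected ID first; only if none
/// matches, fall back to substring containment.
/// Assumption: the fallback checks that `reported` contains the expected ID.
fn match_rule(reported: &str, expected_ids: &[&str]) -> Option<RuleMatch> {
    if expected_ids.iter().any(|id| *id == reported) {
        return Some(RuleMatch::Exact);
    }
    // Skip empty IDs, which would otherwise match every reported rule.
    if expected_ids
        .iter()
        .any(|id| !id.is_empty() && reported.contains(*id))
    {
        return Some(RuleMatch::Substring);
    }
    None
}
```

Running the exact pass over the whole list before the fallback means a precise match is never shadowed by a looser one that happens to appear earlier in `expected_rule_ids`.
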
### Mode 4: Location-Aware Scoring
When `expected_sink_lines` is present, checks that a matching finding falls within ±2 lines of the expected sink location. Falls back to Mode 3 when no line info is specified.

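A minimal sketch of the ±2-line check is shown below. It returns `None` when no line information is available, which stands in for "fall back to Mode 3". The names `LINE_TOLERANCE` and `location_check` are hypothetical.

```rust
/// Tolerance, in lines, taken from the "±2 lines" rule above.
const LINE_TOLERANCE: u32 = 2;

/// Returns `None` when the case has no `expected_sink_lines` (the caller
/// would then score the case with Mode 3), otherwise whether the finding
/// lies within the tolerance of any expected sink line.
fn location_check(finding_line: u32, expected_sink_lines: &[u32]) -> Option<bool> {
    if expected_sink_lines.is_empty() {
        return None;
    }
    Some(
        expected_sink_lines
            .iter()
            .any(|&expected| finding_line.abs_diff(expected) <= LINE_TOLERANCE),
    )
}
```

For an expected sink on line 10, findings on lines 8 through 12 pass and a finding on line 13 does not.
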
## What metrics mean and don't mean

- **Precision** measures how reliable a flag is: of the files the scanner flags, the fraction that truly have a vulnerability.
- **Recall** measures detection rate: how many real vulnerabilities the scanner catches.
- **F1** is the harmonic mean, balancing precision and recall.

Caveats:
- Scores on synthetic micro-benchmarks don't predict real-world performance.
- `equivalence_tier: "analogue"` cases may inflate or deflate metrics depending on whether the proxy sink behaves like the real one.
- `equivalence_tier: "language_specific"` cases have no cross-language equivalent and are scored independently.
- Some FNs are *expected* (e.g., interprocedural flows the scanner doesn't yet track).

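The three metrics follow directly from the TP/FP/FN counts. The sketch below uses the standard definitions; returning 0.0 for an empty denominator is a convention chosen here, not necessarily what the benchmark does, and `Metrics`, `ratio`, and `compute_metrics` are hypothetical names.

```rust
/// Precision, recall, and F1 for one set of counts.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Divide two counts, returning 0.0 when the denominator is zero
/// (an assumed convention for empty classes).
fn ratio(numerator: u32, denominator: u32) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        f64::from(numerator) / f64::from(denominator)
    }
}

/// precision = TP / (TP + FP), recall = TP / (TP + FN),
/// F1 = harmonic mean of precision and recall.
fn compute_metrics(tp: u32, fp: u32, fn_: u32) -> Metrics {
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    Metrics { precision, recall, f1 }
}
```

With the counts quoted in the commit log (TP=205, FP=1, FN=0), this gives an F1 of 410/411, which rounds to the reported 0.9976.
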
## How to run

```bash
# Full benchmark (every case in ground_truth.json)
cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by language (python, typescript, javascript, java, go, php, ruby, rust, c, cpp)
NYX_BENCH_LANG=typescript cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by vulnerability class
NYX_BENCH_CLASS=sqli cargo test benchmark_evaluation -- --ignored --nocapture

# Single case
NYX_BENCH_CASE=js-sqli-001 cargo test benchmark_evaluation -- --ignored --nocapture

# Only positive (vulnerable) cases
NYX_BENCH_POSITIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Only negative (safe) cases
NYX_BENCH_NEGATIVE_ONLY=1 cargo test benchmark_evaluation -- --ignored --nocapture

# Filter by tag
NYX_BENCH_TAG=express cargo test benchmark_evaluation -- --ignored --nocapture
```

## How to add a new case

1. Create a corpus file in `corpus/{language}/{vuln_class}/filename.ext` (8-20 lines, one vulnerability or safe pattern).
2. Add a case entry to `ground_truth.json` with all required fields.
3. Run the benchmark: `cargo test benchmark_evaluation -- --ignored --nocapture`
4. Verify the outcome matches your expectation.

## How to fix a case

If a case outcome is unexpected:
1. Investigate the root cause: is the scanner wrong, or is the ground truth wrong?
2. If the scanner is wrong, fix the scanner (not the ground truth).
3. If the ground truth is wrong (e.g., wrong `expected_rule_ids`), update it with justification.
4. Never auto-normalize ground truth to match scanner output.

## How to regenerate results

Run the benchmark. `results/latest.json` is overwritten each time:

```bash
cargo test benchmark_evaluation -- --ignored --nocapture
```

## Regression gate (CI)

The accuracy regression floor is enforced in CI by the `benchmark-gate` job in
`.github/workflows/ci.yml`. Each pull request runs:

```bash
cargo test --release --all-features --test benchmark_test -- --ignored --nocapture benchmark_evaluation
```

and fails if the corpus rule-level metrics fall below the thresholds encoded
at the bottom of `tests/benchmark_test.rs`:

| Metric | Floor | Current baseline (~432 cases) |
|---|---|---|
| Precision | ≥ 0.861 | 0.991 |
| Recall | ≥ 0.944 | 0.995 |
| F1 | ≥ 0.901 | 0.993 |

The floors sit roughly 5 to 13 pp below the current baseline (about 13 pp for
precision, 5 pp for recall, and 9 pp for F1). A single-case flip
is about 0.2 pp on this corpus, so the headroom absorbs honest FP/TN
trades while still tripping on a real regression in a whole vulnerability
class. The `results/latest.json` artifact is uploaded from the CI job for
comparison across runs. The Rust job cache is warm, so the gate typically
adds only a few seconds on top of the build.

Updating the thresholds is a deliberate change. Raise them when you land a
measurable, durable improvement; never relax them to paper over a regression.

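The gate itself amounts to comparing each metric with its floor. The sketch below uses the floors from the table above; the constant names and the `gate_failures` function are hypothetical, and the actual assertions in `tests/benchmark_test.rs` may be written differently.

```rust
// Floors copied from the table above.
const PRECISION_FLOOR: f64 = 0.861;
const RECALL_FLOOR: f64 = 0.944;
const F1_FLOOR: f64 = 0.901;

/// Return one message per metric that falls below its floor.
/// An empty vector means the gate passes.
fn gate_failures(precision: f64, recall: f64, f1: f64) -> Vec<String> {
    let checks = [
        ("precision", precision, PRECISION_FLOOR),
        ("recall", recall, RECALL_FLOOR),
        ("f1", f1, F1_FLOOR),
    ];
    let mut failures = Vec::new();
    for (name, value, floor) in checks {
        if value < floor {
            failures.push(format!("{} {:.3} is below floor {:.3}", name, value, floor));
        }
    }
    failures
}
```

Collecting every failing metric, instead of stopping at the first, lets one CI run show the full extent of a regression.
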