introduce ground-truth converters for OWASP and SARD datasets

2026-07-27 21:51:03 +02:00 · 2026-05-12 16:16:26 -04:00 · 2026-05-12 16:16:26 -04:00 · 5909fa8c5d
commit 5909fa8c5d
parent e62fddb82a
14 changed files with 16779 additions and 369 deletions
--- a/.claude/scheduled_tasks.lock
+++ b/.claude/scheduled_tasks.lock
@@ -0,0 +1 @@
+{"sessionId":"3b3f9549-dbfc-4df7-8b4d-2b6393536381","pid":19723,"procStart":"Tue May 12 19:32:36 2026","acquiredAt":1778614799698}
--- a/docs/dynamic.md
+++ b/docs/dynamic.md
@@ -1,6 +1,6 @@
 # Dynamic verification

-As of M7, nyx verifies every `Confidence >= Medium` finding by default: it builds
+Nyx verifies every `Confidence >= Medium` finding by default: it builds
 a minimal harness, runs your code's entry point against a curated payload corpus
 inside a sandbox, and records the verdict in each finding's evidence block.

--- a/docs/dynamic_eval_m7.md
+++ b/docs/dynamic_eval_m7.md
@@ -1,89 +0,0 @@
-# Dynamic verification — M7 eval corpus report
-
-This document records the precision/recall calibration that preceded the M7
-default-on flip. The calibration was run against:
-
-- **OWASP Benchmark v1.2** (Java, 2,740 test cases across 11 vulnerability classes)
-- **NIST SARD selected subset** (Java, Python, C/C++)
-- **In-house bughunt-curated set** (multi-language fixtures from real-world repos
-  used in the `project_realrepo_*` bughunt sessions)
-
-## Ranking calibration: N and M
-
-The `dynamic_verdict_delta` component in `rank.rs` applies:
-
-- `+N` (N = **20**) when `status == Confirmed`
-- `−M` (M = **5**) when `status == NotConfirmed` and the corpus was exhausted
-
-### Derivation
-
-The tier-ordering invariant requires that a `High` severity `Confirmed` finding
-always ranks above a `High` severity static-only finding regardless of taint
-quality. With baseline `High` score = 60 and maximum taint bonus = 10 + 6 = 16:
-
-```
-High + static-max = 76
-High + Confirmed  = 60 + 20 = 80  ✓ (above static-max)
-```
-
-The penalty M = 5 ensures exhausted-corpus `NotConfirmed` findings drop below
-equal static-only peers without falling into a different severity tier:
-
-```
-High + NotConfirmed = 60 - 5 = 55  (below High static-only baseline 60)
-Medium + static-max ≈ 46           (still above Medium, no tier cross)
-```
-
-## Per-cap Unsupported rate
-
-The table below summarises the `Unsupported` rate by (cap, language) across the
-in-house curated set at M7 calibration time. Lower is better; the gate budget
-is ≤ 80% per cell.
-
-| Cap               | Language   | Total | Unsupported | Unsup% |
-|-------------------|------------|------:|------------:|-------:|
-| sqli              | java       |    12 |           2 |  16.7% |
-| sqli              | python     |    18 |           3 |  16.7% |
-| sqli              | php        |     9 |           2 |  22.2% |
-| xss               | javascript |    22 |           5 |  22.7% |
-| xss               | typescript |    14 |           4 |  28.6% |
-| xss               | java       |     8 |           3 |  37.5% |
-| cmdi              | python     |    11 |           2 |  18.2% |
-| cmdi              | go         |     7 |           1 |  14.3% |
-| ssrf              | java       |     6 |           1 |  16.7% |
-| ssrf              | javascript |     9 |           2 |  22.2% |
-| path_traversal    | php        |    10 |           3 |  30.0% |
-| deserialize       | java       |     5 |           1 |  20.0% |
-
-All cells are well within the 80% budget. The OWASP Benchmark and SARD sets
-were not available at calibration time; ground truth files should be added to
-`tests/eval_corpus/ground_truth/` and `scripts/m7_ship_gate.sh` re-run when
-the corpora are downloaded.
-
-## False-Confirmed rate
-
-Based on feedback collected from maintainer machines via
-`nyx verify-feedback --wrong` during the M6.5 bughunt sessions:
-
-| Cap     | Confirmed | Wrong | Rate  |
-|---------|----------:|------:|------:|
-| sqli    |        34 |     0 |  0.0% |
-| xss     |        28 |     1 |  3.6% |
-| cmdi    |        12 |     0 |  0.0% |
-| ssrf    |         8 |     0 |  0.0% |
-| overall |        82 |     1 |  1.2% |
-
-The per-cap threshold is 2%. `xss` was 3.6% on a small sample (28 confirmed
-findings); a subsequent corpus update resolved the FP-causing payload variant.
-Rate at final calibration: 0/28 for xss.
-
-## Gate status at M7 merge
-
-All five pre-flip gates passed when `scripts/m7_ship_gate.sh` was run against
-the in-house curated set on the merge commit:
-
-1. **Unsupported rate** — all cells ≤ 80% ✓
-2. **False-Confirmed rate** — ≤ 2% per cap ✓
-3. **Wall-clock cost** — ≤ 2× static-only on benches/fixtures ✓
-4. **Sandbox-escape suite** — all escape fixtures `NotConfirmed` or `Unsupported` ✓
-5. **Repro stability** — 100% of in-house `Confirmed` findings regenerated identical verdict ✓
--- a/docs/recall-validation.md
+++ b/docs/recall-validation.md
@@ -1,237 +0,0 @@
-# Recall validation runbook
-
-The recall-validation harness freezes a finding-shape baseline against
-real-world OSS targets so future engine work can prove "actually lifts
-recall on real code", not just "tests pass". This runbook covers
-re-running the validation against a fresh OSS release.
-
-## Targets
-
-| Target            | Clone URL                                  | Recall items exercised |
-|-------------------|--------------------------------------------|------------------------|
-| `cal_com`         | https://github.com/calcom/cal.com          | 1, 5, 6, 7             |
-| `vercel_commerce` | https://github.com/vercel/commerce         | 1, 4, 7                |
-| `shadcn_examples` | https://github.com/shadcn-ui/ui            | 4, 7                   |
-| `blitz_apps`      | https://github.com/blitz-js/blitz          | 1, 3, 6                |
-
-Item numbering is from `.pitboss/RECALL_GAPS.md`.
-
-## Files
-
-| File                                          | Role                                    |
-|-----------------------------------------------|-----------------------------------------|
-| `scripts/validate_recall.sh`                  | runner (capture + diff modes)           |
-| `tests/recall_targets/<target>.json`          | per-target baseline                     |
-| `tests/recall_gaps.rs::validate_real_world_targets` | schema-validity test (`#[ignore]`)|
-| `tests/recall_gaps_baseline.json`             | corpus regression baseline              |
-
-Baselines live next to the harness rather than under `.pitboss/`:
-pitboss implementer agents are forbidden to write under `.pitboss/`,
-so the baseline files were placed beside the test that consumes them.
-
-## Baseline schema
-
-```json
-{
-  "_doc": "...",
-  "target": "cal_com",
-  "clone_url": "https://github.com/calcom/cal.com",
-  "exercises_recall_items": [1, 5, 6, 7],
-  "captured_against": "real-scan @ <sha>",
-  "captured_on": "YYYY-MM-DD",
-  "pinned_commit": "<sha>",
-  "findings": [
-    {
-      "rule_id": "taint-unsanitised-flow",
-      "path_suffix": "packages/...",
-      "line": 130,
-      "severity": "High",
-      "verdict": "TP" | "FP" | "needs_review",
-      "note": "..."
-    }
-  ]
-}
-```
-
-The diff key is `(rule_id, path_suffix, line)`. The `verdict` field
-must be one of `TP`, `FP`, or `needs_review`; unknown verdicts are
-rejected by the schema test.
-
-## Usage
-
-### Diff a fresh scan against the frozen baseline
-
-```bash
-scripts/validate_recall.sh cal_com /path/to/cal.com
-```
-
-Output is a JSON object `{ added, removed, unchanged, *_total }`
-keyed by `rule_id`. Use this to spot intentional recall lift
-(`added`) and regressions (`removed`).
-
-### Refresh the baseline after an intentional recall lift
-
-```bash
-scripts/validate_recall.sh cal_com /path/to/cal.com --capture
-```
-
-This overwrites `tests/recall_targets/cal_com.json` with the current
-scan output. Every finding is re-marked `verdict: "needs_review"`;
-hand-label `TP`/`FP` afterwards as you triage.
-
-### Schema-validity check
-
-```bash
-cargo test --release --test recall_gaps -- --ignored validate_real_world_targets
-```
-
-Loads each per-target JSON, asserts the required keys exist, and
-asserts every finding carries a valid verdict label.
-
-## Refresh procedure
-
-1. Clone or pull the target repo into `~/oss/<target>` (or wherever).
-2. Build nyx: `cargo build --release`.
-3. Run the diff in plain mode to see what changed:
-   `scripts/validate_recall.sh <target> ~/oss/<target>`.
-4. If the lift is intentional, recapture:
-   `scripts/validate_recall.sh <target> ~/oss/<target> --capture`.
-5. Spot-check a handful of new findings. Open the file at
-   `path_suffix:line` and confirm the source-to-sink flow is real.
-   Hand-label them `TP`/`FP`.
-6. Commit the updated `tests/recall_targets/<target>.json`.
-
-## Known captured baselines (2026-05-08)
-
-| Target            | Pinned commit | Findings | TP | FP | needs_review |
-|-------------------|---------------|----------|----|----|--------------|
-| `cal_com`         | `d278d6c9`    | 662      | 0  | 4  | 658          |
-| `vercel_commerce` | unknown       | 0 (placeholder) |    |    |              |
-| `shadcn_examples` | unknown       | 0 (placeholder) |    |    |              |
-| `blitz_apps`      | unknown       | 0 (placeholder) |    |    |              |
-
-The `cal_com` capture used commit `d278d6c9bc535bf3f2c6ba0607654f78dd74d6ee`
-(`refactor: remove dead insights references (#29029)`). The 4 `FP`
-labels are `ts.crypto.math_random` hits inside `apps/web/playwright/`
-test fixtures, which are not a security context.
-
-The other three targets ship as placeholders (empty `findings`).
-Nobody has cloned them locally yet. Run `validate_recall.sh
-<target> <clone> --capture` to populate. The schema test still passes
-because `[]` is a valid `findings` array with zero entries to check.
-
-## Perf baseline
-
-The frozen JS-target perf snapshot lives in
-`tests/recall_targets/perf_after.txt`. Compare against the
-`captured_against` snapshot in `tests/recall_gaps_baseline.json`
-(`corpus_finding_lines.findings_total` = 1121, captured at master
-`ea82ea98`). The acceptance bar: scanner throughput on the existing
-`tests/fixtures/` corpus must regress by no more than 15%. Future
-recall work uses the same corpus and the same record file to measure
-its own perf delta.
-
-## Cross-language runbook
-
-The JS-target baselines above only cover JS/TS. Cross-language
-baselines mirror that work against real-world non-JS targets so
-multi-language engine changes can be measured against actual code,
-not just synthetic fixtures. Per-lang baselines live under
-`tests/recall_targets/xlang/<lang>/<target>.json` and the runner
-accepts a `--lang` flag to select the target set.
-
-### Cross-language targets
-
-| Lang   | Target       | Clone URL                                    | Pinned commit (capture) | Findings | Notes |
-|--------|--------------|----------------------------------------------|-------------------------|----------|-------|
-| php    | phpmyadmin   | https://github.com/phpmyadmin/phpmyadmin     | `ddf4e993`              | 119      | DBA UI; XSS / `php.deser` / `cfg-unguarded-sink` heavy. |
-| php    | joomla       | https://github.com/joomla/joomla-cms         | `7e8527d0`              | 83       | CMS; `php.deser.unserialize` and `php.path.include_variable` clusters. |
-| php    | drupal       | https://github.com/drupal/drupal             | `92aa759e`              | 635      | CMS / DI container; `cfg-unguarded-sink` (198) and `taint-prototype-pollution` (121) dominant. |
-| php    | nextcloud    | https://github.com/nextcloud/server          | `5c0fe4c3`              | 262      | File-sync platform; `cfg-resource-leak` / `state-resource-leak` heavy. |
-| java   | openmrs      | https://github.com/openmrs/openmrs-core      | `f9c76db2`              | 273      | Hibernate-heavy; JPA Criteria fix from `project_realrepo_openmrs.md` already applied. |
-| python | airflow      | https://github.com/apache/airflow            | `3d42610a`              | 892      | Scheduler / DAG runner; `cfg-unguarded-sink` (252) and `taint-unsanitised-flow` (179) lead. |
-| python | flask        | https://github.com/pallets/flask             | placeholder             | 0        | Smaller-surface Python framework; capture deferred. |
-| go     | gin          | https://github.com/gin-gonic/gin             | `d3ffc998`              | 20       | HTTP framework test corpus; `taint-header-injection` and TLS skip-verify in tests. |
-| rust   | axum         | https://github.com/tokio-rs/axum             | placeholder             | 0        | Not cloned in pitboss sandbox at capture time; populate locally. |
-| ruby   | rails        | https://github.com/rails/rails               | placeholder             | 0        | Capture against the `actionpack/` subtree once cloned. |
-
-Captures dated `2026-05-09` (UTC). Counts are deduplicated tuples
-`(rule_id, path_suffix, line)`. Duplicate raw findings collapse on
-the diff key, so the schema-test count and diff-mode `unchanged_total`
-may differ from the `findings | length` total by a handful of
-duplicate sites. The diff key is what matters for regression
-detection.
-
-### Per-lang TP/FP splits
-
-Every captured finding ships with `verdict: "needs_review"` from
-`--capture`. Hand-triage is bounded but pending; none of the cross-
-language captures are sweep-labelled yet. Use the per-lang dominant
-rule_id clusters above as the priority queue:
-
-- **PHP**: `cfg-unguarded-sink` and `taint-prototype-pollution` are
-  the FP-dominant clusters across drupal / nextcloud / phpmyadmin
-  (CMS routing + JS object construction). `php.deser.unserialize` is
-  the highest-value TP cluster on joomla (17) and drupal (83). See
-  `project_realrepo_joomla.md` 2026-05-03 for the magic-method
-  passthrough fix that already filters one shape.
-- **Java**: `taint-unsanitised-flow` (61) and `state-resource-leak`
-  (60) are openmrs's leading clusters. The JPA Criteria-API fix
-  already absorbed the `cfg-unguarded-sink` cluster (216 to 24);
-  remaining Hibernate / Spring resource-management FPs are the next
-  triage target.
-- **Python**: `cfg-unguarded-sink` (252) on airflow is dominated by
-  Airflow's scheduler / DB plumbing; `py.auth.token_override_*`
-  (83) and `py.auth.missing_ownership_check` (61) are the auth-rule
-  noise typical of an admin/operator codebase.
-- **Go**: gin's 20 findings are mostly test-corpus artifacts
-  (`gin_test.go`, `routes_test.go`); 4 of 4 `go.transport.insecure_skip_verify`
-  hits are inside `gin*_test.go` and are legitimate test setup.
-- **Rust / Ruby**: placeholder. Capture once a local clone exists.
-
-### `--lang` runner usage
-
-```bash
-# diff mode (default)
-scripts/validate_recall.sh --lang php drupal /Users/me/oss/drupal
-scripts/validate_recall.sh --lang java openmrs /Users/me/oss/openmrs
-
-# capture / refresh
-scripts/validate_recall.sh --lang go gin /Users/me/oss/gin --capture
-```
-
-Output is the same `{ added, removed, unchanged, *_total }` JSON shape
-as the JS-target diff. The diff key is `(rule_id, path_suffix, line)`.
-
-### Cross-language refresh procedure
-
-1. Clone or update the target into `~/oss/<target>` (or wherever).
-2. Build nyx: `cargo build --release`.
-3. Diff vs the frozen baseline:
-   `scripts/validate_recall.sh --lang <lang> <target> ~/oss/<target>`.
-4. If the lift is intentional, recapture with `--capture`.
-5. Spot-check new findings; hand-label `TP`/`FP`.
-6. Commit the updated `tests/recall_targets/xlang/<lang>/<target>.json`.
-
-### Sandbox-capture caveat
-
-Pitboss implementer agents run sandboxed without network egress, so
-target repos that are not already present under `~/oss/` ship as
-placeholders (`pinned_commit: "unknown"`, `findings: []`). The
-current cross-language baselines cover php / java / python / go
-(every target whose repo was already cloned locally) and ship
-placeholders for `rust/axum`, `ruby/rails`, and `python/flask`. The
-schema test in `validate_real_world_targets` passes against
-placeholders because `[]` is a valid `findings` array.
-
-## What lives where (quick reference)
-
-- Targets list and recall-item mapping in this file.
-- Per-target JS findings under `tests/recall_targets/<target>.json`.
-- Per-target cross-lang findings under `tests/recall_targets/xlang/<lang>/<target>.json`.
-- Diff/capture runner at `scripts/validate_recall.sh` (accepts `--lang`).
-- Schema-validity test at `tests/recall_gaps.rs::validate_real_world_targets`.
-- Corpus regression baseline at `tests/recall_gaps_baseline.json`.
-- Perf records at `tests/recall_targets/perf_after.txt` (JS-target
-  snapshot) and `tests/recall_targets/perf_after_xlang.txt`
-  (cross-language delta).
--- a/frontend/tsconfig.tsbuildinfo
+++ b/frontend/tsconfig.tsbuildinfo
@@ -1 +1 @@
-{"root":["./src/app.tsx","./src/main.tsx","./src/vite-env.d.ts","./src/api/client.ts","./src/api/queryclient.ts","./src/api/types.ts","./src/api/mutations/baseline.ts","./src/api/mutations/config.ts","./src/api/mutations/rules.ts","./src/api/mutations/scans.ts","./src/api/mutations/triage.ts","./src/api/queries/config.ts","./src/api/queries/debug.ts","./src/api/queries/explorer.ts","./src/api/queries/findings.ts","./src/api/queries/health.ts","./src/api/queries/overview.ts","./src/api/queries/rules.ts","./src/api/queries/scans.ts","./src/api/queries/triage.ts","./src/components/copymarkdownbutton.tsx","./src/components/charts/horizontalbarchart.tsx","./src/components/charts/linechart.tsx","./src/components/data-display/codeviewer.tsx","./src/components/data-display/filetree.tsx","./src/components/explorer/analysisworkspace.tsx","./src/components/icons/icons.tsx","./src/components/layout/applayout.tsx","./src/components/layout/headerbar.tsx","./src/components/layout/sidebar.tsx","./src/components/overview/overviewwidgets.tsx","./src/components/ui/commandpalette.tsx","./src/components/ui/dropdown.tsx","./src/components/ui/emptystate.tsx","./src/components/ui/errorstate.tsx","./src/components/ui/loadingstate.tsx","./src/components/ui/modal.tsx","./src/components/ui/pagination.tsx","./src/components/ui/shortcutshelp.tsx","./src/components/ui/statcard.tsx","./src/components/ui/toaster.tsx","./src/contexts/ssecontext.tsx","./src/contexts/themecontext.tsx","./src/contexts/toastcontext.tsx","./src/graph/styles.ts","./src/graph/types.ts","./src/graph/adapters/callgraph.ts","./src/graph/adapters/cfg.ts","./src/graph/components/callgraphcanvas.tsx","./src/graph/components/cfggraphcanvas.tsx","./src/graph/components/graphtoolbar.tsx","./src/graph/hooks/useelklayout.ts","./src/graph/layout/elk.ts","./src/graph/layout/text.ts","./src/graph/reduction/cfgcompaction.ts","./src/graph/reduction/neighborhood.ts","./src/graph/rendering/sigma/sigmagraph.tsx","./src/graph/rendering/sigma/buildgraph.ts","./src/graph/rendering/sigma/edgeoverlay.ts","./src/hooks/usechordnavigation.ts","./src/hooks/usedebounce.ts","./src/hooks/usefiletree.ts","./src/hooks/usefindingsurlstate.ts","./src/hooks/usekeyboardshortcuts.ts","./src/hooks/usepagetitle.ts","./src/hooks/usepersistedstate.ts","./src/modals/codeviewermodal.tsx","./src/modals/newscanmodal.tsx","./src/pages/configpage.tsx","./src/pages/explorerpage.tsx","./src/pages/findingdetailpage.tsx","./src/pages/findingspage.tsx","./src/pages/overviewpage.tsx","./src/pages/rulespage.tsx","./src/pages/scancomparepage.tsx","./src/pages/scandetailpage.tsx","./src/pages/scanspage.tsx","./src/pages/triagepage.tsx","./src/pages/debug/abstractinterppage.tsx","./src/pages/debug/authanalysispage.tsx","./src/pages/debug/callgraphpage.tsx","./src/pages/debug/cfgviewerpage.tsx","./src/pages/debug/debuglayout.tsx","./src/pages/debug/functionselector.tsx","./src/pages/debug/pointerviewerpage.tsx","./src/pages/debug/ssaviewerpage.tsx","./src/pages/debug/summaryexplorerpage.tsx","./src/pages/debug/symexpage.tsx","./src/pages/debug/taintviewerpage.tsx","./src/pages/debug/typefactspage.tsx","./src/test/setup.ts","./src/test/api/client.test.ts","./src/test/components/pagination.test.tsx","./src/test/components/statcard.test.tsx","./src/test/components/statecomponents.test.tsx","./src/test/graph/cfgadapter.test.ts","./src/test/graph/compactgraph.test.ts","./src/test/graph/nodestyles.test.ts","./src/test/hooks/usedebounce.test.ts","./src/test/utils/findingmarkdown.test.ts","./src/test/utils/formatdate.test.ts","./src/test/utils/syntaxhighlight.test.ts","./src/test/utils/truncpath.test.ts","./src/utils/findingmarkdown.ts","./src/utils/formatdate.ts","./src/utils/parsenote.ts","./src/utils/syntaxhighlight.ts","./src/utils/truncpath.ts"],"version":"6.0.3"}
+{"root":["./src/app.tsx","./src/main.tsx","./src/vite-env.d.ts","./src/api/client.ts","./src/api/queryclient.ts","./src/api/types.ts","./src/api/mutations/baseline.ts","./src/api/mutations/config.ts","./src/api/mutations/rules.ts","./src/api/mutations/scans.ts","./src/api/mutations/triage.ts","./src/api/queries/config.ts","./src/api/queries/debug.ts","./src/api/queries/explorer.ts","./src/api/queries/findings.ts","./src/api/queries/health.ts","./src/api/queries/overview.ts","./src/api/queries/rules.ts","./src/api/queries/scans.ts","./src/api/queries/triage.ts","./src/components/copymarkdownbutton.tsx","./src/components/verdictbadge.tsx","./src/components/charts/horizontalbarchart.tsx","./src/components/charts/linechart.tsx","./src/components/data-display/codeviewer.tsx","./src/components/data-display/filetree.tsx","./src/components/explorer/analysisworkspace.tsx","./src/components/icons/icons.tsx","./src/components/layout/applayout.tsx","./src/components/layout/headerbar.tsx","./src/components/layout/sidebar.tsx","./src/components/overview/overviewwidgets.tsx","./src/components/ui/commandpalette.tsx","./src/components/ui/dropdown.tsx","./src/components/ui/emptystate.tsx","./src/components/ui/errorstate.tsx","./src/components/ui/loadingstate.tsx","./src/components/ui/modal.tsx","./src/components/ui/pagination.tsx","./src/components/ui/shortcutshelp.tsx","./src/components/ui/statcard.tsx","./src/components/ui/toaster.tsx","./src/contexts/ssecontext.tsx","./src/contexts/themecontext.tsx","./src/contexts/toastcontext.tsx","./src/graph/styles.ts","./src/graph/types.ts","./src/graph/adapters/callgraph.ts","./src/graph/adapters/cfg.ts","./src/graph/components/callgraphcanvas.tsx","./src/graph/components/cfggraphcanvas.tsx","./src/graph/components/graphtoolbar.tsx","./src/graph/hooks/useelklayout.ts","./src/graph/layout/elk.ts","./src/graph/layout/text.ts","./src/graph/reduction/cfgcompaction.ts","./src/graph/reduction/neighborhood.ts","./src/graph/rendering/sigma/sigmagraph.tsx","./src/graph/rendering/sigma/buildgraph.ts","./src/graph/rendering/sigma/edgeoverlay.ts","./src/hooks/usechordnavigation.ts","./src/hooks/usedebounce.ts","./src/hooks/usefiletree.ts","./src/hooks/usefindingsurlstate.ts","./src/hooks/usekeyboardshortcuts.ts","./src/hooks/usepagetitle.ts","./src/hooks/usepersistedstate.ts","./src/modals/codeviewermodal.tsx","./src/modals/newscanmodal.tsx","./src/pages/configpage.tsx","./src/pages/explorerpage.tsx","./src/pages/findingdetailpage.tsx","./src/pages/findingspage.tsx","./src/pages/overviewpage.tsx","./src/pages/rulespage.tsx","./src/pages/scancomparepage.tsx","./src/pages/scandetailpage.tsx","./src/pages/scanspage.tsx","./src/pages/triagepage.tsx","./src/pages/debug/abstractinterppage.tsx","./src/pages/debug/authanalysispage.tsx","./src/pages/debug/callgraphpage.tsx","./src/pages/debug/cfgviewerpage.tsx","./src/pages/debug/debuglayout.tsx","./src/pages/debug/functionselector.tsx","./src/pages/debug/pointerviewerpage.tsx","./src/pages/debug/ssaviewerpage.tsx","./src/pages/debug/summaryexplorerpage.tsx","./src/pages/debug/symexpage.tsx","./src/pages/debug/taintviewerpage.tsx","./src/pages/debug/typefactspage.tsx","./src/test/setup.ts","./src/test/api/client.test.ts","./src/test/components/pagination.test.tsx","./src/test/components/statcard.test.tsx","./src/test/components/dynamicverdictsection.test.tsx","./src/test/components/statecomponents.test.tsx","./src/test/components/verdictbadge.test.tsx","./src/test/graph/cfgadapter.test.ts","./src/test/graph/compactgraph.test.ts","./src/test/graph/nodestyles.test.ts","./src/test/hooks/usedebounce.test.ts","./src/test/modals/newscanmodal.test.tsx","./src/test/utils/findingmarkdown.test.ts","./src/test/utils/formatdate.test.ts","./src/test/utils/syntaxhighlight.test.ts","./src/test/utils/truncpath.test.ts","./src/utils/findingmarkdown.ts","./src/utils/formatdate.ts","./src/utils/parsenote.ts","./src/utils/syntaxhighlight.ts","./src/utils/truncpath.ts"],"version":"6.0.3"}
--- a/scripts/m7_ship_gate.sh
+++ b/scripts/m7_ship_gate.sh
@@ -132,16 +132,19 @@ else
  if [[ ! -d "$BENCH_DIR" ]]; then
    info "Gate 3: benches/fixtures not found; skipping"
  else
+    # Portable epoch-millis. BSD date (macOS) lacks %3N; GNU date has it.
+    ms_now() { python3 -c 'import time; print(int(time.time()*1000))'; }
+
    # Static-only baseline.
-    T_STATIC_START=$(date +%s%3N)
+    T_STATIC_START=$(ms_now)
    "$NYX_BIN" scan --no-verify --format json --no-index "$BENCH_DIR" > /dev/null 2>&1 || true
-    T_STATIC_END=$(date +%s%3N)
+    T_STATIC_END=$(ms_now)
    T_STATIC=$(( T_STATIC_END - T_STATIC_START ))

    # Default (with verify).
-    T_VERIFY_START=$(date +%s%3N)
+    T_VERIFY_START=$(ms_now)
    "$NYX_BIN" scan --format json --no-index "$BENCH_DIR" > /dev/null 2>&1 || true
-    T_VERIFY_END=$(date +%s%3N)
+    T_VERIFY_END=$(ms_now)
    T_VERIFY=$(( T_VERIFY_END - T_VERIFY_START ))

    info "  static-only: ${T_STATIC}ms  with-verify: ${T_VERIFY}ms"
--- a/src/dynamic/toolchain.rs
+++ b/src/dynamic/toolchain.rs
@@ -273,12 +273,17 @@ fn default_python() -> ToolchainResolution {
 fn extract_version_from_toml_value(line: &str) -> Option<String> {
    let after_eq = line.splitn(2, '=').nth(1)?;
    let raw = after_eq.trim().trim_matches('"').trim_matches('\'');
-    // Strip leading comparators: >=, <=, ==, ~=, ^, >
-    let ver = raw.trim_start_matches(|c: char| !c.is_ascii_digit());
-    if ver.is_empty() {
+    if raw.is_empty() {
        return None;
    }
-    Some(ver.to_owned())
+    // If the value begins with a digit (after stripping comparators), it is a
+    // semver pin like ">=1.75". Otherwise it is a channel name like "stable" /
+    // "nightly" / "beta" — return verbatim so `map_rust_version` can dispatch.
+    let trimmed = raw.trim_start_matches(|c: char| !c.is_ascii_digit() && !c.is_ascii_alphabetic());
+    if trimmed.starts_with(|c: char| c.is_ascii_digit()) {
+        return Some(trimmed.to_owned());
+    }
+    Some(trimmed.to_owned())
 }

 /// Map a raw version string to a Nyx reference toolchain ID.
@@ -433,6 +438,13 @@ fn extract_version_from_json_value(line: &str) -> Option<String> {
    let after_colon = line.splitn(2, ':').nth(1)?;
    let raw = after_colon.trim().trim_matches('"').trim_matches('\'');
    let ver = raw.trim_start_matches(|c: char| !c.is_ascii_digit());
+    // Strip trailing junk: stop at the first char that isn't a version char.
+    // Handles single-line JSON like `{"php": ">=8.1"}}` where the previous
+    // trim still leaves `8.1"}}`.
+    let end = ver
+        .find(|c: char| !(c.is_ascii_digit() || c == '.' || c == '-'))
+        .unwrap_or(ver.len());
+    let ver = &ver[..end];
    // Strip trailing .x or .* wildcards.
    let ver = if let Some(pos) = ver.find(".x") {
        &ver[..pos]
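
Note on the hunk above: the new trailing-junk step can be tried out quickly in isolation. The following is a Python rendering of the trimming steps visible in the hunk, offered as a sketch only. The function name `extract_version` is hypothetical, the wildcard handling that follows in the Rust source is omitted, and returning `None` when no colon is present is an assumption that mirrors the `?` on `nth(1)`.

```python
DIGITS = "0123456789"


def extract_version(line):
    """Sketch of the trimming steps from the hunk; not the full Rust function."""
    _, sep, after_colon = line.partition(":")
    if not sep:
        return None  # no key/value separator on this line
    raw = after_colon.strip().strip('"').strip("'")
    # Drop leading comparators such as >=, ^ or ~ (everything before the first digit).
    start = next((i for i, c in enumerate(raw) if c in DIGITS), len(raw))
    ver = raw[start:]
    # Step added by this commit: stop at the first character that is not
    # a digit, '.' or '-', so trailing quotes and braces are discarded.
    end = next((i for i, c in enumerate(ver) if c not in DIGITS + ".-"), len(ver))
    return ver[:end]


print(extract_version('{"php": ">=8.1"}}'))
print(extract_version('"php": "^8.2",'))
```

With the single-line JSON input from the code comment, the leading trim leaves `8.1"}}` and the added step cuts it back to `8.1`.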
--- a/tests/dynamic_parity.rs
+++ b/tests/dynamic_parity.rs
@@ -104,6 +104,7 @@ mod parity_tests {
            },
            project_root: None,
            db_path: None,
+            verify_all_confidence: false,
        }
    }

@@ -116,6 +117,7 @@ mod parity_tests {
            },
            project_root: None,
            db_path: None,
+            verify_all_confidence: false,
        }
    }

--- a/tests/dynamic_sandbox_escape.rs
+++ b/tests/dynamic_sandbox_escape.rs
@@ -58,17 +58,15 @@ mod escape_tests {
            backend: SandboxBackend::Docker,
            env_passthrough: vec![],
            output_limit: 65536,
+            oob_listener: None,
        }
    }

-    /// Minimal no-op payload (escape scripts ignore NYX_PAYLOAD).
-    fn noop_payload() -> nyx_scanner::dynamic::corpus::Payload {
-        nyx_scanner::dynamic::corpus::Payload {
-            bytes: b"",
-            label: "escape-noop",
-            oracle: nyx_scanner::dynamic::corpus::Oracle::ExitStatus(1),
-            is_benign: true,
-        }
+    /// Minimal no-op payload bytes (escape scripts ignore NYX_PAYLOAD).
+    /// `sandbox::run` takes `&[u8]` directly; the CuratedPayload struct lives
+    /// one level up in the runner.
+    fn noop_payload() -> &'static [u8] {
+        b""
    }

    /// Copy a directory tree into a destination (creating it if needed).
--- a/tests/eval_corpus/ground_truth/owasp_benchmark_v1.2.json
+++ b/tests/eval_corpus/ground_truth/owasp_benchmark_v1.2.json
--- a/tests/eval_corpus/owasp_gt_convert.py
+++ b/tests/eval_corpus/owasp_gt_convert.py
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Convert OWASP Benchmark v1.2 expectedresults-*.csv into nyx ground-truth JSON.
+
+Source: `expectedresults-1.2beta.csv` shipped in the BenchmarkJava repo.
+Output: list of `{path, line, cap, vuln}` records, where:
+  - `path` is the absolute path to the BenchmarkTest*.java under --corpus-dir.
+  - `line` is 0 (CSV does not pin a line; tabulate uses LINE_TOLERANCE on findings).
+  - `cap` is a nyx cap label mapped from the OWASP category column.
+  - `vuln` is True for `real vulnerability == true`, else False.
+
+Usage:
+  tests/eval_corpus/owasp_gt_convert.py \\
+      --corpus-dir ~/.cache/nyx/eval_corpus/owasp_benchmark_v1.2 \\
+      --output     tests/eval_corpus/ground_truth/owasp_benchmark_v1.2.json
+"""
+
+import argparse
+import csv
+import json
+import sys
+from pathlib import Path
+
+OWASP_TO_NYX_CAP = {
+    "cmdi":        "cmdi",
+    "crypto":      "crypto",
+    "hash":        "crypto",
+    "ldapi":       "ldap_injection",
+    "pathtraver":  "path_traversal",
+    "securecookie": "auth",
+    "sqli":        "sqli",
+    "trustbound":  "xss",
+    "weakrand":    "crypto",
+    "xpathi":      "xpath_injection",
+    "xss":         "xss",
+}
+
+
+def main() -> int:
+    p = argparse.ArgumentParser()
+    p.add_argument("--corpus-dir", required=True,
+                   help="Path to BenchmarkJava clone root.")
+    p.add_argument("--output", required=True,
+                   help="Output ground-truth JSON path.")
+    p.add_argument("--csv", default="",
+                   help="Override CSV path (default: <corpus-dir>/expectedresults-1.2beta.csv).")
+    args = p.parse_args()
+
+    corpus = Path(args.corpus_dir).expanduser().resolve()
+    csv_path = Path(args.csv) if args.csv else corpus / "expectedresults-1.2beta.csv"
+    if not csv_path.exists():
+        print(f"error: csv not found: {csv_path}", file=sys.stderr)
+        return 1
+
+    java_root = corpus / "src" / "main" / "java" / "org" / "owasp" / "benchmark" / "testcode"
+    if not java_root.is_dir():
+        print(f"error: java testcode dir not found: {java_root}", file=sys.stderr)
+        return 1
+
+    records: list[dict] = []
+    skipped = 0
+    with open(csv_path) as f:
+        reader = csv.reader(f)
+        next(reader, None)
+        for row in reader:
+            if len(row) < 3:
+                continue
+            name, category, real_vuln = row[0].strip(), row[1].strip(), row[2].strip().lower()
+            cap = OWASP_TO_NYX_CAP.get(category)
+            if cap is None:
+                skipped += 1
+                continue
+            java_file = java_root / f"{name}.java"
+            if not java_file.exists():
+                skipped += 1
+                continue
+            records.append({
+                "path": str(java_file),
+                "line": 0,
+                "cap":  cap,
+                "vuln": real_vuln == "true",
+            })
+
+    out = Path(args.output).expanduser().resolve()
+    out.parent.mkdir(parents=True, exist_ok=True)
+    with open(out, "w") as f:
+        json.dump(records, f, indent=2)
+
+    vuln_count = sum(1 for r in records if r["vuln"])
+    print(f"wrote {len(records)} records to {out}")
+    print(f"  vulns:    {vuln_count}")
+    print(f"  non-vuln: {len(records) - vuln_count}")
+    print(f"  skipped:  {skipped}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tests/eval_corpus/run.sh
+++ b/tests/eval_corpus/run.sh
@ -147,7 +147,23 @@ fi
 # ── Emit summary table ────────────────────────────────────────────────────────
 info ""
 info "Results written to: $RESULTS_JSON"
-python3 "${SCRIPT_DIR}/report.py" --results "$RESULTS_JSON" \
-  || { info "report.py not available; raw results at $RESULTS_JSON"; exit 0; }

 [[ -n "$OUTPUT_DIR" ]] && cp "$RESULTS_JSON" "${OUTPUT_DIR}/eval_results.json"
+
+if [[ ! -f "${SCRIPT_DIR}/report.py" ]]; then
+  info "report.py not available; raw results at $RESULTS_JSON"
+  exit 0
+fi
+
+set +e
+python3 "${SCRIPT_DIR}/report.py" --results "$RESULTS_JSON"
+REPORT_RC=$?
+set -e
+# Propagate gate-fail (exit 2). Treat other non-zero as setup error (exit 1).
+if [[ $REPORT_RC -eq 2 ]]; then
+  exit 2
+elif [[ $REPORT_RC -ne 0 ]]; then
+  info "report.py crashed (exit $REPORT_RC); raw results at $RESULTS_JSON"
+  exit 1
+fi
+exit 0
--- a/tests/eval_corpus/sard_gt_convert.py
+++ b/tests/eval_corpus/sard_gt_convert.py
@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+"""Convert NIST SARD manifest XML into nyx ground-truth JSON.
+
+SARD ships per-test-case `manifest.xml` files alongside source. Each
+`<testcase>` lists one or more `<file path="…">` entries with optional
+`<flaw line="…" name="CWE-XXX_…"/>` children.
+
+Output schema (consumed by tabulate.py):
+  list of {"path", "line", "cap", "vuln"} records.
+
+Usage:
+  tests/eval_corpus/sard_gt_convert.py \\
+      --corpus-dir ~/.cache/nyx/eval_corpus/nist_sard \\
+      --output     tests/eval_corpus/ground_truth/nist_sard.json
+"""
+
+import argparse
+import json
+import re
+import sys
+import xml.etree.ElementTree as ET
+from pathlib import Path
+
+CWE_TO_NYX_CAP = {
+    "20":  "validation",
+    "22":  "path_traversal",
+    "78":  "cmdi",
+    "79":  "xss",
+    "89":  "sqli",
+    "90":  "ldap_injection",
+    "91":  "xpath_injection",
+    "94":  "cmdi",
+    "113": "header_injection",
+    "117": "header_injection",
+    "190": "memory",
+    "200": "data_exfil",
+    "287": "auth",
+    "295": "crypto",
+    "311": "crypto",
+    "327": "crypto",
+    "328": "crypto",
+    "330": "crypto",
+    "352": "auth",
+    "434": "path_traversal",
+    "476": "memory",
+    "502": "deserialize",
+    "601": "redirect",
+    "611": "xxe",
+    "643": "xpath_injection",
+    "798": "crypto",
+    "918": "ssrf",
+}
+
+CWE_RE = re.compile(r"CWE[-_](\d+)", re.IGNORECASE)
+
+
+def cap_for_flaw(name: str) -> str | None:
+    m = CWE_RE.search(name or "")
+    if not m:
+        return None
+    return CWE_TO_NYX_CAP.get(m.group(1))
+
+
+def main() -> int:
+    p = argparse.ArgumentParser()
+    p.add_argument("--corpus-dir", required=True)
+    p.add_argument("--output", required=True)
+    args = p.parse_args()
+
+    root = Path(args.corpus_dir).expanduser().resolve()
+    if not root.is_dir():
+        print(f"error: corpus dir not found: {root}", file=sys.stderr)
+        return 1
+
+    records: list[dict] = []
+    skipped_files = 0
+    skipped_caps = 0
+
+    for manifest in root.rglob("manifest.xml"):
+        try:
+            tree = ET.parse(manifest)
+        except ET.ParseError as e:
+            print(f"warn: parse failed {manifest}: {e}", file=sys.stderr)
+            continue
+        for tc in tree.iter("testcase"):
+            for fnode in tc.iter("file"):
+                rel = fnode.get("path") or ""
+                if not rel:
+                    continue
+                abs_path = (manifest.parent / rel).resolve()
+                if not abs_path.exists():
+                    skipped_files += 1
+                    continue
+                flaws = list(fnode.iter("flaw")) + list(fnode.iter("mixed"))
+                if not flaws:
+                    records.append({
+                        "path": str(abs_path),
+                        "line": 0,
+                        "cap":  "other",
+                        "vuln": False,
+                    })
+                    continue
+                for flaw in flaws:
+                    cap = cap_for_flaw(flaw.get("name", ""))
+                    if cap is None:
+                        skipped_caps += 1
+                        continue
+                    try:
+                        line = int(flaw.get("line", "0") or 0)
+                    except ValueError:
+                        line = 0
+                    records.append({
+                        "path": str(abs_path),
+                        "line": line,
+                        "cap":  cap,
+                        "vuln": True,
+                    })
+
+    out = Path(args.output).expanduser().resolve()
+    out.parent.mkdir(parents=True, exist_ok=True)
+    with open(out, "w") as f:
+        json.dump(records, f, indent=2)
+
+    vuln_count = sum(1 for r in records if r["vuln"])
+    print(f"wrote {len(records)} records to {out}")
+    print(f"  vulns:           {vuln_count}")
+    print(f"  non-vuln:        {len(records) - vuln_count}")
+    print(f"  skipped (file):  {skipped_files}")
+    print(f"  skipped (cap):   {skipped_caps}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tests/eval_corpus/tabulate.py
+++ b/tests/eval_corpus/tabulate.py
@ -19,25 +19,46 @@ from pathlib import Path

 LINE_TOLERANCE = 5

-_CAP_PREFIX_TABLE = [
-    ("taint.path_traversal", "path_traversal"),
-    ("taint.sql", "sqli"),
-    ("taint.xss", "xss"),
-    ("taint.ssrf", "ssrf"),
-    ("taint.cmdi", "cmdi"),
-    ("taint.deserialize", "deserialize"),
-    ("taint.redirect", "redirect"),
-    ("taint.xxe", "xxe"),
+# Bitflag positions for Cap (src/labels/mod.rs). Sink bits map to a cap label.
+_CAP_BIT_TABLE = [
+    (1 << 5,  "path_traversal"),  # FILE_IO
+    (1 << 6,  "fmt_string"),
+    (1 << 7,  "sqli"),             # SQL_QUERY
+    (1 << 8,  "deserialize"),
+    (1 << 9,  "ssrf"),
+    (1 << 10, "cmdi"),             # CODE_EXEC
+    (1 << 11, "crypto"),
+    (1 << 12, "unauthorized_id"),
+    (1 << 13, "data_exfil"),
+    (1 << 14, "ldap_injection"),
+    (1 << 15, "xpath_injection"),
+    (1 << 16, "header_injection"),
+    (1 << 17, "redirect"),         # OPEN_REDIRECT
+    (1 << 18, "xss"),              # SSTI (template_injection); also covers XSS sinks
+    (1 << 19, "xxe"),
+    (1 << 20, "prototype_pollution"),
+]
+
+# Substring → cap lookup for rule IDs. Order matters: most specific first.
+_CAP_RULE_TABLE = [
    ("path_traversal", "path_traversal"),
-    ("sqli", "sqli"),
-    ("xss", "xss"),
-    ("ssrf", "ssrf"),
-    ("cmdi", "cmdi"),
-    ("deserialize", "deserialize"),
-    ("redirect", "redirect"),
-    ("xxe", "xxe"),
-    ("auth", "auth"),
-    ("taint", "taint"),
+    ("sql",           "sqli"),
+    ("xss",           "xss"),
+    ("ssrf",          "ssrf"),
+    ("cmdi",          "cmdi"),
+    ("cmd_exec",      "cmdi"),
+    ("code_exec",     "cmdi"),
+    ("deser",         "deserialize"),
+    ("unserialize",   "deserialize"),
+    ("redirect",      "redirect"),
+    ("xxe",           "xxe"),
+    ("template",      "xss"),
+    ("auth",          "auth"),
+    ("memory",        "memory"),
+    ("crypto",        "crypto"),
+    ("data-exfil",    "data_exfil"),
+    ("data_exfil",    "data_exfil"),
+    ("header",        "header_injection"),
 ]


@ -47,9 +68,18 @@ def load_json(path: str) -> object:


 def cap_of(finding: dict) -> str:
-    rule = finding.get("rule_id", "").lower()
-    for prefix, cap in _CAP_PREFIX_TABLE:
-        if rule.startswith(prefix):
+    # 1. Prefer evidence.sink_caps bitmask — the engine's own classification.
+    ev = finding.get("evidence", {}) or {}
+    sink_caps = ev.get("sink_caps")
+    if isinstance(sink_caps, int) and sink_caps:
+        for bit, name in _CAP_BIT_TABLE:
+            if sink_caps & bit:
+                return name
+    # 2. Fall back to rule id substring (e.g. py.cmdi.os_system, java.deser.readobject).
+    rid = (finding.get("id") or "").lower()
+    head = rid.split(" ", 1)[0]
+    for needle, cap in _CAP_RULE_TABLE:
+        if needle in head:
            return cap
    return "other"

@ -122,8 +152,9 @@ def main() -> int:
            for idx, gt_entry in enumerate(gt_true):
                if (gt_entry["path"] == f_path
                        and gt_entry["cap"] == f_cap
-                        and abs(gt_entry["line"] - f_line) <= LINE_TOLERANCE
-                        and idx not in matched_gt):
+                        and idx not in matched_gt
+                        and (gt_entry["line"] == 0
+                             or abs(gt_entry["line"] - f_line) <= LINE_TOLERANCE)):
                    matched_idx = idx
                    break
            if matched_idx is not None:
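
The last hunk above relaxes the line check so that a ground-truth record with `line == 0` matches a finding at any line in the same file. Both converters emit such records (the OWASP converter always writes `"line": 0`). The sketch below isolates that predicate. `gt_matches` is a hypothetical helper, not a function in `tabulate.py`, the file names are illustrative, and the `matched_gt` bookkeeping is left out.

```python
LINE_TOLERANCE = 5  # same value as the constant in tabulate.py


def gt_matches(gt_entry: dict, f_path: str, f_cap: str, f_line: int) -> bool:
    """Return True when a finding lines up with a ground-truth record.

    Hypothetical helper for illustration; tabulate.py inlines this check.
    """
    if gt_entry["path"] != f_path or gt_entry["cap"] != f_cap:
        return False
    # line == 0 means the record carries no line number (file-level ground
    # truth), so a finding at any line in the file is accepted.
    if gt_entry["line"] == 0:
        return True
    return abs(gt_entry["line"] - f_line) <= LINE_TOLERANCE


# Illustrative records in the shape both converters emit.
owasp_style = {"path": "BenchmarkTest00001.java", "line": 0,
               "cap": "path_traversal", "vuln": True}
sard_style = {"path": "CWE89_example.java", "line": 40,
              "cap": "sqli", "vuln": True}

print(gt_matches(owasp_style, "BenchmarkTest00001.java", "path_traversal", 212))  # True
print(gt_matches(sard_style, "CWE89_example.java", "sqli", 43))                   # True
print(gt_matches(sard_style, "CWE89_example.java", "sqli", 50))                   # False
```

Under the previous condition the first call would not match, because `abs(0 - 212)` exceeds `LINE_TOLERANCE`. Every file-level record would then be counted as missed.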