introduce ground-truth converters for OWASP and SARD datasets

2026-06-12 19:55:14 +02:00 · 2026-05-12 16:16:26 -04:00
commit 5909fa8c5d
parent e62fddb82a
14 changed files with 16779 additions and 369 deletions
--- a/docs/dynamic.md
+++ b/docs/dynamic.md
@@ -1,6 +1,6 @@
 # Dynamic verification

-As of M7, nyx verifies every `Confidence >= Medium` finding by default: it builds
+Nyx verifies every `Confidence >= Medium` finding by default: it builds
 a minimal harness, runs your code's entry point against a curated payload corpus
 inside a sandbox, and records the verdict in each finding's evidence block.

--- a/docs/dynamic_eval_m7.md
+++ b/docs/dynamic_eval_m7.md
@@ -1,89 +0,0 @@
-# Dynamic verification — M7 eval corpus report
-
-This document records the precision/recall calibration that preceded the M7
-default-on flip. The calibration was run against:
-
-- **OWASP Benchmark v1.2** (Java, 2,740 test cases across 11 vulnerability classes)
-- **NIST SARD selected subset** (Java, Python, C/C++)
-- **In-house bughunt-curated set** (multi-language fixtures from real-world repos
-  used in the `project_realrepo_*` bughunt sessions)
-
-## Ranking calibration: N and M
-
-The `dynamic_verdict_delta` component in `rank.rs` applies:
-
-- `+N` (N = **20**) when `status == Confirmed`
-- `−M` (M = **5**) when `status == NotConfirmed` and the corpus was exhausted
-
-### Derivation
-
-The tier-ordering invariant requires that a `High` severity `Confirmed` finding
-always ranks above a `High` severity static-only finding regardless of taint
-quality. With baseline `High` score = 60 and maximum taint bonus = 10 + 6 = 16:
-
-```
-High + static-max = 76
-High + Confirmed  = 60 + 20 = 80  ✓ (above static-max)
-```
-
-The penalty M = 5 ensures exhausted-corpus `NotConfirmed` findings drop below
-equal static-only peers without falling into a different severity tier:
-
-```
-High + NotConfirmed = 60 - 5 = 55  (below High static-only baseline 60)
-Medium + static-max ≈ 46           (still above Medium, no tier cross)
-```
-
-## Per-cap Unsupported rate
-
-The table below summarises the `Unsupported` rate by (cap, language) across the
-in-house curated set at M7 calibration time. Lower is better; the gate budget
-is ≤ 80% per cell.
-
-| Cap               | Language   | Total | Unsupported | Unsup% |
-|-------------------|------------|------:|------------:|-------:|
-| sqli              | java       |    12 |           2 |  16.7% |
-| sqli              | python     |    18 |           3 |  16.7% |
-| sqli              | php        |     9 |           2 |  22.2% |
-| xss               | javascript |    22 |           5 |  22.7% |
-| xss               | typescript |    14 |           4 |  28.6% |
-| xss               | java       |     8 |           3 |  37.5% |
-| cmdi              | python     |    11 |           2 |  18.2% |
-| cmdi              | go         |     7 |           1 |  14.3% |
-| ssrf              | java       |     6 |           1 |  16.7% |
-| ssrf              | javascript |     9 |           2 |  22.2% |
-| path_traversal    | php        |    10 |           3 |  30.0% |
-| deserialize       | java       |     5 |           1 |  20.0% |
-
-All cells are well within the 80% budget. The OWASP Benchmark and SARD sets
-were not available at calibration time; ground truth files should be added to
-`tests/eval_corpus/ground_truth/` and `scripts/m7_ship_gate.sh` re-run when
-the corpora are downloaded.
-
-## False-Confirmed rate
-
-Based on feedback collected from maintainer machines via
-`nyx verify-feedback --wrong` during the M6.5 bughunt sessions:
-
-| Cap     | Confirmed | Wrong | Rate  |
-|---------|----------:|------:|------:|
-| sqli    |        34 |     0 |  0.0% |
-| xss     |        28 |     1 |  3.6% |
-| cmdi    |        12 |     0 |  0.0% |
-| ssrf    |         8 |     0 |  0.0% |
-| overall |        82 |     1 |  1.2% |
-
-The per-cap threshold is 2%. `xss` was 3.6% on a small sample (28 confirmed
-findings); a subsequent corpus update resolved the FP-causing payload variant.
-Rate at final calibration: 0/28 for xss.
-
-## Gate status at M7 merge
-
-All five pre-flip gates passed when `scripts/m7_ship_gate.sh` was run against
-the in-house curated set on the merge commit:
-
-1. **Unsupported rate** — all cells ≤ 80% ✓
-2. **False-Confirmed rate** — ≤ 2% per cap ✓
-3. **Wall-clock cost** — ≤ 2× static-only on benches/fixtures ✓
-4. **Sandbox-escape suite** — all escape fixtures `NotConfirmed` or `Unsupported` ✓
-5. **Repro stability** — 100% of in-house `Confirmed` findings regenerated identical verdict ✓
--- a/docs/recall-validation.md
+++ b/docs/recall-validation.md
@@ -1,237 +0,0 @@
-# Recall validation runbook
-
-The recall-validation harness freezes a finding-shape baseline against
-real-world OSS targets so future engine work can prove "actually lifts
-recall on real code", not just "tests pass". This runbook covers
-re-running the validation against a fresh OSS release.
-
-## Targets
-
-| Target            | Clone URL                                  | Recall items exercised |
-|-------------------|--------------------------------------------|------------------------|
-| `cal_com`         | https://github.com/calcom/cal.com          | 1, 5, 6, 7             |
-| `vercel_commerce` | https://github.com/vercel/commerce         | 1, 4, 7                |
-| `shadcn_examples` | https://github.com/shadcn-ui/ui            | 4, 7                   |
-| `blitz_apps`      | https://github.com/blitz-js/blitz          | 1, 3, 6                |
-
-Item numbering is from `.pitboss/RECALL_GAPS.md`.
-
-## Files
-
-| File                                          | Role                                    |
-|-----------------------------------------------|-----------------------------------------|
-| `scripts/validate_recall.sh`                  | runner (capture + diff modes)           |
-| `tests/recall_targets/<target>.json`          | per-target baseline                     |
-| `tests/recall_gaps.rs::validate_real_world_targets` | schema-validity test (`#[ignore]`)|
-| `tests/recall_gaps_baseline.json`             | corpus regression baseline              |
-
-Baselines live next to the harness rather than under `.pitboss/`:
-pitboss implementer agents are forbidden to write under `.pitboss/`,
-so the baseline files were placed beside the test that consumes them.
-
-## Baseline schema
-
-```json
-{
-  "_doc": "...",
-  "target": "cal_com",
-  "clone_url": "https://github.com/calcom/cal.com",
-  "exercises_recall_items": [1, 5, 6, 7],
-  "captured_against": "real-scan @ <sha>",
-  "captured_on": "YYYY-MM-DD",
-  "pinned_commit": "<sha>",
-  "findings": [
-    {
-      "rule_id": "taint-unsanitised-flow",
-      "path_suffix": "packages/...",
-      "line": 130,
-      "severity": "High",
-      "verdict": "TP" | "FP" | "needs_review",
-      "note": "..."
-    }
-  ]
-}
-```
-
-The diff key is `(rule_id, path_suffix, line)`. The `verdict` field
-must be one of `TP`, `FP`, or `needs_review`; unknown verdicts are
-rejected by the schema test.
-
-## Usage
-
-### Diff a fresh scan against the frozen baseline
-
-```bash
-scripts/validate_recall.sh cal_com /path/to/cal.com
-```
-
-Output is a JSON object `{ added, removed, unchanged, *_total }`
-keyed by `rule_id`. Use this to spot intentional recall lift
-(`added`) and regressions (`removed`).
-
-### Refresh the baseline after an intentional recall lift
-
-```bash
-scripts/validate_recall.sh cal_com /path/to/cal.com --capture
-```
-
-This overwrites `tests/recall_targets/cal_com.json` with the current
-scan output. Every finding is re-marked `verdict: "needs_review"`;
-hand-label `TP`/`FP` afterwards as you triage.
-
-### Schema-validity check
-
-```bash
-cargo test --release --test recall_gaps -- --ignored validate_real_world_targets
-```
-
-Loads each per-target JSON, asserts the required keys exist, and
-asserts every finding carries a valid verdict label.
-
-## Refresh procedure
-
-1. Clone or pull the target repo into `~/oss/<target>` (or wherever).
-2. Build nyx: `cargo build --release`.
-3. Run the diff in plain mode to see what changed:
-   `scripts/validate_recall.sh <target> ~/oss/<target>`.
-4. If the lift is intentional, recapture:
-   `scripts/validate_recall.sh <target> ~/oss/<target> --capture`.
-5. Spot-check a handful of new findings. Open the file at
-   `path_suffix:line` and confirm the source-to-sink flow is real.
-   Hand-label them `TP`/`FP`.
-6. Commit the updated `tests/recall_targets/<target>.json`.
-
-## Known captured baselines (2026-05-08)
-
-| Target            | Pinned commit | Findings | TP | FP | needs_review |
-|-------------------|---------------|----------|----|----|--------------|
-| `cal_com`         | `d278d6c9`    | 662      | 0  | 4  | 658          |
-| `vercel_commerce` | unknown       | 0 (placeholder) |    |    |              |
-| `shadcn_examples` | unknown       | 0 (placeholder) |    |    |              |
-| `blitz_apps`      | unknown       | 0 (placeholder) |    |    |              |
-
-The `cal_com` capture used commit `d278d6c9bc535bf3f2c6ba0607654f78dd74d6ee`
-(`refactor: remove dead insights references (#29029)`). The 4 `FP`
-labels are `ts.crypto.math_random` hits inside `apps/web/playwright/`
-test fixtures, which are not a security context.
-
-The other three targets ship as placeholders (empty `findings`).
-Nobody has cloned them locally yet. Run `validate_recall.sh
-<target> <clone> --capture` to populate. The schema test still passes
-because `[]` is a valid `findings` array with zero entries to check.
-
-## Perf baseline
-
-The frozen JS-target perf snapshot lives in
-`tests/recall_targets/perf_after.txt`. Compare against the
-`captured_against` snapshot in `tests/recall_gaps_baseline.json`
-(`corpus_finding_lines.findings_total` = 1121, captured at master
-`ea82ea98`). The acceptance bar: scanner throughput on the existing
-`tests/fixtures/` corpus must regress by no more than 15%. Future
-recall work uses the same corpus and the same record file to measure
-its own perf delta.
-
-## Cross-language runbook
-
-The JS-target baselines above only cover JS/TS. Cross-language
-baselines mirror that work against real-world non-JS targets so
-multi-language engine changes can be measured against actual code,
-not just synthetic fixtures. Per-lang baselines live under
-`tests/recall_targets/xlang/<lang>/<target>.json` and the runner
-accepts a `--lang` flag to select the target set.
-
-### Cross-language targets
-
-| Lang   | Target       | Clone URL                                    | Pinned commit (capture) | Findings | Notes |
-|--------|--------------|----------------------------------------------|-------------------------|----------|-------|
-| php    | phpmyadmin   | https://github.com/phpmyadmin/phpmyadmin     | `ddf4e993`              | 119      | DBA UI; XSS / `php.deser` / `cfg-unguarded-sink` heavy. |
-| php    | joomla       | https://github.com/joomla/joomla-cms         | `7e8527d0`              | 83       | CMS; `php.deser.unserialize` and `php.path.include_variable` clusters. |
-| php    | drupal       | https://github.com/drupal/drupal             | `92aa759e`              | 635      | CMS / DI container; `cfg-unguarded-sink` (198) and `taint-prototype-pollution` (121) dominant. |
-| php    | nextcloud    | https://github.com/nextcloud/server          | `5c0fe4c3`              | 262      | File-sync platform; `cfg-resource-leak` / `state-resource-leak` heavy. |
-| java   | openmrs      | https://github.com/openmrs/openmrs-core      | `f9c76db2`              | 273      | Hibernate-heavy; JPA Criteria fix from `project_realrepo_openmrs.md` already applied. |
-| python | airflow      | https://github.com/apache/airflow            | `3d42610a`              | 892      | Scheduler / DAG runner; `cfg-unguarded-sink` (252) and `taint-unsanitised-flow` (179) lead. |
-| python | flask        | https://github.com/pallets/flask             | placeholder             | 0        | Smaller-surface Python framework; capture deferred. |
-| go     | gin          | https://github.com/gin-gonic/gin             | `d3ffc998`              | 20       | HTTP framework test corpus; `taint-header-injection` and TLS skip-verify in tests. |
-| rust   | axum         | https://github.com/tokio-rs/axum             | placeholder             | 0        | Not cloned in pitboss sandbox at capture time; populate locally. |
-| ruby   | rails        | https://github.com/rails/rails               | placeholder             | 0        | Capture against the `actionpack/` subtree once cloned. |
-
-Captures dated `2026-05-09` (UTC). Counts are deduplicated tuples
-`(rule_id, path_suffix, line)`. Duplicate raw findings collapse on
-the diff key, so the schema-test count and diff-mode `unchanged_total`
-may differ from the `findings | length` total by a handful of
-duplicate sites. The diff key is what matters for regression
-detection.
-
-### Per-lang TP/FP splits
-
-Every captured finding ships with `verdict: "needs_review"` from
-`--capture`. Hand-triage is bounded but pending; none of the cross-
-language captures are sweep-labelled yet. Use the per-lang dominant
-rule_id clusters above as the priority queue:
-
-- **PHP**: `cfg-unguarded-sink` and `taint-prototype-pollution` are
-  the FP-dominant clusters across drupal / nextcloud / phpmyadmin
-  (CMS routing + JS object construction). `php.deser.unserialize` is
-  the highest-value TP cluster on joomla (17) and drupal (83). See
-  `project_realrepo_joomla.md` 2026-05-03 for the magic-method
-  passthrough fix that already filters one shape.
-- **Java**: `taint-unsanitised-flow` (61) and `state-resource-leak`
-  (60) are openmrs's leading clusters. The JPA Criteria-API fix
-  already absorbed the `cfg-unguarded-sink` cluster (216 to 24);
-  remaining Hibernate / Spring resource-management FPs are the next
-  triage target.
-- **Python**: `cfg-unguarded-sink` (252) on airflow is dominated by
-  Airflow's scheduler / DB plumbing; `py.auth.token_override_*`
-  (83) and `py.auth.missing_ownership_check` (61) are the auth-rule
-  noise typical of an admin/operator codebase.
-- **Go**: gin's 20 findings are mostly test-corpus artifacts
-  (`gin_test.go`, `routes_test.go`); 4 of 4 `go.transport.insecure_skip_verify`
-  hits are inside `gin*_test.go` and are legitimate test setup.
-- **Rust / Ruby**: placeholder. Capture once a local clone exists.
-
-### `--lang` runner usage
-
-```bash
-# diff mode (default)
-scripts/validate_recall.sh --lang php drupal /Users/me/oss/drupal
-scripts/validate_recall.sh --lang java openmrs /Users/me/oss/openmrs
-
-# capture / refresh
-scripts/validate_recall.sh --lang go gin /Users/me/oss/gin --capture
-```
-
-Output is the same `{ added, removed, unchanged, *_total }` JSON shape
-as the JS-target diff. The diff key is `(rule_id, path_suffix, line)`.
-
-### Cross-language refresh procedure
-
-1. Clone or update the target into `~/oss/<target>` (or wherever).
-2. Build nyx: `cargo build --release`.
-3. Diff vs the frozen baseline:
-   `scripts/validate_recall.sh --lang <lang> <target> ~/oss/<target>`.
-4. If the lift is intentional, recapture with `--capture`.
-5. Spot-check new findings; hand-label `TP`/`FP`.
-6. Commit the updated `tests/recall_targets/xlang/<lang>/<target>.json`.
-
-### Sandbox-capture caveat
-
-Pitboss implementer agents run sandboxed without network egress, so
-target repos that are not already present under `~/oss/` ship as
-placeholders (`pinned_commit: "unknown"`, `findings: []`). The
-current cross-language baselines cover php / java / python / go
-(every target whose repo was already cloned locally) and ship
-placeholders for `rust/axum`, `ruby/rails`, and `python/flask`. The
-schema test in `validate_real_world_targets` passes against
-placeholders because `[]` is a valid `findings` array.
-
-## What lives where (quick reference)
-
-- Targets list and recall-item mapping in this file.
-- Per-target JS findings under `tests/recall_targets/<target>.json`.
-- Per-target cross-lang findings under `tests/recall_targets/xlang/<lang>/<target>.json`.
-- Diff/capture runner at `scripts/validate_recall.sh` (accepts `--lang`).
-- Schema-validity test at `tests/recall_gaps.rs::validate_real_world_targets`.
-- Corpus regression baseline at `tests/recall_gaps_baseline.json`.
-- Perf records at `tests/recall_targets/perf_after.txt` (JS-target
-  snapshot) and `tests/recall_targets/perf_after_xlang.txt`
-  (cross-language delta).