introduce ground-truth converters for OWASP and SARD datasets

elipeter 2026-05-12 16:16:26 -04:00
parent e62fddb82a
commit 5909fa8c5d
14 changed files with 16779 additions and 369 deletions


@ -0,0 +1 @@
{"sessionId":"3b3f9549-dbfc-4df7-8b4d-2b6393536381","pid":19723,"procStart":"Tue May 12 19:32:36 2026","acquiredAt":1778614799698}


@ -1,6 +1,6 @@
# Dynamic verification
As of M7, nyx verifies every `Confidence >= Medium` finding by default: it builds
Nyx verifies every `Confidence >= Medium` finding by default: it builds
a minimal harness, runs your code's entry point against a curated payload corpus
inside a sandbox, and records the verdict in each finding's evidence block.


@ -1,89 +0,0 @@
# Dynamic verification — M7 eval corpus report
This document records the precision/recall calibration that preceded the M7
default-on flip. The calibration was run against:
- **OWASP Benchmark v1.2** (Java, 2,740 test cases across 11 vulnerability classes)
- **NIST SARD selected subset** (Java, Python, C/C++)
- **In-house bughunt-curated set** (multi-language fixtures from real-world repos
used in the `project_realrepo_*` bughunt sessions)
## Ranking calibration: N and M
The `dynamic_verdict_delta` component in `rank.rs` applies:
- `+N` (N = **20**) when `status == Confirmed`
- `-M` (M = **5**) when `status == NotConfirmed` and the corpus was exhausted
### Derivation
The tier-ordering invariant requires that a `High` severity `Confirmed` finding
always ranks above a `High` severity static-only finding regardless of taint
quality. With baseline `High` score = 60 and maximum taint bonus = 10 + 6 = 16:
```
High + static-max = 76
High + Confirmed = 60 + 20 = 80 ✓ (above static-max)
```
The penalty M = 5 ensures exhausted-corpus `NotConfirmed` findings drop below
equal static-only peers without falling into a different severity tier:
```
High + NotConfirmed = 60 - 5 = 55 (below High static-only baseline 60)
Medium + static-max ≈ 46 (still above Medium, no tier cross)
```
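The two checks above can be replayed as plain arithmetic. The following is a minimal sketch and not the code in `rank.rs`: the constant and function names are hypothetical, and the Medium baseline of 30 is an assumption inferred from the `≈ 46` figure (46 minus the maximum taint bonus of 16).

```python
# Hypothetical constant names; the values come from the derivation above.
HIGH_BASE = 60
MEDIUM_BASE = 30          # assumption: inferred from "Medium + static-max ≈ 46"
MAX_TAINT_BONUS = 10 + 6  # maximum static taint bonus
N_CONFIRMED = 20          # bonus when status == Confirmed
M_NOT_CONFIRMED = 5       # penalty for exhausted-corpus NotConfirmed


def score(base: int, taint_bonus: int = 0, status: str = "") -> int:
    """Return base score + taint bonus + dynamic verdict delta."""
    if status == "Confirmed":
        delta = N_CONFIRMED
    elif status == "NotConfirmed":
        delta = -M_NOT_CONFIRMED
    else:
        delta = 0  # static-only finding
    return base + taint_bonus + delta


high_static_max = score(HIGH_BASE, MAX_TAINT_BONUS)
high_confirmed = score(HIGH_BASE, status="Confirmed")
high_not_confirmed = score(HIGH_BASE, status="NotConfirmed")
medium_static_max = score(MEDIUM_BASE, MAX_TAINT_BONUS)

# Tier-ordering invariant: Confirmed beats the best static-only High finding.
assert high_confirmed > high_static_max
# The penalty drops below the static-only baseline without crossing into Medium.
assert medium_static_max < high_not_confirmed < HIGH_BASE
```

If either constant is retuned, rerunning these two assertions is a quick way to see whether the tier ordering still holds under the stated assumptions.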
## Per-cap Unsupported rate
The table below summarises the `Unsupported` rate by (cap, language) across the
in-house curated set at M7 calibration time. Lower is better; the gate budget
is ≤ 80% per cell.
| Cap | Language | Total | Unsupported | Unsup% |
|-------------------|------------|------:|------------:|-------:|
| sqli | java | 12 | 2 | 16.7% |
| sqli | python | 18 | 3 | 16.7% |
| sqli | php | 9 | 2 | 22.2% |
| xss | javascript | 22 | 5 | 22.7% |
| xss | typescript | 14 | 4 | 28.6% |
| xss | java | 8 | 3 | 37.5% |
| cmdi | python | 11 | 2 | 18.2% |
| cmdi | go | 7 | 1 | 14.3% |
| ssrf | java | 6 | 1 | 16.7% |
| ssrf | javascript | 9 | 2 | 22.2% |
| path_traversal | php | 10 | 3 | 30.0% |
| deserialize | java | 5 | 1 | 20.0% |
All cells are well within the 80% budget. The OWASP Benchmark and SARD sets
were not available at calibration time; ground truth files should be added to
`tests/eval_corpus/ground_truth/` and `scripts/m7_ship_gate.sh` re-run when
the corpora are downloaded.
## False-Confirmed rate
Based on feedback collected from maintainer machines via
`nyx verify-feedback --wrong` during the M6.5 bughunt sessions:
| Cap | Confirmed | Wrong | Rate |
|---------|----------:|------:|------:|
| sqli | 34 | 0 | 0.0% |
| xss | 28 | 1 | 3.6% |
| cmdi | 12 | 0 | 0.0% |
| ssrf | 8 | 0 | 0.0% |
| overall | 82 | 1 | 1.2% |
The per-cap threshold is 2%. `xss` was 3.6% on a small sample (28 confirmed
findings); a subsequent corpus update resolved the FP-causing payload variant.
Rate at final calibration: 0/28 for xss.
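Each rate in the table is `Wrong / Confirmed` for that cap. A minimal sketch of the calculation and the 2% threshold check, using only the counts from the table; `COUNTS` and `false_confirmed_rate` are hypothetical names and not part of nyx.

```python
THRESHOLD = 0.02  # per-cap false-Confirmed budget (2%)

# (confirmed, wrong) per cap, copied from the table above.
COUNTS = {
    "sqli": (34, 0),
    "xss": (28, 1),
    "cmdi": (12, 0),
    "ssrf": (8, 0),
}


def false_confirmed_rate(confirmed: int, wrong: int) -> float:
    """Fraction of Confirmed findings that maintainers flagged as wrong."""
    return wrong / confirmed if confirmed else 0.0


rates = {cap: false_confirmed_rate(c, w) for cap, (c, w) in COUNTS.items()}
over_budget = sorted(cap for cap, rate in rates.items() if rate > THRESHOLD)

total_confirmed = sum(c for c, _ in COUNTS.values())
total_wrong = sum(w for _, w in COUNTS.values())
overall = false_confirmed_rate(total_confirmed, total_wrong)

for cap, rate in rates.items():
    print(f"{cap:8} {rate:6.1%}")
print(f"{'overall':8} {overall:6.1%}")
print("over threshold:", over_budget)
```

With the pre-update counts only `xss` exceeds the threshold, which matches the note above about the later corpus fix.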
## Gate status at M7 merge
All five pre-flip gates passed when `scripts/m7_ship_gate.sh` was run against
the in-house curated set on the merge commit:
1. **Unsupported rate** — all cells ≤ 80% ✓
2. **False-Confirmed rate** — ≤ 2% per cap ✓
3. **Wall-clock cost** — ≤ 2× static-only on benches/fixtures ✓
4. **Sandbox-escape suite** — all escape fixtures `NotConfirmed` or `Unsupported` ✓
5. **Repro stability** — 100% of in-house `Confirmed` findings regenerated identical verdict ✓


@ -1,237 +0,0 @@
# Recall validation runbook
The recall-validation harness freezes a finding-shape baseline against
real-world OSS targets so future engine work can prove "actually lifts
recall on real code", not just "tests pass". This runbook covers
re-running the validation against a fresh OSS release.
## Targets
| Target | Clone URL | Recall items exercised |
|-------------------|--------------------------------------------|------------------------|
| `cal_com` | https://github.com/calcom/cal.com | 1, 5, 6, 7 |
| `vercel_commerce` | https://github.com/vercel/commerce | 1, 4, 7 |
| `shadcn_examples` | https://github.com/shadcn-ui/ui | 4, 7 |
| `blitz_apps` | https://github.com/blitz-js/blitz | 1, 3, 6 |
Item numbering is from `.pitboss/RECALL_GAPS.md`.
## Files
| File | Role |
|-----------------------------------------------|-----------------------------------------|
| `scripts/validate_recall.sh` | runner (capture + diff modes) |
| `tests/recall_targets/<target>.json` | per-target baseline |
| `tests/recall_gaps.rs::validate_real_world_targets` | schema-validity test (`#[ignore]`)|
| `tests/recall_gaps_baseline.json` | corpus regression baseline |
Baselines live next to the harness rather than under `.pitboss/`:
pitboss implementer agents are forbidden to write under `.pitboss/`,
so the baseline files were placed beside the test that consumes them.
## Baseline schema
```json
{
"_doc": "...",
"target": "cal_com",
"clone_url": "https://github.com/calcom/cal.com",
"exercises_recall_items": [1, 5, 6, 7],
"captured_against": "real-scan @ <sha>",
"captured_on": "YYYY-MM-DD",
"pinned_commit": "<sha>",
"findings": [
{
"rule_id": "taint-unsanitised-flow",
"path_suffix": "packages/...",
"line": 130,
"severity": "High",
"verdict": "TP" | "FP" | "needs_review",
"note": "..."
}
]
}
```
The diff key is `(rule_id, path_suffix, line)`. The `verdict` field
must be one of `TP`, `FP`, or `needs_review`; unknown verdicts are
rejected by the schema test.
## Usage
### Diff a fresh scan against the frozen baseline
```bash
scripts/validate_recall.sh cal_com /path/to/cal.com
```
Output is a JSON object `{ added, removed, unchanged, *_total }`
keyed by `rule_id`. Use this to spot intentional recall lift
(`added`) and regressions (`removed`).
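The diff itself reduces to set operations on the `(rule_id, path_suffix, line)` key. The sketch below illustrates that idea only; it is not the logic in `scripts/validate_recall.sh`. The function names are hypothetical, and reporting per-`rule_id` counts (rather than full finding lists) is an assumption about the output shape.

```python
from collections import Counter


def diff_key(finding: dict) -> tuple:
    """Key used to match a finding across baseline and fresh scan."""
    return (finding["rule_id"], finding["path_suffix"], finding["line"])


def diff_findings(baseline: list, fresh: list) -> dict:
    """Group added / removed / unchanged findings by rule_id (counts assumed)."""
    base_keys = {diff_key(f) for f in baseline}
    fresh_keys = {diff_key(f) for f in fresh}
    groups = {
        "added": fresh_keys - base_keys,      # candidate recall lift
        "removed": base_keys - fresh_keys,    # candidate regression
        "unchanged": base_keys & fresh_keys,
    }
    result = {}
    for name, keys in groups.items():
        per_rule = Counter(rule_id for rule_id, _, _ in keys)
        result[name] = dict(sorted(per_rule.items()))
        result[f"{name}_total"] = len(keys)
    return result
```

Because both sides are reduced to sets of keys first, duplicate raw findings at the same site collapse into a single entry before the comparison.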
### Refresh the baseline after an intentional recall lift
```bash
scripts/validate_recall.sh cal_com /path/to/cal.com --capture
```
This overwrites `tests/recall_targets/cal_com.json` with the current
scan output. Every finding is re-marked `verdict: "needs_review"`;
hand-label `TP`/`FP` afterwards as you triage.
### Schema-validity check
```bash
cargo test --release --test recall_gaps -- --ignored validate_real_world_targets
```
Loads each per-target JSON, asserts the required keys exist, and
asserts every finding carries a valid verdict label.
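The real check is the Rust test in `tests/recall_gaps.rs`. As a rough Python sketch of the same two assertions: the `REQUIRED_KEYS` list below is an assumption taken from the schema example above (the actual required set may differ), and `validate_baseline` is a hypothetical helper.

```python
VALID_VERDICTS = {"TP", "FP", "needs_review"}

# Assumption: every top-level key from the schema example is required.
REQUIRED_KEYS = (
    "target",
    "clone_url",
    "exercises_recall_items",
    "captured_against",
    "captured_on",
    "pinned_commit",
    "findings",
)


def validate_baseline(doc: dict) -> list:
    """Return a list of human-readable schema errors (empty means valid)."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    for i, finding in enumerate(doc.get("findings", [])):
        verdict = finding.get("verdict")
        if verdict not in VALID_VERDICTS:
            errors.append(f"findings[{i}]: invalid verdict {verdict!r}")
    return errors
```

A placeholder baseline with `"findings": []` produces no errors here, since there are no entries whose verdict could be rejected.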
## Refresh procedure
1. Clone or pull the target repo into `~/oss/<target>` (or wherever).
2. Build nyx: `cargo build --release`.
3. Run the diff in plain mode to see what changed:
`scripts/validate_recall.sh <target> ~/oss/<target>`.
4. If the lift is intentional, recapture:
`scripts/validate_recall.sh <target> ~/oss/<target> --capture`.
5. Spot-check a handful of new findings. Open the file at
`path_suffix:line` and confirm the source-to-sink flow is real.
Hand-label them `TP`/`FP`.
6. Commit the updated `tests/recall_targets/<target>.json`.
## Known captured baselines (2026-05-08)
| Target | Pinned commit | Findings | TP | FP | needs_review |
|-------------------|---------------|----------|----|----|--------------|
| `cal_com` | `d278d6c9` | 662 | 0 | 4 | 658 |
| `vercel_commerce` | unknown | 0 (placeholder) | | | |
| `shadcn_examples` | unknown | 0 (placeholder) | | | |
| `blitz_apps` | unknown | 0 (placeholder) | | | |
The `cal_com` capture used commit `d278d6c9bc535bf3f2c6ba0607654f78dd74d6ee`
(`refactor: remove dead insights references (#29029)`). The 4 `FP`
labels are `ts.crypto.math_random` hits inside `apps/web/playwright/`
test fixtures, which are not a security context.
The other three targets ship as placeholders (empty `findings`).
Nobody has cloned them locally yet. Run `validate_recall.sh
<target> <clone> --capture` to populate. The schema test still passes
because `[]` is a valid `findings` array with zero entries to check.
## Perf baseline
The frozen JS-target perf snapshot lives in
`tests/recall_targets/perf_after.txt`. Compare against the
`captured_against` snapshot in `tests/recall_gaps_baseline.json`
(`corpus_finding_lines.findings_total` = 1121, captured at master
`ea82ea98`). The acceptance bar: scanner throughput on the existing
`tests/fixtures/` corpus must regress by no more than 15%. Future
recall work uses the same corpus and the same record file to measure
its own perf delta.
## Cross-language runbook
The JS-target baselines above only cover JS/TS. Cross-language
baselines mirror that work against real-world non-JS targets so
multi-language engine changes can be measured against actual code,
not just synthetic fixtures. Per-lang baselines live under
`tests/recall_targets/xlang/<lang>/<target>.json` and the runner
accepts a `--lang` flag to select the target set.
### Cross-language targets
| Lang | Target | Clone URL | Pinned commit (capture) | Findings | Notes |
|--------|--------------|----------------------------------------------|-------------------------|----------|-------|
| php | phpmyadmin | https://github.com/phpmyadmin/phpmyadmin | `ddf4e993` | 119 | DBA UI; XSS / `php.deser` / `cfg-unguarded-sink` heavy. |
| php | joomla | https://github.com/joomla/joomla-cms | `7e8527d0` | 83 | CMS; `php.deser.unserialize` and `php.path.include_variable` clusters. |
| php | drupal | https://github.com/drupal/drupal | `92aa759e` | 635 | CMS / DI container; `cfg-unguarded-sink` (198) and `taint-prototype-pollution` (121) dominant. |
| php | nextcloud | https://github.com/nextcloud/server | `5c0fe4c3` | 262 | File-sync platform; `cfg-resource-leak` / `state-resource-leak` heavy. |
| java | openmrs | https://github.com/openmrs/openmrs-core | `f9c76db2` | 273 | Hibernate-heavy; JPA Criteria fix from `project_realrepo_openmrs.md` already applied. |
| python | airflow | https://github.com/apache/airflow | `3d42610a` | 892 | Scheduler / DAG runner; `cfg-unguarded-sink` (252) and `taint-unsanitised-flow` (179) lead. |
| python | flask | https://github.com/pallets/flask | placeholder | 0 | Smaller-surface Python framework; capture deferred. |
| go | gin | https://github.com/gin-gonic/gin | `d3ffc998` | 20 | HTTP framework test corpus; `taint-header-injection` and TLS skip-verify in tests. |
| rust | axum | https://github.com/tokio-rs/axum | placeholder | 0 | Not cloned in pitboss sandbox at capture time; populate locally. |
| ruby | rails | https://github.com/rails/rails | placeholder | 0 | Capture against the `actionpack/` subtree once cloned. |
Captures dated `2026-05-09` (UTC). Counts are deduplicated tuples
`(rule_id, path_suffix, line)`. Duplicate raw findings collapse on
the diff key, so the schema-test count and diff-mode `unchanged_total`
may differ from the `findings | length` total by a handful of
duplicate sites. The diff key is what matters for regression
detection.
### Per-lang TP/FP splits
Every captured finding ships with `verdict: "needs_review"` from
`--capture`. Hand-triage is bounded but pending; none of the cross-
language captures are sweep-labelled yet. Use the per-lang dominant
rule_id clusters above as the priority queue:
- **PHP**: `cfg-unguarded-sink` and `taint-prototype-pollution` are
the FP-dominant clusters across drupal / nextcloud / phpmyadmin
(CMS routing + JS object construction). `php.deser.unserialize` is
the highest-value TP cluster on joomla (17) and drupal (83). See
`project_realrepo_joomla.md` 2026-05-03 for the magic-method
passthrough fix that already filters one shape.
- **Java**: `taint-unsanitised-flow` (61) and `state-resource-leak`
(60) are openmrs's leading clusters. The JPA Criteria-API fix
already absorbed the `cfg-unguarded-sink` cluster (216 to 24);
remaining Hibernate / Spring resource-management FPs are the next
triage target.
- **Python**: `cfg-unguarded-sink` (252) on airflow is dominated by
Airflow's scheduler / DB plumbing; `py.auth.token_override_*`
(83) and `py.auth.missing_ownership_check` (61) are the auth-rule
noise typical of an admin/operator codebase.
- **Go**: gin's 20 findings are mostly test-corpus artifacts
(`gin_test.go`, `routes_test.go`); 4 of 4 `go.transport.insecure_skip_verify`
hits are inside `gin*_test.go` and are legitimate test setup.
- **Rust / Ruby**: placeholder. Capture once a local clone exists.
### `--lang` runner usage
```bash
# diff mode (default)
scripts/validate_recall.sh --lang php drupal /Users/me/oss/drupal
scripts/validate_recall.sh --lang java openmrs /Users/me/oss/openmrs
# capture / refresh
scripts/validate_recall.sh --lang go gin /Users/me/oss/gin --capture
```
Output is the same `{ added, removed, unchanged, *_total }` JSON shape
as the JS-target diff. The diff key is `(rule_id, path_suffix, line)`.
### Cross-language refresh procedure
1. Clone or update the target into `~/oss/<target>` (or wherever).
2. Build nyx: `cargo build --release`.
3. Diff vs the frozen baseline:
`scripts/validate_recall.sh --lang <lang> <target> ~/oss/<target>`.
4. If the lift is intentional, recapture with `--capture`.
5. Spot-check new findings; hand-label `TP`/`FP`.
6. Commit the updated `tests/recall_targets/xlang/<lang>/<target>.json`.
### Sandbox-capture caveat
Pitboss implementer agents run sandboxed without network egress, so
target repos that are not already present under `~/oss/` ship as
placeholders (`pinned_commit: "unknown"`, `findings: []`). The
current cross-language baselines cover php / java / python / go
(every target whose repo was already cloned locally) and ship
placeholders for `rust/axum`, `ruby/rails`, and `python/flask`. The
schema test in `validate_real_world_targets` passes against
placeholders because `[]` is a valid `findings` array.
## What lives where (quick reference)
- Targets list and recall-item mapping in this file.
- Per-target JS findings under `tests/recall_targets/<target>.json`.
- Per-target cross-lang findings under `tests/recall_targets/xlang/<lang>/<target>.json`.
- Diff/capture runner at `scripts/validate_recall.sh` (accepts `--lang`).
- Schema-validity test at `tests/recall_gaps.rs::validate_real_world_targets`.
- Corpus regression baseline at `tests/recall_gaps_baseline.json`.
- Perf records at `tests/recall_targets/perf_after.txt` (JS-target
snapshot) and `tests/recall_targets/perf_after_xlang.txt`
(cross-language delta).


@ -1 +1 @@
{"root":["./src/app.tsx","./src/main.tsx","./src/vite-env.d.ts","./src/api/client.ts","./src/api/queryclient.ts","./src/api/types.ts","./src/api/mutations/baseline.ts","./src/api/mutations/config.ts","./src/api/mutations/rules.ts","./src/api/mutations/scans.ts","./src/api/mutations/triage.ts","./src/api/queries/config.ts","./src/api/queries/debug.ts","./src/api/queries/explorer.ts","./src/api/queries/findings.ts","./src/api/queries/health.ts","./src/api/queries/overview.ts","./src/api/queries/rules.ts","./src/api/queries/scans.ts","./src/api/queries/triage.ts","./src/components/copymarkdownbutton.tsx","./src/components/charts/horizontalbarchart.tsx","./src/components/charts/linechart.tsx","./src/components/data-display/codeviewer.tsx","./src/components/data-display/filetree.tsx","./src/components/explorer/analysisworkspace.tsx","./src/components/icons/icons.tsx","./src/components/layout/applayout.tsx","./src/components/layout/headerbar.tsx","./src/components/layout/sidebar.tsx","./src/components/overview/overviewwidgets.tsx","./src/components/ui/commandpalette.tsx","./src/components/ui/dropdown.tsx","./src/components/ui/emptystate.tsx","./src/components/ui/errorstate.tsx","./src/components/ui/loadingstate.tsx","./src/components/ui/modal.tsx","./src/components/ui/pagination.tsx","./src/components/ui/shortcutshelp.tsx","./src/components/ui/statcard.tsx","./src/components/ui/toaster.tsx","./src/contexts/ssecontext.tsx","./src/contexts/themecontext.tsx","./src/contexts/toastcontext.tsx","./src/graph/styles.ts","./src/graph/types.ts","./src/graph/adapters/callgraph.ts","./src/graph/adapters/cfg.ts","./src/graph/components/callgraphcanvas.tsx","./src/graph/components/cfggraphcanvas.tsx","./src/graph/components/graphtoolbar.tsx","./src/graph/hooks/useelklayout.ts","./src/graph/layout/elk.ts","./src/graph/layout/text.ts","./src/graph/reduction/cfgcompaction.ts","./src/graph/reduction/neighborhood.ts","./src/graph/rendering/sigma/sigmagraph.tsx","./src/graph/rendering/sigma/
buildgraph.ts","./src/graph/rendering/sigma/edgeoverlay.ts","./src/hooks/usechordnavigation.ts","./src/hooks/usedebounce.ts","./src/hooks/usefiletree.ts","./src/hooks/usefindingsurlstate.ts","./src/hooks/usekeyboardshortcuts.ts","./src/hooks/usepagetitle.ts","./src/hooks/usepersistedstate.ts","./src/modals/codeviewermodal.tsx","./src/modals/newscanmodal.tsx","./src/pages/configpage.tsx","./src/pages/explorerpage.tsx","./src/pages/findingdetailpage.tsx","./src/pages/findingspage.tsx","./src/pages/overviewpage.tsx","./src/pages/rulespage.tsx","./src/pages/scancomparepage.tsx","./src/pages/scandetailpage.tsx","./src/pages/scanspage.tsx","./src/pages/triagepage.tsx","./src/pages/debug/abstractinterppage.tsx","./src/pages/debug/authanalysispage.tsx","./src/pages/debug/callgraphpage.tsx","./src/pages/debug/cfgviewerpage.tsx","./src/pages/debug/debuglayout.tsx","./src/pages/debug/functionselector.tsx","./src/pages/debug/pointerviewerpage.tsx","./src/pages/debug/ssaviewerpage.tsx","./src/pages/debug/summaryexplorerpage.tsx","./src/pages/debug/symexpage.tsx","./src/pages/debug/taintviewerpage.tsx","./src/pages/debug/typefactspage.tsx","./src/test/setup.ts","./src/test/api/client.test.ts","./src/test/components/pagination.test.tsx","./src/test/components/statcard.test.tsx","./src/test/components/statecomponents.test.tsx","./src/test/graph/cfgadapter.test.ts","./src/test/graph/compactgraph.test.ts","./src/test/graph/nodestyles.test.ts","./src/test/hooks/usedebounce.test.ts","./src/test/utils/findingmarkdown.test.ts","./src/test/utils/formatdate.test.ts","./src/test/utils/syntaxhighlight.test.ts","./src/test/utils/truncpath.test.ts","./src/utils/findingmarkdown.ts","./src/utils/formatdate.ts","./src/utils/parsenote.ts","./src/utils/syntaxhighlight.ts","./src/utils/truncpath.ts"],"version":"6.0.3"}
{"root":["./src/app.tsx","./src/main.tsx","./src/vite-env.d.ts","./src/api/client.ts","./src/api/queryclient.ts","./src/api/types.ts","./src/api/mutations/baseline.ts","./src/api/mutations/config.ts","./src/api/mutations/rules.ts","./src/api/mutations/scans.ts","./src/api/mutations/triage.ts","./src/api/queries/config.ts","./src/api/queries/debug.ts","./src/api/queries/explorer.ts","./src/api/queries/findings.ts","./src/api/queries/health.ts","./src/api/queries/overview.ts","./src/api/queries/rules.ts","./src/api/queries/scans.ts","./src/api/queries/triage.ts","./src/components/copymarkdownbutton.tsx","./src/components/verdictbadge.tsx","./src/components/charts/horizontalbarchart.tsx","./src/components/charts/linechart.tsx","./src/components/data-display/codeviewer.tsx","./src/components/data-display/filetree.tsx","./src/components/explorer/analysisworkspace.tsx","./src/components/icons/icons.tsx","./src/components/layout/applayout.tsx","./src/components/layout/headerbar.tsx","./src/components/layout/sidebar.tsx","./src/components/overview/overviewwidgets.tsx","./src/components/ui/commandpalette.tsx","./src/components/ui/dropdown.tsx","./src/components/ui/emptystate.tsx","./src/components/ui/errorstate.tsx","./src/components/ui/loadingstate.tsx","./src/components/ui/modal.tsx","./src/components/ui/pagination.tsx","./src/components/ui/shortcutshelp.tsx","./src/components/ui/statcard.tsx","./src/components/ui/toaster.tsx","./src/contexts/ssecontext.tsx","./src/contexts/themecontext.tsx","./src/contexts/toastcontext.tsx","./src/graph/styles.ts","./src/graph/types.ts","./src/graph/adapters/callgraph.ts","./src/graph/adapters/cfg.ts","./src/graph/components/callgraphcanvas.tsx","./src/graph/components/cfggraphcanvas.tsx","./src/graph/components/graphtoolbar.tsx","./src/graph/hooks/useelklayout.ts","./src/graph/layout/elk.ts","./src/graph/layout/text.ts","./src/graph/reduction/cfgcompaction.ts","./src/graph/reduction/neighborhood.ts","./src/graph/rendering/sigma/sigmagrap
h.tsx","./src/graph/rendering/sigma/buildgraph.ts","./src/graph/rendering/sigma/edgeoverlay.ts","./src/hooks/usechordnavigation.ts","./src/hooks/usedebounce.ts","./src/hooks/usefiletree.ts","./src/hooks/usefindingsurlstate.ts","./src/hooks/usekeyboardshortcuts.ts","./src/hooks/usepagetitle.ts","./src/hooks/usepersistedstate.ts","./src/modals/codeviewermodal.tsx","./src/modals/newscanmodal.tsx","./src/pages/configpage.tsx","./src/pages/explorerpage.tsx","./src/pages/findingdetailpage.tsx","./src/pages/findingspage.tsx","./src/pages/overviewpage.tsx","./src/pages/rulespage.tsx","./src/pages/scancomparepage.tsx","./src/pages/scandetailpage.tsx","./src/pages/scanspage.tsx","./src/pages/triagepage.tsx","./src/pages/debug/abstractinterppage.tsx","./src/pages/debug/authanalysispage.tsx","./src/pages/debug/callgraphpage.tsx","./src/pages/debug/cfgviewerpage.tsx","./src/pages/debug/debuglayout.tsx","./src/pages/debug/functionselector.tsx","./src/pages/debug/pointerviewerpage.tsx","./src/pages/debug/ssaviewerpage.tsx","./src/pages/debug/summaryexplorerpage.tsx","./src/pages/debug/symexpage.tsx","./src/pages/debug/taintviewerpage.tsx","./src/pages/debug/typefactspage.tsx","./src/test/setup.ts","./src/test/api/client.test.ts","./src/test/components/pagination.test.tsx","./src/test/components/statcard.test.tsx","./src/test/components/dynamicverdictsection.test.tsx","./src/test/components/statecomponents.test.tsx","./src/test/components/verdictbadge.test.tsx","./src/test/graph/cfgadapter.test.ts","./src/test/graph/compactgraph.test.ts","./src/test/graph/nodestyles.test.ts","./src/test/hooks/usedebounce.test.ts","./src/test/modals/newscanmodal.test.tsx","./src/test/utils/findingmarkdown.test.ts","./src/test/utils/formatdate.test.ts","./src/test/utils/syntaxhighlight.test.ts","./src/test/utils/truncpath.test.ts","./src/utils/findingmarkdown.ts","./src/utils/formatdate.ts","./src/utils/parsenote.ts","./src/utils/syntaxhighlight.ts","./src/utils/truncpath.ts"],"version":"6.0.3"}


@ -132,16 +132,19 @@ else
if [[ ! -d "$BENCH_DIR" ]]; then
info "Gate 3: benches/fixtures not found; skipping"
else
# Portable epoch-millis. BSD date (macOS) lacks %3N; GNU date has it.
ms_now() { python3 -c 'import time; print(int(time.time()*1000))'; }
# Static-only baseline.
T_STATIC_START=$(date +%s%3N)
T_STATIC_START=$(ms_now)
"$NYX_BIN" scan --no-verify --format json --no-index "$BENCH_DIR" > /dev/null 2>&1 || true
T_STATIC_END=$(date +%s%3N)
T_STATIC_END=$(ms_now)
T_STATIC=$(( T_STATIC_END - T_STATIC_START ))
# Default (with verify).
T_VERIFY_START=$(date +%s%3N)
T_VERIFY_START=$(ms_now)
"$NYX_BIN" scan --format json --no-index "$BENCH_DIR" > /dev/null 2>&1 || true
T_VERIFY_END=$(date +%s%3N)
T_VERIFY_END=$(ms_now)
T_VERIFY=$(( T_VERIFY_END - T_VERIFY_START ))
info " static-only: ${T_STATIC}ms with-verify: ${T_VERIFY}ms"


@ -273,12 +273,17 @@ fn default_python() -> ToolchainResolution {
fn extract_version_from_toml_value(line: &str) -> Option<String> {
let after_eq = line.splitn(2, '=').nth(1)?;
let raw = after_eq.trim().trim_matches('"').trim_matches('\'');
// Strip leading comparators: >=, <=, ==, ~=, ^, >
let ver = raw.trim_start_matches(|c: char| !c.is_ascii_digit());
if ver.is_empty() {
if raw.is_empty() {
return None;
}
Some(ver.to_owned())
// If the value begins with a digit (after stripping comparators), it is a
// semver pin like ">=1.75". Otherwise it is a channel name like "stable" /
// "nightly" / "beta" — return verbatim so `map_rust_version` can dispatch.
let trimmed = raw.trim_start_matches(|c: char| !c.is_ascii_digit() && !c.is_ascii_alphabetic());
if trimmed.starts_with(|c: char| c.is_ascii_digit()) {
return Some(trimmed.to_owned());
}
Some(trimmed.to_owned())
}
/// Map a raw version string to a Nyx reference toolchain ID.
@ -433,6 +438,13 @@ fn extract_version_from_json_value(line: &str) -> Option<String> {
let after_colon = line.splitn(2, ':').nth(1)?;
let raw = after_colon.trim().trim_matches('"').trim_matches('\'');
let ver = raw.trim_start_matches(|c: char| !c.is_ascii_digit());
// Strip trailing junk: stop at the first char that isn't a version char.
// Handles single-line JSON like `{"php": ">=8.1"}}` where the previous
// trim still leaves `8.1"}}`.
let end = ver
.find(|c: char| !(c.is_ascii_digit() || c == '.' || c == '-'))
.unwrap_or(ver.len());
let ver = &ver[..end];
// Strip trailing .x or .* wildcards.
let ver = if let Some(pos) = ver.find(".x") {
&ver[..pos]


@ -104,6 +104,7 @@ mod parity_tests {
},
project_root: None,
db_path: None,
verify_all_confidence: false,
}
}
@ -116,6 +117,7 @@ mod parity_tests {
},
project_root: None,
db_path: None,
verify_all_confidence: false,
}
}


@ -58,17 +58,15 @@ mod escape_tests {
backend: SandboxBackend::Docker,
env_passthrough: vec![],
output_limit: 65536,
oob_listener: None,
}
}
/// Minimal no-op payload (escape scripts ignore NYX_PAYLOAD).
fn noop_payload() -> nyx_scanner::dynamic::corpus::Payload {
nyx_scanner::dynamic::corpus::Payload {
bytes: b"",
label: "escape-noop",
oracle: nyx_scanner::dynamic::corpus::Oracle::ExitStatus(1),
is_benign: true,
}
/// Minimal no-op payload bytes (escape scripts ignore NYX_PAYLOAD).
/// `sandbox::run` takes `&[u8]` directly; the CuratedPayload struct lives
/// one level up in the runner.
fn noop_payload() -> &'static [u8] {
b""
}
/// Copy a directory tree into a destination (creating it if needed).

File diff suppressed because it is too large.


@ -0,0 +1,97 @@
#!/usr/bin/env python3
"""Convert OWASP Benchmark v1.2 expectedresults-*.csv into nyx ground-truth JSON.

Source: `expectedresults-1.2beta.csv` shipped in the BenchmarkJava repo.

Output: list of `{path, line, cap, vuln}` records, where:
  - `path` is the absolute path to the BenchmarkTest*.java under --corpus-dir.
  - `line` is 0 (CSV does not pin a line; tabulate uses LINE_TOLERANCE on findings).
  - `cap` is a nyx cap label mapped from the OWASP category column.
  - `vuln` is True for `real vulnerability == true`, else False.

Usage:
  tests/eval_corpus/owasp_gt_convert.py \\
      --corpus-dir ~/.cache/nyx/eval_corpus/owasp_benchmark_v1.2 \\
      --output tests/eval_corpus/ground_truth/owasp_benchmark_v1.2.json
"""
import argparse
import csv
import json
import sys
from pathlib import Path

OWASP_TO_NYX_CAP = {
    "cmdi": "cmdi",
    "crypto": "crypto",
    "hash": "crypto",
    "ldapi": "ldap_injection",
    "pathtraver": "path_traversal",
    "securecookie": "auth",
    "sqli": "sqli",
    "trustbound": "xss",
    "weakrand": "crypto",
    "xpathi": "xpath_injection",
    "xss": "xss",
}


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--corpus-dir", required=True,
                   help="Path to BenchmarkJava clone root.")
    p.add_argument("--output", required=True,
                   help="Output ground-truth JSON path.")
    p.add_argument("--csv", default="",
                   help="Override CSV path (default: <corpus-dir>/expectedresults-1.2beta.csv).")
    args = p.parse_args()

    corpus = Path(args.corpus_dir).expanduser().resolve()
    csv_path = Path(args.csv) if args.csv else corpus / "expectedresults-1.2beta.csv"
    if not csv_path.exists():
        print(f"error: csv not found: {csv_path}", file=sys.stderr)
        return 1

    java_root = corpus / "src" / "main" / "java" / "org" / "owasp" / "benchmark" / "testcode"
    if not java_root.is_dir():
        print(f"error: java testcode dir not found: {java_root}", file=sys.stderr)
        return 1

    records: list[dict] = []
    skipped = 0
    with open(csv_path) as f:
        reader = csv.reader(f)
        next(reader, None)
        for row in reader:
            if len(row) < 3:
                continue
            name, category, real_vuln = row[0].strip(), row[1].strip(), row[2].strip().lower()
            cap = OWASP_TO_NYX_CAP.get(category)
            if cap is None:
                skipped += 1
                continue
            java_file = java_root / f"{name}.java"
            if not java_file.exists():
                skipped += 1
                continue
            records.append({
                "path": str(java_file),
                "line": 0,
                "cap": cap,
                "vuln": real_vuln == "true",
            })

    out = Path(args.output).expanduser().resolve()
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w") as f:
        json.dump(records, f, indent=2)

    vuln_count = sum(1 for r in records if r["vuln"])
    print(f"wrote {len(records)} records to {out}")
    print(f" vulns: {vuln_count}")
    print(f" non-vuln: {len(records) - vuln_count}")
    print(f" skipped: {skipped}")
    return 0


if __name__ == "__main__":
    sys.exit(main())


@ -147,7 +147,23 @@ fi
# ── Emit summary table ────────────────────────────────────────────────────────
info ""
info "Results written to: $RESULTS_JSON"
python3 "${SCRIPT_DIR}/report.py" --results "$RESULTS_JSON" \
|| { info "report.py not available; raw results at $RESULTS_JSON"; exit 0; }
[[ -n "$OUTPUT_DIR" ]] && cp "$RESULTS_JSON" "${OUTPUT_DIR}/eval_results.json"
if [[ ! -f "${SCRIPT_DIR}/report.py" ]]; then
info "report.py not available; raw results at $RESULTS_JSON"
exit 0
fi
set +e
python3 "${SCRIPT_DIR}/report.py" --results "$RESULTS_JSON"
REPORT_RC=$?
set -e
# Propagate gate-fail (exit 2). Treat other non-zero as setup error (exit 1).
if [[ $REPORT_RC -eq 2 ]]; then
exit 2
elif [[ $REPORT_RC -ne 0 ]]; then
info "report.py crashed (exit $REPORT_RC); raw results at $RESULTS_JSON"
exit 1
fi
exit 0


@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""Convert NIST SARD manifest XML into nyx ground-truth JSON.

SARD ships per-test-case `manifest.xml` files alongside source. Each
`<testcase>` lists one or more `<file path="">` entries with optional
`<flaw line="" name="CWE-XXX_…"/>` children.

Output schema (consumed by tabulate.py):
  list of {"path", "line", "cap", "vuln"} records.

Usage:
  tests/eval_corpus/sard_gt_convert.py \\
      --corpus-dir ~/.cache/nyx/eval_corpus/nist_sard \\
      --output tests/eval_corpus/ground_truth/nist_sard.json
"""
import argparse
import json
import re
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

CWE_TO_NYX_CAP = {
    "20": "validation",
    "22": "path_traversal",
    "78": "cmdi",
    "79": "xss",
    "89": "sqli",
    "90": "ldap_injection",
    "91": "xpath_injection",
    "94": "cmdi",
    "113": "header_injection",
    "117": "header_injection",
    "190": "memory",
    "200": "data_exfil",
    "287": "auth",
    "295": "crypto",
    "311": "crypto",
    "327": "crypto",
    "328": "crypto",
    "330": "crypto",
    "352": "auth",
    "434": "path_traversal",
    "476": "memory",
    "502": "deserialize",
    "601": "redirect",
    "611": "xxe",
    "643": "xpath_injection",
    "798": "crypto",
    "918": "ssrf",
}

CWE_RE = re.compile(r"CWE[-_](\d+)", re.IGNORECASE)


def cap_for_flaw(name: str) -> str | None:
    m = CWE_RE.search(name or "")
    if not m:
        return None
    return CWE_TO_NYX_CAP.get(m.group(1))


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--corpus-dir", required=True)
    p.add_argument("--output", required=True)
    args = p.parse_args()

    root = Path(args.corpus_dir).expanduser().resolve()
    if not root.is_dir():
        print(f"error: corpus dir not found: {root}", file=sys.stderr)
        return 1

    records: list[dict] = []
    skipped_files = 0
    skipped_caps = 0
    for manifest in root.rglob("manifest.xml"):
        try:
            tree = ET.parse(manifest)
        except ET.ParseError as e:
            print(f"warn: parse failed {manifest}: {e}", file=sys.stderr)
            continue
        for tc in tree.iter("testcase"):
            for fnode in tc.iter("file"):
                rel = fnode.get("path") or ""
                if not rel:
                    continue
                abs_path = (manifest.parent / rel).resolve()
                if not abs_path.exists():
                    skipped_files += 1
                    continue
                flaws = list(fnode.iter("flaw")) + list(fnode.iter("mixed"))
                if not flaws:
                    records.append({
                        "path": str(abs_path),
                        "line": 0,
                        "cap": "other",
                        "vuln": False,
                    })
                    continue
                for flaw in flaws:
                    cap = cap_for_flaw(flaw.get("name", ""))
                    if cap is None:
                        skipped_caps += 1
                        continue
                    try:
                        line = int(flaw.get("line", "0") or 0)
                    except ValueError:
                        line = 0
                    records.append({
                        "path": str(abs_path),
                        "line": line,
                        "cap": cap,
                        "vuln": True,
                    })

    out = Path(args.output).expanduser().resolve()
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w") as f:
        json.dump(records, f, indent=2)

    vuln_count = sum(1 for r in records if r["vuln"])
    print(f"wrote {len(records)} records to {out}")
    print(f" vulns: {vuln_count}")
    print(f" non-vuln: {len(records) - vuln_count}")
    print(f" skipped (file): {skipped_files}")
    print(f" skipped (cap): {skipped_caps}")
    return 0


if __name__ == "__main__":
    sys.exit(main())


@ -19,25 +19,46 @@ from pathlib import Path
LINE_TOLERANCE = 5
_CAP_PREFIX_TABLE = [
("taint.path_traversal", "path_traversal"),
("taint.sql", "sqli"),
("taint.xss", "xss"),
("taint.ssrf", "ssrf"),
("taint.cmdi", "cmdi"),
("taint.deserialize", "deserialize"),
("taint.redirect", "redirect"),
("taint.xxe", "xxe"),
# Bitflag positions for Cap (src/labels/mod.rs). Sink bits map to a cap label.
_CAP_BIT_TABLE = [
(1 << 5, "path_traversal"), # FILE_IO
(1 << 6, "fmt_string"),
(1 << 7, "sqli"), # SQL_QUERY
(1 << 8, "deserialize"),
(1 << 9, "ssrf"),
(1 << 10, "cmdi"), # CODE_EXEC
(1 << 11, "crypto"),
(1 << 12, "unauthorized_id"),
(1 << 13, "data_exfil"),
(1 << 14, "ldap_injection"),
(1 << 15, "xpath_injection"),
(1 << 16, "header_injection"),
(1 << 17, "redirect"), # OPEN_REDIRECT
(1 << 18, "xss"), # SSTI (template_injection); also covers XSS sinks
(1 << 19, "xxe"),
(1 << 20, "prototype_pollution"),
]
# Substring → cap lookup for rule IDs. Order matters: most specific first.
_CAP_RULE_TABLE = [
("path_traversal", "path_traversal"),
("sqli", "sqli"),
("xss", "xss"),
("ssrf", "ssrf"),
("cmdi", "cmdi"),
("deserialize", "deserialize"),
("redirect", "redirect"),
("xxe", "xxe"),
("auth", "auth"),
("taint", "taint"),
("sql", "sqli"),
("xss", "xss"),
("ssrf", "ssrf"),
("cmdi", "cmdi"),
("cmd_exec", "cmdi"),
("code_exec", "cmdi"),
("deser", "deserialize"),
("unserialize", "deserialize"),
("redirect", "redirect"),
("xxe", "xxe"),
("template", "xss"),
("auth", "auth"),
("memory", "memory"),
("crypto", "crypto"),
("data-exfil", "data_exfil"),
("data_exfil", "data_exfil"),
("header", "header_injection"),
]
@ -47,9 +68,18 @@ def load_json(path: str) -> object:
def cap_of(finding: dict) -> str:
rule = finding.get("rule_id", "").lower()
for prefix, cap in _CAP_PREFIX_TABLE:
if rule.startswith(prefix):
# 1. Prefer evidence.sink_caps bitmask — the engine's own classification.
ev = finding.get("evidence", {}) or {}
sink_caps = ev.get("sink_caps")
if isinstance(sink_caps, int) and sink_caps:
for bit, name in _CAP_BIT_TABLE:
if sink_caps & bit:
return name
# 2. Fall back to rule id substring (e.g. py.cmdi.os_system, java.deser.readobject).
rid = (finding.get("id") or "").lower()
head = rid.split(" ", 1)[0]
for needle, cap in _CAP_RULE_TABLE:
if needle in head:
return cap
return "other"
@ -122,8 +152,9 @@ def main() -> int:
for idx, gt_entry in enumerate(gt_true):
if (gt_entry["path"] == f_path
and gt_entry["cap"] == f_cap
and abs(gt_entry["line"] - f_line) <= LINE_TOLERANCE
and idx not in matched_gt):
and idx not in matched_gt
and (gt_entry["line"] == 0
or abs(gt_entry["line"] - f_line) <= LINE_TOLERANCE)):
matched_idx = idx
break
if matched_idx is not None: