mirror of
https://github.com/elicpeter/nyx.git
synced 2026-06-12 19:55:14 +02:00
introduce ground-truth converters for OWASP and SARD datasets
parent
e62fddb82a
commit
5909fa8c5d
14 changed files with 16779 additions and 369 deletions
|
|
@@ -1,6 +1,6 @@
|
|||
# Dynamic verification
|
||||
|
||||
As of M7, nyx verifies every `Confidence >= Medium` finding by default: it builds
|
||||
Nyx verifies every `Confidence >= Medium` finding by default: it builds
|
||||
a minimal harness, runs your code's entry point against a curated payload corpus
|
||||
inside a sandbox, and records the verdict in each finding's evidence block.
|
||||
|
||||
|
|
|
|||
|
|
@@ -1,89 +0,0 @@
|
|||
# Dynamic verification — M7 eval corpus report
|
||||
|
||||
This document records the precision/recall calibration that preceded the M7
|
||||
default-on flip. The calibration was run against:
|
||||
|
||||
- **OWASP Benchmark v1.2** (Java, 2,740 test cases across 11 vulnerability classes)
|
||||
- **NIST SARD selected subset** (Java, Python, C/C++)
|
||||
- **In-house bughunt-curated set** (multi-language fixtures from real-world repos
|
||||
used in the `project_realrepo_*` bughunt sessions)
|
||||
|
||||
## Ranking calibration: N and M
|
||||
|
||||
The `dynamic_verdict_delta` component in `rank.rs` applies:
|
||||
|
||||
- `+N` (N = **20**) when `status == Confirmed`
|
||||
- `−M` (M = **5**) when `status == NotConfirmed` and the corpus was exhausted
|
||||
|
||||
### Derivation
|
||||
|
||||
The tier-ordering invariant requires that a `High` severity `Confirmed` finding
|
||||
always ranks above a `High` severity static-only finding regardless of taint
|
||||
quality. With baseline `High` score = 60 and maximum taint bonus = 10 + 6 = 16:
|
||||
|
||||
```
High + static-max = 76
High + Confirmed  = 60 + 20 = 80  ✓ (above static-max)
```
|
||||
|
||||
The penalty M = 5 ensures exhausted-corpus `NotConfirmed` findings drop below
|
||||
equal static-only peers without falling into a different severity tier:
|
||||
|
||||
```
High + NotConfirmed = 60 - 5 = 55   (below High static-only baseline 60)
Medium + static-max ≈ 46            (55 is still above this, so no tier cross)
```
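
The arithmetic above can be checked mechanically. The following is a minimal sketch, not the code in `rank.rs`: the function and constant names are hypothetical, and treating a `NotConfirmed` verdict on a non-exhausted corpus as a zero delta is an assumption read off the bullet list above.

```python
# Values taken from the derivation above.
HIGH_BASELINE = 60
MAX_TAINT_BONUS = 10 + 6
N_CONFIRMED = 20      # +N when status == Confirmed
M_NOT_CONFIRMED = 5   # -M when status == NotConfirmed and corpus exhausted


def dynamic_verdict_delta(status, corpus_exhausted=False):
    """Hypothetical stand-in for the dynamic_verdict_delta component."""
    if status == "Confirmed":
        return N_CONFIRMED
    if status == "NotConfirmed" and corpus_exhausted:
        return -M_NOT_CONFIRMED
    return 0  # assumption: every other case leaves the score unchanged


# Best possible High finding with no dynamic verdict.
static_max = HIGH_BASELINE + MAX_TAINT_BONUS
# Worst possible High finding that was Confirmed (no taint bonus at all).
confirmed_min = HIGH_BASELINE + dynamic_verdict_delta("Confirmed")
# High finding whose payload corpus was exhausted without confirmation.
not_confirmed = HIGH_BASELINE + dynamic_verdict_delta(
    "NotConfirmed", corpus_exhausted=True
)

# Tier-ordering invariant: Confirmed always outranks static-only.
assert confirmed_min > static_max
print(static_max, confirmed_min, not_confirmed)
```

Comparing the weakest `Confirmed` score against the strongest static-only score is what makes the invariant hold regardless of taint quality; any N of 16 or less would fail the assertion.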
|
||||
|
||||
## Per-cap Unsupported rate
|
||||
|
||||
The table below summarises the `Unsupported` rate by (cap, language) across the
|
||||
in-house curated set at M7 calibration time. Lower is better; the gate budget
|
||||
is ≤ 80% per cell.
|
||||
|
||||
| Cap               | Language   | Total | Unsupported | Unsup% |
|-------------------|------------|------:|------------:|-------:|
| sqli              | java       |    12 |           2 |  16.7% |
| sqli              | python     |    18 |           3 |  16.7% |
| sqli              | php        |     9 |           2 |  22.2% |
| xss               | javascript |    22 |           5 |  22.7% |
| xss               | typescript |    14 |           4 |  28.6% |
| xss               | java       |     8 |           3 |  37.5% |
| cmdi              | python     |    11 |           2 |  18.2% |
| cmdi              | go         |     7 |           1 |  14.3% |
| ssrf              | java       |     6 |           1 |  16.7% |
| ssrf              | javascript |     9 |           2 |  22.2% |
| path_traversal    | php        |    10 |           3 |  30.0% |
| deserialize       | java       |     5 |           1 |  20.0% |
|
||||
|
||||
All cells are well within the 80% budget. The OWASP Benchmark and SARD sets
|
||||
were not available at calibration time; ground truth files should be added to
|
||||
`tests/eval_corpus/ground_truth/` and `scripts/m7_ship_gate.sh` re-run when
|
||||
the corpora are downloaded.
|
||||
|
||||
## False-Confirmed rate
|
||||
|
||||
Based on feedback collected from maintainer machines via
|
||||
`nyx verify-feedback --wrong` during the M6.5 bughunt sessions:
|
||||
|
||||
| Cap     | Confirmed | Wrong |  Rate |
|---------|----------:|------:|------:|
| sqli    |        34 |     0 |  0.0% |
| xss     |        28 |     1 |  3.6% |
| cmdi    |        12 |     0 |  0.0% |
| ssrf    |         8 |     0 |  0.0% |
| overall |        82 |     1 |  1.2% |
|
||||
|
||||
The per-cap threshold is 2%. `xss` was 3.6% on a small sample (28 confirmed
|
||||
findings); a subsequent corpus update resolved the FP-causing payload variant.
|
||||
Rate at final calibration: 0/28 for xss.
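
As a rough illustration of how the per-cap threshold applies to the table above, the sketch below recomputes the rates from the raw counts and flags any cap over 2%. The data structure and function name are hypothetical; only the counts and the threshold come from this report.

```python
THRESHOLD_PCT = 2.0  # per-cap False-Confirmed budget from this report

# (confirmed, wrong) per cap, copied from the table above
# (counts from before the xss corpus fix).
feedback = {
    "sqli": (34, 0),
    "xss": (28, 1),
    "cmdi": (12, 0),
    "ssrf": (8, 0),
}


def false_confirmed_pct(confirmed, wrong):
    """Percentage of Confirmed verdicts that maintainers marked wrong."""
    return 100.0 * wrong / confirmed if confirmed else 0.0


rates = {
    cap: round(false_confirmed_pct(c, w), 1) for cap, (c, w) in feedback.items()
}
over_budget = sorted(cap for cap, pct in rates.items() if pct > THRESHOLD_PCT)

total_confirmed = sum(c for c, _ in feedback.values())
total_wrong = sum(w for _, w in feedback.values())
overall = round(false_confirmed_pct(total_confirmed, total_wrong), 1)

print(rates, over_budget, overall)
```

With only 28 confirmed `xss` findings, a single wrong verdict is enough to exceed the budget, which is why the sample size is called out above.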
|
||||
|
||||
## Gate status at M7 merge
|
||||
|
||||
All five pre-flip gates passed when `scripts/m7_ship_gate.sh` was run against
|
||||
the in-house curated set on the merge commit:
|
||||
|
||||
1. **Unsupported rate** — all cells ≤ 80% ✓
|
||||
2. **False-Confirmed rate** — ≤ 2% per cap ✓
|
||||
3. **Wall-clock cost** — ≤ 2× static-only on benches/fixtures ✓
|
||||
4. **Sandbox-escape suite** — all escape fixtures `NotConfirmed` or `Unsupported` ✓
|
||||
5. **Repro stability** — 100% of in-house `Confirmed` findings regenerated identical verdict ✓
|
||||
|
|
@@ -1,237 +0,0 @@
|
|||
# Recall validation runbook
|
||||
|
||||
The recall-validation harness freezes a finding-shape baseline against
real-world OSS targets so that future engine work can show it actually
lifts recall on real code, not merely that the tests pass. This runbook
covers re-running the validation against a fresh OSS release.
|
||||
|
||||
## Targets
|
||||
|
||||
| Target            | Clone URL                                  | Recall items exercised |
|-------------------|--------------------------------------------|------------------------|
| `cal_com`         | https://github.com/calcom/cal.com          | 1, 5, 6, 7             |
| `vercel_commerce` | https://github.com/vercel/commerce         | 1, 4, 7                |
| `shadcn_examples` | https://github.com/shadcn-ui/ui            | 4, 7                   |
| `blitz_apps`      | https://github.com/blitz-js/blitz          | 1, 3, 6                |
|
||||
|
||||
Item numbering is from `.pitboss/RECALL_GAPS.md`.
|
||||
|
||||
## Files
|
||||
|
||||
| File                                                | Role                               |
|-----------------------------------------------------|------------------------------------|
| `scripts/validate_recall.sh`                        | runner (capture + diff modes)      |
| `tests/recall_targets/<target>.json`                | per-target baseline                |
| `tests/recall_gaps.rs::validate_real_world_targets` | schema-validity test (`#[ignore]`) |
| `tests/recall_gaps_baseline.json`                   | corpus regression baseline         |
|
||||
|
||||
Baselines live next to the harness rather than under `.pitboss/`:
|
||||
pitboss implementer agents are forbidden to write under `.pitboss/`,
|
||||
so the baseline files were placed beside the test that consumes them.
|
||||
|
||||
## Baseline schema
|
||||
|
||||
```json
{
  "_doc": "...",
  "target": "cal_com",
  "clone_url": "https://github.com/calcom/cal.com",
  "exercises_recall_items": [1, 5, 6, 7],
  "captured_against": "real-scan @ <sha>",
  "captured_on": "YYYY-MM-DD",
  "pinned_commit": "<sha>",
  "findings": [
    {
      "rule_id": "taint-unsanitised-flow",
      "path_suffix": "packages/...",
      "line": 130,
      "severity": "High",
      "verdict": "needs_review",
      "note": "..."
    }
  ]
}
```
|
||||
|
||||
The diff key is `(rule_id, path_suffix, line)`. The `verdict` field
|
||||
must be one of `TP`, `FP`, or `needs_review`; unknown verdicts are
|
||||
rejected by the schema test.
|
||||
|
||||
## Usage
|
||||
|
||||
### Diff a fresh scan against the frozen baseline
|
||||
|
||||
```bash
scripts/validate_recall.sh cal_com /path/to/cal.com
```
|
||||
|
||||
Output is a JSON object `{ added, removed, unchanged, *_total }`
|
||||
keyed by `rule_id`. Use this to spot intentional recall lift
|
||||
(`added`) and regressions (`removed`).
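
To make the diff semantics concrete, here is a minimal sketch of a set-based diff on the `(rule_id, path_suffix, line)` key. It is not the logic in `scripts/validate_recall.sh`: the function names are hypothetical, and the exact output layout (per-bucket counts grouped by `rule_id`, plus `*_total` fields) is an assumption based on the description above.

```python
from collections import Counter


def diff_key(finding):
    """The diff key described in this runbook."""
    return (finding["rule_id"], finding["path_suffix"], finding["line"])


def diff_findings(baseline, fresh):
    """Compare two finding lists and summarise the result per rule_id."""
    base_keys = {diff_key(f) for f in baseline}
    fresh_keys = {diff_key(f) for f in fresh}
    buckets = {
        "added": fresh_keys - base_keys,      # candidate recall lift
        "removed": base_keys - fresh_keys,    # candidate regressions
        "unchanged": base_keys & fresh_keys,
    }
    report = {}
    for name, keys in buckets.items():
        # Assumed layout: rule_id -> number of findings in this bucket.
        report[name] = dict(Counter(rule_id for rule_id, _, _ in keys))
        report[f"{name}_total"] = len(keys)
    return report


# Illustrative data only; paths and lines are made up.
baseline = [
    {"rule_id": "taint-unsanitised-flow", "path_suffix": "a.ts", "line": 10},
    {"rule_id": "ts.crypto.math_random", "path_suffix": "b.ts", "line": 5},
]
fresh = [
    {"rule_id": "taint-unsanitised-flow", "path_suffix": "a.ts", "line": 10},
    {"rule_id": "taint-unsanitised-flow", "path_suffix": "c.ts", "line": 7},
]
report = diff_findings(baseline, fresh)
print(report)
```

Because the keys are held in sets, raw findings that share the same key collapse into a single entry before the comparison.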
|
||||
|
||||
### Refresh the baseline after an intentional recall lift
|
||||
|
||||
```bash
scripts/validate_recall.sh cal_com /path/to/cal.com --capture
```
|
||||
|
||||
This overwrites `tests/recall_targets/cal_com.json` with the current
|
||||
scan output. Every finding is re-marked `verdict: "needs_review"`;
|
||||
hand-label `TP`/`FP` afterwards as you triage.
|
||||
|
||||
### Schema-validity check
|
||||
|
||||
```bash
cargo test --release --test recall_gaps -- --ignored validate_real_world_targets
```
|
||||
|
||||
Loads each per-target JSON, asserts the required keys exist, and
|
||||
asserts every finding carries a valid verdict label.
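
The checks described above could look roughly like the sketch below. The real test lives in `tests/recall_gaps.rs`; this version is only an illustration, and the set of required keys is an assumption drawn from the baseline schema, since the runbook does not list which keys the test enforces.

```python
VALID_VERDICTS = {"TP", "FP", "needs_review"}

# Assumption: the runbook does not enumerate the required keys,
# so this set is a guess based on the baseline schema.
REQUIRED_KEYS = {
    "target",
    "clone_url",
    "exercises_recall_items",
    "pinned_commit",
    "findings",
}


def validate_baseline(doc):
    """Return a list of human-readable schema errors (empty if valid)."""
    errors = []
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    for i, finding in enumerate(doc.get("findings", [])):
        verdict = finding.get("verdict")
        if verdict not in VALID_VERDICTS:
            errors.append(f"findings[{i}]: invalid verdict {verdict!r}")
    return errors


# A placeholder baseline: no findings, so there is nothing to reject.
placeholder = {
    "target": "blitz_apps",
    "clone_url": "https://github.com/blitz-js/blitz",
    "exercises_recall_items": [1, 3, 6],
    "pinned_commit": "unknown",
    "findings": [],
}
bad = dict(placeholder, findings=[{"rule_id": "x", "verdict": "maybe"}])

print(validate_baseline(placeholder))
print(validate_baseline(bad))
```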
|
||||
|
||||
## Refresh procedure
|
||||
|
||||
1. Clone or pull the target repo into `~/oss/<target>` (or wherever).
|
||||
2. Build nyx: `cargo build --release`.
|
||||
3. Run the diff in plain mode to see what changed:
|
||||
`scripts/validate_recall.sh <target> ~/oss/<target>`.
|
||||
4. If the lift is intentional, recapture:
|
||||
`scripts/validate_recall.sh <target> ~/oss/<target> --capture`.
|
||||
5. Spot-check a handful of new findings. Open the file at
|
||||
`path_suffix:line` and confirm the source-to-sink flow is real.
|
||||
Hand-label them `TP`/`FP`.
|
||||
6. Commit the updated `tests/recall_targets/<target>.json`.
|
||||
|
||||
## Known captured baselines (2026-05-08)
|
||||
|
||||
| Target            | Pinned commit | Findings        | TP | FP | needs_review |
|-------------------|---------------|-----------------|----|----|--------------|
| `cal_com`         | `d278d6c9`    | 662             | 0  | 4  | 658          |
| `vercel_commerce` | unknown       | 0 (placeholder) |    |    |              |
| `shadcn_examples` | unknown       | 0 (placeholder) |    |    |              |
| `blitz_apps`      | unknown       | 0 (placeholder) |    |    |              |
|
||||
|
||||
The `cal_com` capture used commit `d278d6c9bc535bf3f2c6ba0607654f78dd74d6ee`
|
||||
(`refactor: remove dead insights references (#29029)`). The 4 `FP`
|
||||
labels are `ts.crypto.math_random` hits inside `apps/web/playwright/`
|
||||
test fixtures, which are not a security context.
|
||||
|
||||
The other three targets ship as placeholders (empty `findings`) because
nobody has cloned them locally yet. To populate one, run
`validate_recall.sh <target> <clone> --capture`. The schema test still
passes because `[]` is a valid `findings` array with zero entries to check.
|
||||
|
||||
## Perf baseline
|
||||
|
||||
The frozen JS-target perf snapshot lives in
|
||||
`tests/recall_targets/perf_after.txt`. Compare against the
|
||||
`captured_against` snapshot in `tests/recall_gaps_baseline.json`
|
||||
(`corpus_finding_lines.findings_total` = 1121, captured at master
|
||||
`ea82ea98`). The acceptance bar: scanner throughput on the existing
|
||||
`tests/fixtures/` corpus must regress by no more than 15%. Future
|
||||
recall work uses the same corpus and the same record file to measure
|
||||
its own perf delta.
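
The acceptance bar reduces to a single comparison. The sketch below assumes throughput is expressed as a "higher is better" number such as files per second; the format of `perf_after.txt` is not described here, so the inputs and the function name are hypothetical.

```python
MAX_REGRESSION = 0.15  # acceptance bar from this runbook


def within_perf_budget(baseline, current, max_regression=MAX_REGRESSION):
    """True if throughput dropped by no more than max_regression."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    regression = (baseline - current) / baseline
    return regression <= max_regression


# Illustrative numbers only (assumed unit: files per second).
print(within_perf_budget(1000.0, 900.0))  # 10% slower
print(within_perf_budget(1000.0, 800.0))  # 20% slower
```

A run that is faster than the baseline yields a negative regression and passes trivially.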
|
||||
|
||||
## Cross-language runbook
|
||||
|
||||
The JS-target baselines above only cover JS/TS. Cross-language
|
||||
baselines mirror that work against real-world non-JS targets so
|
||||
multi-language engine changes can be measured against actual code,
|
||||
not just synthetic fixtures. Per-lang baselines live under
|
||||
`tests/recall_targets/xlang/<lang>/<target>.json` and the runner
|
||||
accepts a `--lang` flag to select the target set.
|
||||
|
||||
### Cross-language targets
|
||||
|
||||
| Lang   | Target       | Clone URL                                    | Pinned commit (capture) | Findings | Notes |
|--------|--------------|----------------------------------------------|-------------------------|----------|-------|
| php    | phpmyadmin   | https://github.com/phpmyadmin/phpmyadmin     | `ddf4e993`              | 119      | DBA UI; XSS / `php.deser` / `cfg-unguarded-sink` heavy. |
| php    | joomla       | https://github.com/joomla/joomla-cms         | `7e8527d0`              | 83       | CMS; `php.deser.unserialize` and `php.path.include_variable` clusters. |
| php    | drupal       | https://github.com/drupal/drupal             | `92aa759e`              | 635      | CMS / DI container; `cfg-unguarded-sink` (198) and `taint-prototype-pollution` (121) dominant. |
| php    | nextcloud    | https://github.com/nextcloud/server          | `5c0fe4c3`              | 262      | File-sync platform; `cfg-resource-leak` / `state-resource-leak` heavy. |
| java   | openmrs      | https://github.com/openmrs/openmrs-core      | `f9c76db2`              | 273      | Hibernate-heavy; JPA Criteria fix from `project_realrepo_openmrs.md` already applied. |
| python | airflow      | https://github.com/apache/airflow            | `3d42610a`              | 892      | Scheduler / DAG runner; `cfg-unguarded-sink` (252) and `taint-unsanitised-flow` (179) lead. |
| python | flask        | https://github.com/pallets/flask             | placeholder             | 0        | Smaller-surface Python framework; capture deferred. |
| go     | gin          | https://github.com/gin-gonic/gin             | `d3ffc998`              | 20       | HTTP framework test corpus; `taint-header-injection` and TLS skip-verify in tests. |
| rust   | axum         | https://github.com/tokio-rs/axum             | placeholder             | 0        | Not cloned in pitboss sandbox at capture time; populate locally. |
| ruby   | rails        | https://github.com/rails/rails               | placeholder             | 0        | Capture against the `actionpack/` subtree once cloned. |
|
||||
|
||||
Captures dated `2026-05-09` (UTC). Counts are deduplicated tuples
|
||||
`(rule_id, path_suffix, line)`. Duplicate raw findings collapse on
|
||||
the diff key, so the schema-test count and diff-mode `unchanged_total`
|
||||
may differ from the `findings | length` total by a handful of
|
||||
duplicate sites. The diff key is what matters for regression
|
||||
detection.
|
||||
|
||||
### Per-lang TP/FP splits
|
||||
|
||||
Every captured finding ships with `verdict: "needs_review"` from
`--capture`. Hand-triage is bounded but still pending; none of the
cross-language captures has been sweep-labelled yet. Use the per-lang
dominant `rule_id` clusters above as the priority queue:
|
||||
|
||||
- **PHP**: `cfg-unguarded-sink` and `taint-prototype-pollution` are
|
||||
the FP-dominant clusters across drupal / nextcloud / phpmyadmin
|
||||
(CMS routing + JS object construction). `php.deser.unserialize` is
|
||||
the highest-value TP cluster on joomla (17) and drupal (83). See
|
||||
`project_realrepo_joomla.md` 2026-05-03 for the magic-method
|
||||
passthrough fix that already filters one shape.
|
||||
- **Java**: `taint-unsanitised-flow` (61) and `state-resource-leak`
|
||||
(60) are openmrs's leading clusters. The JPA Criteria-API fix
|
||||
already absorbed the `cfg-unguarded-sink` cluster (216 to 24);
|
||||
remaining Hibernate / Spring resource-management FPs are the next
|
||||
triage target.
|
||||
- **Python**: `cfg-unguarded-sink` (252) on airflow is dominated by
|
||||
Airflow's scheduler / DB plumbing; `py.auth.token_override_*`
|
||||
(83) and `py.auth.missing_ownership_check` (61) are the auth-rule
|
||||
noise typical of an admin/operator codebase.
|
||||
- **Go**: gin's 20 findings are mostly test-corpus artifacts
|
||||
(`gin_test.go`, `routes_test.go`); 4 of 4 `go.transport.insecure_skip_verify`
|
||||
hits are inside `gin*_test.go` and are legitimate test setup.
|
||||
- **Rust / Ruby**: placeholder. Capture once a local clone exists.
|
||||
|
||||
### `--lang` runner usage
|
||||
|
||||
```bash
# diff mode (default)
scripts/validate_recall.sh --lang php drupal /Users/me/oss/drupal
scripts/validate_recall.sh --lang java openmrs /Users/me/oss/openmrs

# capture / refresh
scripts/validate_recall.sh --lang go gin /Users/me/oss/gin --capture
```
|
||||
|
||||
Output is the same `{ added, removed, unchanged, *_total }` JSON shape
|
||||
as the JS-target diff. The diff key is `(rule_id, path_suffix, line)`.
|
||||
|
||||
### Cross-language refresh procedure
|
||||
|
||||
1. Clone or update the target into `~/oss/<target>` (or wherever).
|
||||
2. Build nyx: `cargo build --release`.
|
||||
3. Diff vs the frozen baseline:
|
||||
`scripts/validate_recall.sh --lang <lang> <target> ~/oss/<target>`.
|
||||
4. If the lift is intentional, recapture with `--capture`.
|
||||
5. Spot-check new findings; hand-label `TP`/`FP`.
|
||||
6. Commit the updated `tests/recall_targets/xlang/<lang>/<target>.json`.
|
||||
|
||||
### Sandbox-capture caveat
|
||||
|
||||
Pitboss implementer agents run sandboxed without network egress, so
|
||||
target repos that are not already present under `~/oss/` ship as
|
||||
placeholders (`pinned_commit: "unknown"`, `findings: []`). The
|
||||
current cross-language baselines cover php / java / python / go
|
||||
(every target whose repo was already cloned locally) and ship
|
||||
placeholders for `rust/axum`, `ruby/rails`, and `python/flask`. The
|
||||
schema test in `validate_real_world_targets` passes against
|
||||
placeholders because `[]` is a valid `findings` array.
|
||||
|
||||
## What lives where (quick reference)
|
||||
|
||||
- Targets list and recall-item mapping in this file.
|
||||
- Per-target JS findings under `tests/recall_targets/<target>.json`.
|
||||
- Per-target cross-lang findings under `tests/recall_targets/xlang/<lang>/<target>.json`.
|
||||
- Diff/capture runner at `scripts/validate_recall.sh` (accepts `--lang`).
|
||||
- Schema-validity test at `tests/recall_gaps.rs::validate_real_world_targets`.
|
||||
- Corpus regression baseline at `tests/recall_gaps_baseline.json`.
|
||||
- Perf records at `tests/recall_targets/perf_after.txt` (JS-target
|
||||
snapshot) and `tests/recall_targets/perf_after_xlang.txt`
|
||||
(cross-language delta).
|
||||