# Dynamic verification — M7 eval corpus report
This document records the precision/recall calibration that preceded the M7
default-on flip. The eval corpus comprises three sets, of which only the
in-house curated set was available at calibration time:
- **OWASP Benchmark v1.2** (Java, 2,740 test cases across 11 vulnerability classes)
- **NIST SARD selected subset** (Java, Python, C/C++)
- **In-house bughunt-curated set** (multi-language fixtures from real-world repos
  used in the `project_realrepo_*` bughunt sessions)
## Ranking calibration: N and M
The `dynamic_verdict_delta` component in `rank.rs` applies:
- `+N` (N = **20**) when `status == Confirmed`
- `-M` (M = **5**) when `status == NotConfirmed` and the corpus was exhausted
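
The rule above can be sketched as follows. This is a minimal illustration, not the code in `rank.rs`: the `VerifyStatus` enum, the `corpus_exhausted` flag, and the constant names are hypothetical, and the zero delta for all remaining cases is an assumption this report does not confirm.

```rust
/// Outcome of dynamic verification for one finding.
/// Hypothetical type: the variant names mirror the statuses named in this
/// report, but the real definition may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyStatus {
    Confirmed,
    NotConfirmed,
    Unsupported,
}

/// Bonus for a `Confirmed` verdict (N in this report).
pub const CONFIRMED_BONUS: i32 = 20;
/// Penalty for an exhausted-corpus `NotConfirmed` verdict (M in this report).
pub const NOT_CONFIRMED_PENALTY: i32 = 5;

/// Score adjustment contributed by the dynamic verdict.
/// `corpus_exhausted` is an assumed flag meaning "every payload was tried".
pub fn dynamic_verdict_delta(status: VerifyStatus, corpus_exhausted: bool) -> i32 {
    match (status, corpus_exhausted) {
        (VerifyStatus::Confirmed, _) => CONFIRMED_BONUS,
        (VerifyStatus::NotConfirmed, true) => -NOT_CONFIRMED_PENALTY,
        // Assumption: every other case leaves the static score unchanged.
        _ => 0,
    }
}
```

Under these assumptions, a `NotConfirmed` verdict on a corpus that was not exhausted carries no penalty, since the finding was never fully tested.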
### Derivation
The tier-ordering invariant requires that a `High` severity `Confirmed` finding
always ranks above a `High` severity static-only finding regardless of taint
quality. With baseline `High` score = 60 and maximum taint bonus = 10 + 6 = 16:
```
High + static-max = 76
High + Confirmed = 60 + 20 = 80 ✓ (above static-max)
```
The penalty M = 5 ensures exhausted-corpus `NotConfirmed` findings drop below
equal static-only peers without falling into a different severity tier:
```
High + NotConfirmed = 60 - 5 = 55 (below High static-only baseline 60)
Medium + static-max ≈ 46 (below 55, so High + NotConfirmed does not cross tiers)
```
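
The two constraints in this derivation can be written as executable checks. The sketch below is illustrative only: the constant and function names are hypothetical, and the Medium static-max of 46 is taken as exact although the report quotes it as approximate.

```rust
// Values quoted in this report; the names are hypothetical.
pub const HIGH_BASE: i32 = 60;
pub const MAX_TAINT_BONUS: i32 = 10 + 6;
pub const N: i32 = 20;
pub const M: i32 = 5;
/// Best Medium static-only score, quoted as approximately 46 in this report.
pub const MEDIUM_STATIC_MAX: i32 = 46;

/// True when a `Confirmed` finding with no taint bonus still outranks the
/// best static-only finding of the same severity.
pub fn confirmed_outranks_static(base: i32, max_taint: i32, n: i32) -> bool {
    base + n > base + max_taint
}

/// True when an exhausted `NotConfirmed` finding drops below its static-only
/// baseline but stays above the best score of the tier below.
pub fn penalty_stays_in_tier(base: i32, m: i32, lower_tier_max: i32) -> bool {
    let penalised = base - m;
    penalised < base && penalised > lower_tier_max
}
```

With these numbers the first check reduces to N > 16 and the second to 0 < M < 14, so N = 20 and M = 5 both leave some headroom.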
## Per-cap Unsupported rate
The table below summarises the `Unsupported` rate by (cap, language) across the
in-house curated set at M7 calibration time. Lower is better; the gate budget
is ≤ 80% per cell.
| Cap | Language | Total | Unsupported | Unsup% |
|-------------------|------------|------:|------------:|-------:|
| sqli | java | 12 | 2 | 16.7% |
| sqli | python | 18 | 3 | 16.7% |
| sqli | php | 9 | 2 | 22.2% |
| xss | javascript | 22 | 5 | 22.7% |
| xss | typescript | 14 | 4 | 28.6% |
| xss | java | 8 | 3 | 37.5% |
| cmdi | python | 11 | 2 | 18.2% |
| cmdi | go | 7 | 1 | 14.3% |
| ssrf | java | 6 | 1 | 16.7% |
| ssrf | javascript | 9 | 2 | 22.2% |
| path_traversal | php | 10 | 3 | 30.0% |
| deserialize | java | 5 | 1 | 20.0% |
All cells are well within the 80% budget. The OWASP Benchmark and SARD sets
were not available at calibration time; ground truth files should be added to
`tests/eval_corpus/ground_truth/` and `scripts/m7_ship_gate.sh` re-run when
the corpora are downloaded.
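
For reference, the per-cell check behind the Unsupported-rate table can be sketched as below. This is not the logic of `scripts/m7_ship_gate.sh`, which is not shown in this report: the `Cell` type and function names are hypothetical, and treating an empty cell as 0% is an assumption.

```rust
/// One (cap, language) cell of the Unsupported-rate table.
/// Hypothetical type used only for this sketch.
pub struct Cell {
    pub cap: &'static str,
    pub language: &'static str,
    pub total: u32,
    pub unsupported: u32,
}

/// Gate budget from this report: at most 80% Unsupported per cell.
pub const UNSUPPORTED_BUDGET_PCT: f64 = 80.0;

/// Unsupported rate as a percentage.
/// Assumption: a cell with no findings counts as 0%.
pub fn unsupported_pct(cell: &Cell) -> f64 {
    if cell.total == 0 {
        return 0.0;
    }
    100.0 * f64::from(cell.unsupported) / f64::from(cell.total)
}

/// Cells whose rate exceeds the budget; an empty result means the gate passes.
pub fn over_budget<'a>(cells: &'a [Cell], budget_pct: f64) -> Vec<&'a Cell> {
    cells
        .iter()
        .filter(|c| unsupported_pct(c) > budget_pct)
        .collect()
}
```

Returning the offending cells rather than a single boolean keeps a failure actionable: the caller can report exactly which (cap, language) pairs need more harness coverage.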
## False-Confirmed rate
Based on feedback collected from maintainer machines via
`nyx verify-feedback --wrong` during the M6.5 bughunt sessions:
| Cap | Confirmed | Wrong | Rate |
|---------|----------:|------:|------:|
| sqli | 34 | 0 | 0.0% |
| xss | 28 | 1 | 3.6% |
| cmdi | 12 | 0 | 0.0% |
| ssrf | 8 | 0 | 0.0% |
| overall | 82 | 1 | 1.2% |
The per-cap threshold is 2%. `xss` initially exceeded it at 3.6%, on a small
sample (1 wrong out of 28 `Confirmed` findings). A subsequent corpus update
resolved the payload variant that caused the false positive, and the rate at
final calibration was 0/28 for `xss`.
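
The threshold comparison can be reproduced from the table with integer arithmetic alone. The sketch below is illustrative: the names are hypothetical, and the rows are the pre-fix feedback counts from the table, not the final calibration figures.

```rust
/// (cap, confirmed, wrong) rows copied from the feedback table in this report.
pub const FEEDBACK: [(&str, u32, u32); 4] = [
    ("sqli", 34, 0),
    ("xss", 28, 1),
    ("cmdi", 12, 0),
    ("ssrf", 8, 0),
];

/// Per-cap threshold from this report, in percent.
pub const THRESHOLD_PCT: u32 = 2;

/// True when wrong / confirmed <= threshold_pct / 100.
/// Cross-multiplying keeps the comparison in integers, so no float rounding
/// can flip a borderline result.
pub fn within_threshold(confirmed: u32, wrong: u32, threshold_pct: u32) -> bool {
    u64::from(wrong) * 100 <= u64::from(confirmed) * u64::from(threshold_pct)
}

/// Sums the per-cap rows into the (confirmed, wrong) pair of the overall row.
pub fn overall(rows: &[(&str, u32, u32)]) -> (u32, u32) {
    rows.iter()
        .fold((0, 0), |(c, w), &(_, rc, rw)| (c + rc, w + rw))
}
```

On these rows `xss` fails the per-cap check (1 of 28) while the overall row (1 of 82) passes, which is why the threshold is applied per cap rather than in aggregate.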
## Gate status at M7 merge
All five pre-flip gates passed when `scripts/m7_ship_gate.sh` was run against
the in-house curated set on the merge commit:
1. **Unsupported rate** — all cells ≤ 80% ✓
2. **False-Confirmed rate** — ≤ 2% per cap ✓
3. **Wall-clock cost** — ≤ 2× static-only on benches/fixtures ✓
4. **Sandbox-escape suite** — all escape fixtures `NotConfirmed` or `Unsupported` ✓
5. **Repro stability** — 100% of in-house `Confirmed` findings regenerated an identical verdict ✓