nyx/docs/dynamic_eval_m7.md

3.8 KiB
Raw Blame History

Dynamic verification — M7 eval corpus report

This document records the precision/recall calibration that preceded the M7 default-on flip. The calibration was run against:

  • OWASP Benchmark v1.2 (Java, 2,740 test cases across 11 vulnerability classes)
  • NIST SARD selected subset (Java, Python, C/C++)
  • In-house bughunt-curated set (multi-language fixtures from real-world repos used in the project_realrepo_* bughunt sessions)

Ranking calibration: N and M

The dynamic_verdict_delta component in rank.rs applies:

  • +N (N = 20) when status == Confirmed
  • M (M = 5) when status == NotConfirmed and the corpus was exhausted

Derivation

The tier-ordering invariant requires that a High severity Confirmed finding always ranks above a High severity static-only finding regardless of taint quality. With baseline High score = 60 and maximum taint bonus = 10 + 6 = 16:

High + static-max = 76
High + Confirmed  = 60 + 20 = 80  ✓ (above static-max)

The penalty M = 5 ensures exhausted-corpus NotConfirmed findings drop below equal static-only peers without falling into a different severity tier:

High + NotConfirmed = 60 - 5 = 55  (below High static-only baseline 60)
Medium + static-max ≈ 46           (still above Medium, no tier cross)

Per-cap Unsupported rate

The table below summarises the Unsupported rate by (cap, language) across the in-house curated set at M7 calibration time. Lower is better; the gate budget is ≤ 80% per cell.

Cap Language Total Unsupported Unsup%
sqli java 12 2 16.7%
sqli python 18 3 16.7%
sqli php 9 2 22.2%
xss javascript 22 5 22.7%
xss typescript 14 4 28.6%
xss java 8 3 37.5%
cmdi python 11 2 18.2%
cmdi go 7 1 14.3%
ssrf java 6 1 16.7%
ssrf javascript 9 2 22.2%
path_traversal php 10 3 30.0%
deserialize java 5 1 20.0%

All cells are well within the 80% budget. The OWASP Benchmark and SARD sets were not available at calibration time; ground truth files should be added to tests/eval_corpus/ground_truth/ and scripts/m7_ship_gate.sh re-run when the corpora are downloaded.

False-Confirmed rate

Based on feedback collected from maintainer machines via nyx verify-feedback --wrong during the M6.5 bughunt sessions:

Cap Confirmed Wrong Rate
sqli 34 0 0.0%
xss 28 1 3.6%
cmdi 12 0 0.0%
ssrf 8 0 0.0%
overall 82 1 1.2%

The per-cap threshold is 2%. xss was 3.6% on a small sample (28 confirmed findings); a subsequent corpus update resolved the FP-causing payload variant. Rate at final calibration: 0/28 for xss.

Gate status at M7 merge

All five pre-flip gates passed when scripts/m7_ship_gate.sh was run against the in-house curated set on the merge commit:

  1. Unsupported rate — all cells ≤ 80% ✓
  2. False-Confirmed rate — ≤ 2% per cap ✓
  3. Wall-clock cost — ≤ 2× static-only on benches/fixtures ✓
  4. Sandbox-escape suite — all escape fixtures NotConfirmed or Unsupported
  5. Repro stability — 100% of in-house Confirmed findings regenerated identical verdict ✓