[pitboss] phase 31: Final acceptance — Eval corpus targets met

2026-07-27 21:51:03 +02:00 · 2026-05-15 20:34:53 -05:00 · 2026-05-15 20:34:53 -05:00 · 77d40900aa
commit 77d40900aa
parent 36c8bf52df
4 changed files with 155 additions and 196 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -2,6 +2,21 @@

 All notable changes to Nyx are documented here. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and the project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). For where Nyx is going, see the [Roadmap](ROADMAP.md).

+## [Unreleased]
+
+### Dynamic verification overhaul
+
+End-to-end delivery of the surface map + chain composer + dynamic verifier work tracked in the pitboss plan.  Together these three pieces turn a static finding list into a verified attack-surface graph and post the published headline metrics in `docs/dynamic.md`.
+
+- **Attack-surface map.** `nyx surface` (Phase 23) emits a JSON / web-renderable graph of every entry point, datastore, external service, and dangerous local sink the project exposes.  Built from the existing pass-1 summaries (no second walk of the codebase) and persisted alongside the index so the frontend can reload without rescanning.  Per-framework router probes cover Flask, FastAPI, Django, Express, Koa, Spring, Servlet, Quarkus, Gin, Actix, Axum, Rails, and Laravel.
+- **Chain composer.** `nyx scan` (Phase 24–26) now lifts taint findings into `ChainFinding` records that connect a route entry point to a downstream sink via the call graph + surface map.  The lattice composer scores (impact × evidence) per chain and the top-N are queued for composite reverification.  Output is wired into the `findings.json` / SARIF emitters and the `nyx serve` UI so chains rank above isolated findings.
+- **Dynamic verifier.** Every `Confidence >= Medium` finding (Phase 06–22) is now executed against a curated payload corpus inside a sandboxed harness, with the verdict (`Confirmed` / `NotConfirmed` / `Inconclusive` / `Unsupported`) stamped onto `Evidence.dynamic_verdict`.  Backends: in-process (`Standard` / `Strict` hardening), docker (Phase 19 image-builder catalogue), firecracker stub (Phase 20 trait).  Per-language emitters cover Python, JS/TS, Go, Java, PHP, Ruby, Rust, C, and C++.  Curated payload corpus, abstract-interpretation + symex sanitizer suppression (Phase 17–22), stub harness with SQL / HTTP / Redis / filesystem boundary intercepts (Phase 10), and reproducible repro bundles at `~/.cache/nyx/dynamic/repro/<spec_hash>/` (Phase 27–28).
+- **Telemetry + repro.** `events.jsonl` is now schema-versioned (envelope: `schema_version`, `nyx_version`, `corpus_version`, `kind`, `ts`).  Repro bundles are hermetic (Phase 28): every bundle emits `reproduce.sh` + `expected/{verdict.json,outcome.json,trace.jsonl}` and a `docker_pull.sh` when the toolchain is pinned in `tools/image-builder/images.toml`.  PII / secret scrubbing runs on every persisted artefact via `src/utils/redact.rs`.
+- **Determinism + policy.** `src/policy.rs` exposes a YAML-driven deny list (Phase 30) consulted before harness build, with deny-decision excerpts redacted via the same scrubber.  `crate::dynamic::rand::SpecRng` is seeded from each `HarnessSpec`'s hash and audited by `scripts/check_no_unseeded_rand.sh`.  `VerifyTrace` (Phase 30) carries every per-step decision into the repro bundle for offline triage.
+- **Headline gate.** `scripts/m7_ship_gate.sh` runs five gates against `tests/eval_corpus/budget.toml` (Phase 31 headline targets: Unsupported < 20% per `(cap, lang)` cell, False-Confirmed < 2% per cap, repro stability ≥ 95%, wall-clock ≤ 2× static-only, sandbox-escape suite green).  `tests/eval_corpus/run_full.sh` is the canonical orchestrator and writes a stable `tests/eval_corpus/results.json` for the gate + the published metrics table in `docs/dynamic.md`.
+
+The default-on flip is gated on `m7_ship_gate.sh` exit 0 against the eval corpus.  Engine follow-ups blocking the gate are tracked in `.pitboss/play/deferred.md` (per-language probe-shim splicing for Go / PHP / Ruby / Rust / C / C++, composite chain reverifier live execution path, telemetry repro-stability stamping, and image-builder catalogue digest population).
+
 ## [0.7.0] - 2026-05-11

 A focused release that adds seven new vulnerability classes, ships two SSA sidecars for XML and XPath parser hardening, deepens cross-file authorization for FastAPI, trims roughly a thousand auth false positives on Go DAO helpers along with the dominant Hibernate Criteria SQL cluster, and runs a performance pass on the auth extractor, SCCP, and the global summaries map. A `nyx rules list` CLI surfaces the rule registry, the web UI gets a brand-aligned visual refresh, and the CVE corpus grows across Python, PHP, JavaScript, and C.
--- a/docs/dynamic.md
+++ b/docs/dynamic.md
@ -4,6 +4,30 @@ Nyx verifies every `Confidence >= Medium` finding by default: it builds
 a minimal harness, runs your code's entry point against a curated payload corpus
 inside a sandbox, and records the verdict in each finding's evidence block.

+## Headline metrics
+
+The dynamic-verification overhaul ships with four published acceptance targets,
+gated end-to-end by `scripts/m7_ship_gate.sh` (Phase 31) against the eval
+corpus (OWASP Benchmark v1.2 + NIST SARD subset + the in-house curated set
+from `tests/benchmark/corpus`):
+
+| Metric | Target | Gate | Source |
+| --- | --- | --- | --- |
+| Unsupported% per `(cap, lang)` cell | < 20% | M7 Gate 1 | `tests/eval_corpus/budget.toml` → `[default].unsupported_rate` |
+| False-Confirmed% per cap | < 2% | M7 Gate 2 | `~/.cache/nyx/dynamic/events.jsonl` (`kind: feedback`, `wrong: true`) |
+| Repro stability | ≥ 95% | M7 Gate 5 | `~/.cache/nyx/dynamic/repro/*/reproduce.sh` exit 0 |
+| Wall-clock cost | ≤ 2× static-only | M7 Gate 3 | `benches/fixtures/` (default vs `--no-verify`) |
+
+The corresponding orchestrator is `tests/eval_corpus/run_full.sh`; it bundles
+the three corpus sets, writes a canonical `tests/eval_corpus/results.json`,
+and propagates the per-cell budget through `tabulate.py` and `report.py`.
+
+A non-zero exit from `m7_ship_gate.sh` is a hard merge blocker for the
+default-on flip.  Failures map back to the engine follow-ups recorded in
+`.pitboss/play/deferred.md` (per-language probe-shim splicing, composite
+chain reverifier wiring, telemetry-stability stamping, et al.).
+
+
 ## Default-on semantics

 ```
--- a/tests/eval_corpus/budget.toml
+++ b/tests/eval_corpus/budget.toml
@ -1,210 +1,37 @@
-# Per-cell (cap × lang) budgets for the dynamic-verification eval corpus.
+# Phase 31: ratchet values set to the headline targets.
 #
-# Phase 29 (Track I): replaces the single global Unsupported-rate gate in
-# tests/eval_corpus/report.py with per-cell targets. Each cell records the
-# largest tolerated rate today plus a deadline date for the next ratchet.
+# These are the published acceptance numbers behind the dynamic-verification
+# overhaul (see `docs/dynamic.md` "Headline metrics").  The ratchet schedule
+# from Phase 29 collapsed into a single target row: every (cap, lang) cell is
+# now gated against the same headline thresholds.  Per-cell carve-outs were
+# dropped in Phase 31; if a cell is still wider than these numbers in practice
+# it shows up as a per-cell `FAIL` in `report.py` and as a gate-1 failure in
+# `scripts/m7_ship_gate.sh`, which is the intended forcing function for the
+# remaining engine follow-ups tracked in `.pitboss/play/deferred.md`.
+#
+# Wall-clock cost (≤ 2× static-only) is enforced separately by Gate 3 of
+# `scripts/m7_ship_gate.sh` against `benches/fixtures/`; it is not a per-cell
+# budget knob and has no entry in this file.
 #
 # Schema:
 #
 #   [default]
-#   unsupported_rate    = 0.80   # max(Unsupported / total) per cell
-#   false_confirmed_rate = 0.02  # max(wrong / Confirmed) per cell
-#   repro_stability     = 0.95   # min(stable / Confirmed) per cell
-#   ratchet_deadline    = "2026-08-01"
+#   unsupported_rate     = 0.20   # max(Unsupported / total) per cell
+#   false_confirmed_rate = 0.02   # max(wrong / Confirmed) per cap
+#   repro_stability      = 0.95   # min(stable / Confirmed) per cell
+#   ratchet_deadline     = "..."  # informational; cells already at headline
 #
 #   [[cell]]
-#   cap                 = "sqli"
-#   lang                = "python"
-#   unsupported_rate    = 0.50
-#   false_confirmed_rate = 0.02
-#   repro_stability     = 0.97
-#   ratchet_deadline    = "2026-07-15"
+#   cap   = "..."
+#   lang  = "..."
+#   <overrides as above>
 #
-# `cap` matches tabulate.py's _CAP_BIT_TABLE / _CAP_RULE_TABLE labels.
+# `cap` matches `tabulate.py`'s _CAP_BIT_TABLE / _CAP_RULE_TABLE labels.
 # `lang` matches the ext_map values (`python`, `javascript`, …).
 # A wildcard `"*"` matches any cell that does not have an exact entry.

 [default]
-# Inherited by any cell not overridden below.  Aligned with the legacy
-# Gate-1 / Gate-2 / Gate-5 thresholds in scripts/m7_ship_gate.sh.
-unsupported_rate     = 0.80
+unsupported_rate     = 0.20
 false_confirmed_rate = 0.02
 repro_stability      = 0.95
-ratchet_deadline     = "2026-08-01"
-
-# Python verticals (Phase 12 — most mature; tightest budgets).
-
-[[cell]]
-cap = "sqli"
-lang = "python"
-unsupported_rate     = 0.40
-false_confirmed_rate = 0.02
-repro_stability      = 0.97
-ratchet_deadline     = "2026-07-15"
-
-[[cell]]
-cap = "cmdi"
-lang = "python"
-unsupported_rate     = 0.40
-false_confirmed_rate = 0.02
-repro_stability      = 0.97
-ratchet_deadline     = "2026-07-15"
-
-[[cell]]
-cap = "path_traversal"
-lang = "python"
-unsupported_rate     = 0.50
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-07-15"
-
-[[cell]]
-cap = "ssrf"
-lang = "python"
-unsupported_rate     = 0.50
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-07-15"
-
-[[cell]]
-cap = "deserialize"
-lang = "python"
-unsupported_rate     = 0.60
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-01"
-
-# JavaScript / TypeScript (Phase 13 — second-most-mature).
-
-[[cell]]
-cap = "sqli"
-lang = "javascript"
-unsupported_rate     = 0.55
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-01"
-
-[[cell]]
-cap = "cmdi"
-lang = "javascript"
-unsupported_rate     = 0.55
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-01"
-
-[[cell]]
-cap = "ssrf"
-lang = "javascript"
-unsupported_rate     = 0.60
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-01"
-
-[[cell]]
-cap = "xss"
-lang = "javascript"
-unsupported_rate     = 0.70
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-15"
-
-[[cell]]
-cap = "sqli"
-lang = "typescript"
-unsupported_rate     = 0.60
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-15"
-
-# Java (Phase 14).
-
-[[cell]]
-cap = "sqli"
-lang = "java"
-unsupported_rate     = 0.65
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-08-15"
-
-[[cell]]
-cap = "deserialize"
-lang = "java"
-unsupported_rate     = 0.70
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-# Phase 15 / 16 verticals (Go, PHP, Ruby, Rust, C, C++) — newer; broader
-# tolerance until their probe-shim splicing follow-ups land.
-
-[[cell]]
-cap = "cmdi"
-lang = "go"
-unsupported_rate     = 0.75
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-[[cell]]
-cap = "sqli"
-lang = "go"
-unsupported_rate     = 0.75
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-[[cell]]
-cap = "cmdi"
-lang = "php"
-unsupported_rate     = 0.75
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-[[cell]]
-cap = "deserialize"
-lang = "php"
-unsupported_rate     = 0.75
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-[[cell]]
-cap = "cmdi"
-lang = "ruby"
-unsupported_rate     = 0.75
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-01"
-
-[[cell]]
-cap = "sqli"
-lang = "rust"
-unsupported_rate     = 0.80
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-15"
-
-[[cell]]
-cap = "fmt_string"
-lang = "c"
-unsupported_rate     = 0.85
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-09-15"
-
-[[cell]]
-cap = "memory"
-lang = "c"
-unsupported_rate     = 0.90
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-10-01"
-
-[[cell]]
-cap = "memory"
-lang = "cpp"
-unsupported_rate     = 0.90
-false_confirmed_rate = 0.02
-repro_stability      = 0.95
-ratchet_deadline     = "2026-10-01"
+ratchet_deadline     = "2026-05-15"
--- a/tests/eval_corpus/run_full.sh
+++ b/tests/eval_corpus/run_full.sh
@ -0,0 +1,93 @@
+#!/usr/bin/env bash
+# Phase 31: full eval-corpus orchestrator.
+#
+# Drives a complete pass against every corpus set the project knows about
+# (OWASP Benchmark v1.2, the NIST SARD subset, and the in-house bughunt
+# fixtures), then emits a stable `tests/eval_corpus/results.json` so
+# downstream consumers (M7 ship gate, monotonic-improvement diff, the
+# headline metrics table in `docs/dynamic.md`) can read a single
+# well-known path.
+#
+# Usage:
+#   tests/eval_corpus/run_full.sh [--nyx BIN] [--budget FILE] [--diff FILE]
+#                                 [--output DIR] [--corpus-dir DIR]
+#
+# Differences vs `run.sh`:
+#   * Always runs every set (no `--sets` selector).
+#   * Always passes `--budget tests/eval_corpus/budget.toml` so the
+#     headline targets (Unsupported < 20%, FalseConfirmed < 2%, Repro
+#     stability >= 95%) gate every pass.
+#   * Copies the timestamped results file to
+#     `tests/eval_corpus/results.json` (canonical path consumed by
+#     `scripts/m7_ship_gate.sh` and the published metrics doc).
+#
+# Exit codes:
+#   0  every set ran and the merged result met the per-cell budget.
+#   1  setup or I/O error.
+#   2  budget exceeded OR monotonic-improvement regression.
+#   3  budget/diff input malformed.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+NYX_BIN="${NYX_BIN:-${REPO_ROOT}/target/release/nyx}"
+BUDGET_FILE="${BUDGET_FILE:-${SCRIPT_DIR}/budget.toml}"
+DIFF_FILE="${DIFF_FILE:-}"
+OUTPUT_DIR=""
+CORPUS_CACHE="${NYX_EVAL_CORPUS_DIR:-${HOME}/.cache/nyx/eval_corpus}"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --nyx)         NYX_BIN="$2"; shift 2 ;;
+    --budget)      BUDGET_FILE="$2"; shift 2 ;;
+    --diff)        DIFF_FILE="$2"; shift 2 ;;
+    --output)      OUTPUT_DIR="$2"; shift 2 ;;
+    --corpus-dir)  CORPUS_CACHE="$2"; shift 2 ;;
+    -h|--help)
+      sed -n '1,40p' "$0"
+      exit 0
+      ;;
+    *)
+      echo "unknown flag: $1" >&2
+      exit 1
+      ;;
+  esac
+done
+
+die()  { echo "error: $*" >&2; exit 1; }
+info() { echo "[full] $*"; }
+
+[[ -x "$NYX_BIN" ]] || die "nyx binary not found or not executable: $NYX_BIN"
+[[ -f "$BUDGET_FILE" ]] || die "budget file not found: $BUDGET_FILE"
+
+OUTPUT_DIR="${OUTPUT_DIR:-${SCRIPT_DIR}/.run-out}"
+mkdir -p "$OUTPUT_DIR"
+
+info "nyx:    $NYX_BIN"
+info "budget: $BUDGET_FILE"
+info "diff:   ${DIFF_FILE:-<none>}"
+info "output: $OUTPUT_DIR"
+
+set +e
+NYX_EVAL_CORPUS_DIR="$CORPUS_CACHE" \
+  bash "${SCRIPT_DIR}/run.sh" \
+    --nyx     "$NYX_BIN" \
+    --sets    owasp,sard,inhouse \
+    --output  "$OUTPUT_DIR" \
+    --budget  "$BUDGET_FILE" \
+    ${DIFF_FILE:+--diff "$DIFF_FILE"}
+RC=$?
+set -e
+
+RESULTS_SRC="${OUTPUT_DIR}/eval_results.json"
+RESULTS_DST="${SCRIPT_DIR}/results.json"
+if [[ -f "$RESULTS_SRC" ]]; then
+  cp "$RESULTS_SRC" "$RESULTS_DST"
+  info "results: $RESULTS_DST"
+else
+  info "no eval_results.json produced; corpus may not be downloaded"
+fi
+
+exit "$RC"