[pitboss/grind] deferred session-0002 (20260521T143544Z-f898)

2026-06-15 20:05:13 +02:00 · 2026-05-21 11:22:13 -05:00 · 2026-05-21 11:22:13 -05:00 · b3766311fb
commit b3766311fb
parent be4021d8c0
20 changed files with 388 additions and 664 deletions
--- a/tests/eval_corpus/budget.toml
+++ b/tests/eval_corpus/budget.toml
@@ -1,17 +1,10 @@
-# Phase 31: ratchet values set to the headline targets.
+# Eval corpus budget.
 #
-# These are the published acceptance numbers behind the dynamic-verification
-# overhaul (see `docs/dynamic.md` "Headline metrics").  The ratchet schedule
-# from Phase 29 collapsed into a single target row: every (cap, lang) cell is
-# now gated against the same headline thresholds.  Per-cell carve-outs were
-# dropped in Phase 31; if a cell is still wider than these numbers in practice
-# it shows up as a per-cell `FAIL` in `report.py` and as a gate-1 failure in
-# `scripts/m7_ship_gate.sh`, which is the intended forcing function for the
-# remaining engine follow-ups tracked in `.pitboss/play/deferred.md`.
+# `report.py` enforces these values when `run.sh` or `run_full.sh` pass
+# `--budget`. Each (cap, lang) cell uses the default row unless a specific
+# override appears below.
 #
-# Wall-clock cost (≤ 2× static-only) is enforced separately by Gate 3 of
-# `scripts/m7_ship_gate.sh` against `benches/fixtures/`; it is not a per-cell
-# budget knob and has no entry in this file.
+# Wall-clock cost is measured separately from this per-cell budget.
 #
 # Schema:
 #
--- a/tests/eval_corpus/run.sh
+++ b/tests/eval_corpus/run.sh
@@ -1,23 +1,23 @@
 #!/usr/bin/env bash
-# Eval corpus runner for M7 pre-flip gate calibration.
+# Eval corpus runner.
 #
 # Usage:
 #   tests/eval_corpus/run.sh [--output DIR] [--nyx BIN] [--sets owasp,sard,inhouse]
 #
-# Bootstraps OWASP Benchmark v1.2, NIST SARD subset, and in-house
-# bughunt-curated fixtures. Runs `nyx scan --verify` on each. Emits
+# Bootstraps OWASP Benchmark v1.2, the NIST SARD subset, and Nyx benchmark
+# fixtures. Runs `nyx scan --verify` on each. Emits
 # per-cell (cap x language) precision/recall table and per-cap Unsupported
 # rate to stdout (and --output DIR if given).
 #
 # Environment:
-#   NYX_EVAL_CORPUS_DIR  — path to pre-downloaded corpus roots
+#   NYX_EVAL_CORPUS_DIR  - path to pre-downloaded corpus roots
 #                          (default: ~/.cache/nyx/eval_corpus)
-#   NYX_BIN              — path to nyx binary (default: ./target/release/nyx)
+#   NYX_BIN              - path to nyx binary (default: ./target/release/nyx)
 #
 # Exit codes:
-#   0 — all gate thresholds met
-#   1 — setup or I/O error
-#   2 — one or more gate thresholds exceeded (see output for details)
+#   0 - all budget thresholds met
+#   1 - setup or I/O error
+#   2 - one or more budget thresholds exceeded (see output for details)

 set -euo pipefail

@@ -173,9 +173,8 @@ python3 "${SCRIPT_DIR}/report.py" \
  ${DIFF_FILE:+--diff "$DIFF_FILE"}
 REPORT_RC=$?
 set -e
-# Propagate gate-fail (exit 2) and malformed-config (exit 3) so the
-# m7_ship_gate.sh Gate-1 dispatch can tell them apart.  Treat other
-# non-zero as setup error (exit 1).
+# Propagate budget failures (exit 2) and malformed config (exit 3). Treat other
+# non-zero exits as setup errors.
 if [[ $REPORT_RC -eq 2 ]]; then
  exit 2
 elif [[ $REPORT_RC -eq 3 ]]; then
--- a/tests/eval_corpus/run_full.sh
+++ b/tests/eval_corpus/run_full.sh
@@ -1,12 +1,10 @@
 #!/usr/bin/env bash
-# Phase 31: full eval-corpus orchestrator.
+# Full eval-corpus orchestrator.
 #
 # Drives a complete pass against every corpus set the project knows about
-# (OWASP Benchmark v1.2, the NIST SARD subset, and the in-house bughunt
-# fixtures), then emits a stable `tests/eval_corpus/results.json` so
-# downstream consumers (M7 ship gate, monotonic-improvement diff, the
-# headline metrics table in `docs/dynamic.md`) can read a single
-# well-known path.
+# (OWASP Benchmark v1.2, the NIST SARD subset, and the Nyx benchmark
+# fixtures), then emits `tests/eval_corpus/results.json` for reports,
+# diffs, and docs.
 #
 # Usage:
 #   tests/eval_corpus/run_full.sh [--nyx BIN] [--budget FILE] [--diff FILE]
@@ -15,11 +13,9 @@
 # Differences vs `run.sh`:
 #   * Always runs every set (no `--sets` selector).
 #   * Always passes `--budget tests/eval_corpus/budget.toml` so the
-#     headline targets (Unsupported < 20%, FalseConfirmed < 2%, Repro
-#     stability >= 95%) gate every pass.
+#     configured per-cell limits are checked on every pass.
 #   * Copies the timestamped results file to
-#     `tests/eval_corpus/results.json` (canonical path consumed by
-#     `scripts/m7_ship_gate.sh` and the published metrics doc).
+#     `tests/eval_corpus/results.json`.
 #
 # Exit codes:
 #   0  every set ran and the merged result met the per-cell budget.
--- a/tests/eval_corpus/tabulate.py
+++ b/tests/eval_corpus/tabulate.py
@@ -415,8 +415,8 @@ def main() -> int:
            elif status == "Confirmed":
                cells[key]["confirmed"] += 1
                # Repro-stability and false-Confirmed counts are optional
-                # fields tabulate.py reads off the verdict when callers
-                # (m7_ship_gate.sh / corpus_promote.yml) have stamped them.
+                # fields tabulate.py reads off the verdict when callers have
+                # stamped them.
                if dv.get("wrong") is True:
                    cells[key]["wrong_confirmed"] += 1
                if dv.get("replay_stable") is True:
--- a/tests/telemetry_schema.rs
+++ b/tests/telemetry_schema.rs
@@ -1,14 +1,13 @@
-//! Phase 27 — Track H.1 integration test.
+//! Dynamic telemetry schema tests.
 //!
-//! Locks in the on-disk telemetry schema contract that `scripts/m7_ship_gate.sh`
-//! Gate 2 relies on:
+//! Locks in the on-disk telemetry schema contract:
 //!
 //! - Records produced today carry the `schema_version`, `nyx_version`, and
 //!   `corpus_version` envelope fields, plus a `kind` discriminator.
 //! - `read_events(path)` accepts the current schema.
 //! - A hand-crafted record with `schema_version: 0` is rejected by
 //!   `read_events` with a typed [`TelemetryReadError::SchemaMismatch`] (this
-//!   is the explicit Phase 27 acceptance bullet).
+//!   is the required failure mode for mixed-schema logs).
 //! - The sampling policy retains Confirmed and Inconclusive verdicts even at
 //!   `sample_rate_other = 0.0`.
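
Reviewer note: the contract described in the `tests/telemetry_schema.rs` doc comment can be pictured with a small Python analogue. This is only a sketch under stated assumptions, not the crate's Rust implementation. The JSON-lines layout, the value of `CURRENT_SCHEMA_VERSION`, the `SchemaMismatch` class, and the `keep` helper are all hypothetical stand-ins; only the field names (`schema_version`, `nyx_version`, `corpus_version`, `kind`), the `read_events` name, and the retention rule come from the comment itself.

```python
import json

# Assumed value for illustration; the real constant lives in the Rust crate.
CURRENT_SCHEMA_VERSION = 1


class SchemaMismatch(Exception):
    """Hypothetical analogue of TelemetryReadError::SchemaMismatch."""

    def __init__(self, found, expected):
        super().__init__(f"schema_version {found!r}, expected {expected}")
        self.found = found
        self.expected = expected


def read_events(lines, expected=CURRENT_SCHEMA_VERSION):
    """Parse JSON-lines records, failing on the first mismatched schema."""
    events = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        found = record.get("schema_version")
        if found != expected:
            # A typed error rather than a silent skip, so a mixed-schema
            # log cannot be partially consumed.
            raise SchemaMismatch(found, expected)
        events.append(record)
    return events


def keep(verdict, sample_rate_other, draw):
    """Sampling sketch: Confirmed and Inconclusive are always retained.

    `draw` is a caller-supplied number in [0, 1) so the rule stays
    deterministic in tests.
    """
    if verdict in ("Confirmed", "Inconclusive"):
        return True
    return draw < sample_rate_other
```

With these definitions, a record carrying `schema_version: 0` raises `SchemaMismatch`, and `keep("Confirmed", 0.0, draw)` stays true for any `draw`, which mirrors the last two bullets of the doc comment.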