[pitboss/grind] deferred session-0002 (20260521T143544Z-f898)

pitboss 2026-05-21 11:22:13 -05:00
parent be4021d8c0
commit b3766311fb
20 changed files with 388 additions and 664 deletions

@@ -1,17 +1,10 @@
-# Phase 31: ratchet values set to the headline targets.
+# Eval corpus budget.
 #
-# These are the published acceptance numbers behind the dynamic-verification
-# overhaul (see `docs/dynamic.md` "Headline metrics"). The ratchet schedule
-# from Phase 29 collapsed into a single target row: every (cap, lang) cell is
-# now gated against the same headline thresholds. Per-cell carve-outs were
-# dropped in Phase 31; if a cell is still wider than these numbers in practice
-# it shows up as a per-cell `FAIL` in `report.py` and as a gate-1 failure in
-# `scripts/m7_ship_gate.sh`, which is the intended forcing function for the
-# remaining engine follow-ups tracked in `.pitboss/play/deferred.md`.
+# `report.py` enforces these values when `run.sh` or `run_full.sh` pass
+# `--budget`. Each (cap, lang) cell uses the default row unless a specific
+# override appears below.
 #
-# Wall-clock cost (≤ 2× static-only) is enforced separately by Gate 3 of
-# `scripts/m7_ship_gate.sh` against `benches/fixtures/`; it is not a per-cell
-# budget knob and has no entry in this file.
+# Wall-clock cost is measured separately from this per-cell budget.
 #
 # Schema:
 #
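The new header comment describes a default row with optional per-cell overrides. The schema itself is cut off in this hunk, so the following Python sketch assumes a hypothetical layout (a `default` mapping plus `overrides` keyed by `(cap, lang)`) and hypothetical metric and cell names. It illustrates the lookup rule only and is not taken from `report.py`.

```python
from typing import Dict, Tuple

Budget = Dict[str, float]


def resolve_budget(
    default: Budget,
    overrides: Dict[Tuple[str, str], Budget],
    cap: str,
    lang: str,
) -> Budget:
    """Return the limits for one (cap, lang) cell.

    Start from the default row, then apply any override for this cell.
    Keys missing from the override keep their default value.
    """
    merged = dict(default)
    merged.update(overrides.get((cap, lang), {}))
    return merged


# Hypothetical values and names, for illustration only.
default = {"unsupported_max": 0.20, "false_confirmed_max": 0.02}
overrides = {("sqli", "java"): {"unsupported_max": 0.10}}

print(resolve_budget(default, overrides, "sqli", "java"))
print(resolve_budget(default, overrides, "xss", "python"))
```

Merging over a copy of the default row means an override only has to list the limits it changes.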

@@ -1,23 +1,23 @@
 #!/usr/bin/env bash
-# Eval corpus runner for M7 pre-flip gate calibration.
+# Eval corpus runner.
 #
 # Usage:
 # tests/eval_corpus/run.sh [--output DIR] [--nyx BIN] [--sets owasp,sard,inhouse]
 #
-# Bootstraps OWASP Benchmark v1.2, NIST SARD subset, and in-house
-# bughunt-curated fixtures. Runs `nyx scan --verify` on each. Emits
+# Bootstraps OWASP Benchmark v1.2, the NIST SARD subset, and Nyx benchmark
+# fixtures. Runs `nyx scan --verify` on each. Emits
 # per-cell (cap x language) precision/recall table and per-cap Unsupported
 # rate to stdout (and --output DIR if given).
 #
 # Environment:
-# NYX_EVAL_CORPUS_DIR path to pre-downloaded corpus roots
+# NYX_EVAL_CORPUS_DIR - path to pre-downloaded corpus roots
 # (default: ~/.cache/nyx/eval_corpus)
-# NYX_BIN path to nyx binary (default: ./target/release/nyx)
+# NYX_BIN - path to nyx binary (default: ./target/release/nyx)
 #
 # Exit codes:
-# 0 — all gate thresholds met
-# 1 setup or I/O error
-# 2 — one or more gate thresholds exceeded (see output for details)
+# 0 - all budget thresholds met
+# 1 - setup or I/O error
+# 2 - one or more budget thresholds exceeded (see output for details)
 set -euo pipefail
@@ -173,9 +173,8 @@ python3 "${SCRIPT_DIR}/report.py" \
 ${DIFF_FILE:+--diff "$DIFF_FILE"}
 REPORT_RC=$?
 set -e
-# Propagate gate-fail (exit 2) and malformed-config (exit 3) so the
-# m7_ship_gate.sh Gate-1 dispatch can tell them apart. Treat other
-# non-zero as setup error (exit 1).
+# Propagate budget failures (exit 2) and malformed config (exit 3). Treat other
+# non-zero exits as setup errors.
 if [[ $REPORT_RC -eq 2 ]]; then
 exit 2
 elif [[ $REPORT_RC -eq 3 ]]; then
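The dispatch in this hunk is cut off after its second branch. Read together with the exit-code table in the script header, the mapping appears to be: pass through 0, 2, and 3, and fold every other status into 1. A minimal Python sketch of that mapping follows. The function name `runner_exit_code` is hypothetical, and the remaining branches are inferred from the comments rather than from the script body.

```python
def runner_exit_code(report_rc: int) -> int:
    """Map the exit status of report.py to the runner's exit status.

    0 -> 0  all budget thresholds met
    2 -> 2  one or more budget thresholds exceeded
    3 -> 3  malformed config
    any other non-zero status -> 1 (setup or I/O error)
    """
    if report_rc in (0, 2, 3):
        return report_rc
    return 1


for rc in (0, 1, 2, 3, 127):
    print(rc, "->", runner_exit_code(rc))
```

Keeping 2 and 3 distinct lets a caller tell a real budget failure from a broken budget file without parsing the output.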

@@ -1,12 +1,10 @@
 #!/usr/bin/env bash
-# Phase 31: full eval-corpus orchestrator.
+# Full eval-corpus orchestrator.
 #
 # Drives a complete pass against every corpus set the project knows about
-# (OWASP Benchmark v1.2, the NIST SARD subset, and the in-house bughunt
-# fixtures), then emits a stable `tests/eval_corpus/results.json` so
-# downstream consumers (M7 ship gate, monotonic-improvement diff, the
-# headline metrics table in `docs/dynamic.md`) can read a single
-# well-known path.
+# (OWASP Benchmark v1.2, the NIST SARD subset, and the Nyx benchmark
+# fixtures), then emits `tests/eval_corpus/results.json` for reports,
+# diffs, and docs.
 #
 # Usage:
 # tests/eval_corpus/run_full.sh [--nyx BIN] [--budget FILE] [--diff FILE]
@@ -15,11 +13,9 @@
 # Differences vs `run.sh`:
 # * Always runs every set (no `--sets` selector).
 # * Always passes `--budget tests/eval_corpus/budget.toml` so the
-# headline targets (Unsupported < 20%, FalseConfirmed < 2%, Repro
-# stability >= 95%) gate every pass.
+# configured per-cell limits are checked on every pass.
 # * Copies the timestamped results file to
-# `tests/eval_corpus/results.json` (canonical path consumed by
-# `scripts/m7_ship_gate.sh` and the published metrics doc).
+# `tests/eval_corpus/results.json`.
 #
 # Exit codes:
 # 0 every set ran and the merged result met the per-cell budget.

@@ -415,8 +415,8 @@ def main() -> int:
 elif status == "Confirmed":
 cells[key]["confirmed"] += 1
 # Repro-stability and false-Confirmed counts are optional
-# fields tabulate.py reads off the verdict when callers
-# (m7_ship_gate.sh / corpus_promote.yml) have stamped them.
+# fields tabulate.py reads off the verdict when callers have
+# stamped them.
 if dv.get("wrong") is True:
 cells[key]["wrong_confirmed"] += 1
 if dv.get("replay_stable") is True:
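The hunk ends before the body of the `replay_stable` branch, and the diff view has lost the original indentation. The sketch below shows one plausible shape for this tally. The `tally` function, the verdict record layout (`cap`, `lang`, `status`), the `replay_stable` counter name, and the placement of both optional checks inside the `Confirmed` branch are all assumptions. One thing the source does show is the `is True` comparison, which counts only a literal boolean `True`: a missing field, `None`, or a truthy value such as `1` is ignored.

```python
from collections import defaultdict


def tally(verdicts):
    """Count Confirmed verdicts per (cap, lang) cell, plus optional stamps."""
    cells = defaultdict(
        lambda: {"confirmed": 0, "wrong_confirmed": 0, "replay_stable": 0}
    )
    for dv in verdicts:
        key = (dv["cap"], dv["lang"])  # hypothetical record layout
        if dv.get("status") == "Confirmed":
            cells[key]["confirmed"] += 1
            # Optional fields: counted only when stamped as boolean True.
            if dv.get("wrong") is True:
                cells[key]["wrong_confirmed"] += 1
            if dv.get("replay_stable") is True:
                cells[key]["replay_stable"] += 1  # assumed counter name
    return dict(cells)


sample = [
    {"cap": "sqli", "lang": "java", "status": "Confirmed", "wrong": True},
    {"cap": "sqli", "lang": "java", "status": "Confirmed", "replay_stable": True},
    {"cap": "sqli", "lang": "java", "status": "Confirmed", "replay_stable": 1},
    {"cap": "sqli", "lang": "java", "status": "Unsupported"},
]
print(tally(sample))
```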

@@ -1,14 +1,13 @@
-//! Phase 27 — Track H.1 integration test.
+//! Dynamic telemetry schema tests.
 //!
-//! Locks in the on-disk telemetry schema contract that `scripts/m7_ship_gate.sh`
-//! Gate 2 relies on:
+//! Locks in the on-disk telemetry schema contract:
 //!
 //! - Records produced today carry the `schema_version`, `nyx_version`, and
 //! `corpus_version` envelope fields, plus a `kind` discriminator.
 //! - `read_events(path)` accepts the current schema.
 //! - A hand-crafted record with `schema_version: 0` is rejected by
-//! `read_events` with a typed [`TelemetryReadError::SchemaMismatch`] (this
-//! is the explicit Phase 27 acceptance bullet).
+//! `read_events` with a typed [`TelemetryReadError::SchemaMismatch`] (this
+//! is the required failure mode for mixed-schema logs).
 //! - The sampling policy retains Confirmed and Inconclusive verdicts even at
 //! `sample_rate_other = 0.0`.
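The contract in this doc comment can be illustrated with a small Python analogue. This is a sketch under stated assumptions, not the Rust implementation: it assumes one JSON record per line, a current schema version of `1`, and a hypothetical `should_retain` helper for the sampling rule. Only the names `read_events`, `schema_version`, `kind`, `SchemaMismatch`, and `sample_rate_other` come from the source.

```python
import json
import random

CURRENT_SCHEMA_VERSION = 1  # assumption: the real value is not shown in the diff


class SchemaMismatch(Exception):
    """Typed error for a record whose schema_version is not the current one."""

    def __init__(self, found, expected):
        super().__init__(f"schema_version {found!r}, expected {expected}")
        self.found = found
        self.expected = expected


def read_events(lines):
    """Parse JSON-lines records, rejecting any with a different schema version."""
    events = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        found = record.get("schema_version")
        if found != CURRENT_SCHEMA_VERSION:
            raise SchemaMismatch(found, CURRENT_SCHEMA_VERSION)
        events.append(record)
    return events


def should_retain(status, sample_rate_other, rng=random.random):
    """Always keep Confirmed and Inconclusive; sample everything else."""
    if status in ("Confirmed", "Inconclusive"):
        return True
    # rng() is in [0.0, 1.0), so a rate of 0.0 never keeps other verdicts.
    return rng() < sample_rate_other


good = '{"schema_version": 1, "kind": "verdict"}'
print(read_events([good]))
print(should_retain("Confirmed", 0.0), should_retain("Unsupported", 0.0))
```

Raising on the first mismatched record, rather than skipping it, matches the stated goal of failing loudly on mixed-schema logs.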