Mirror of https://github.com/elicpeter/nyx.git (synced 2026-06-15 20:05:13 +02:00)
[pitboss/grind] deferred session-0002 (20260521T143544Z-f898)
Parent: be4021d8c0
Commit: b3766311fb
20 changed files with 388 additions and 664 deletions
@@ -1,17 +1,10 @@
-# Phase 31: ratchet values set to the headline targets.
+# Eval corpus budget.
 #
-# These are the published acceptance numbers behind the dynamic-verification
-# overhaul (see `docs/dynamic.md` "Headline metrics"). The ratchet schedule
-# from Phase 29 collapsed into a single target row: every (cap, lang) cell is
-# now gated against the same headline thresholds. Per-cell carve-outs were
-# dropped in Phase 31; if a cell is still wider than these numbers in practice
-# it shows up as a per-cell `FAIL` in `report.py` and as a gate-1 failure in
-# `scripts/m7_ship_gate.sh`, which is the intended forcing function for the
-# remaining engine follow-ups tracked in `.pitboss/play/deferred.md`.
+# `report.py` enforces these values when `run.sh` or `run_full.sh` pass
+# `--budget`. Each (cap, lang) cell uses the default row unless a specific
+# override appears below.
 #
-# Wall-clock cost (≤ 2× static-only) is enforced separately by Gate 3 of
-# `scripts/m7_ship_gate.sh` against `benches/fixtures/`; it is not a per-cell
-# budget knob and has no entry in this file.
+# Wall-clock cost is measured separately from this per-cell budget.
 #
 # Schema:
 #
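The "default row unless a specific override appears" rule in the new comment can be sketched as below. This is a minimal Python illustration, not the logic in `report.py`: the schema is cut off in this hunk, so the field names, the dictionary layout, and the example cell `("sqli", "java")` are all hypothetical. The default values reuse the headline targets quoted in the old `run_full.sh` comment (Unsupported < 20%, FalseConfirmed < 2%, Repro stability >= 95%).

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (cap, lang)

# Hypothetical default row; the real budget schema is not shown in this diff.
DEFAULT_ROW: Dict[str, float] = {
    "max_unsupported_rate": 0.20,
    "max_false_confirmed_rate": 0.02,
    "min_repro_stability": 0.95,
}


def resolve_budget(cell: Cell, overrides: Dict[Cell, Dict[str, float]]) -> Dict[str, float]:
    """Return the budget for one cell: the default row, patched by any override."""
    row = dict(DEFAULT_ROW)              # copy so the default row is never mutated
    row.update(overrides.get(cell, {}))  # an override replaces only the keys it names
    return row


# Hypothetical override for a single cell.
overrides = {("sqli", "java"): {"max_unsupported_rate": 0.30}}
```

A cell without an override gets an unmodified copy of the default row, and an override only needs to list the limits that differ.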
@@ -1,23 +1,23 @@
 #!/usr/bin/env bash
-# Eval corpus runner for M7 pre-flip gate calibration.
+# Eval corpus runner.
 #
 # Usage:
 # tests/eval_corpus/run.sh [--output DIR] [--nyx BIN] [--sets owasp,sard,inhouse]
 #
-# Bootstraps OWASP Benchmark v1.2, NIST SARD subset, and in-house
-# bughunt-curated fixtures. Runs `nyx scan --verify` on each. Emits
+# Bootstraps OWASP Benchmark v1.2, the NIST SARD subset, and Nyx benchmark
+# fixtures. Runs `nyx scan --verify` on each. Emits
 # per-cell (cap x language) precision/recall table and per-cap Unsupported
 # rate to stdout (and --output DIR if given).
 #
 # Environment:
-# NYX_EVAL_CORPUS_DIR — path to pre-downloaded corpus roots
+# NYX_EVAL_CORPUS_DIR - path to pre-downloaded corpus roots
 # (default: ~/.cache/nyx/eval_corpus)
-# NYX_BIN — path to nyx binary (default: ./target/release/nyx)
+# NYX_BIN - path to nyx binary (default: ./target/release/nyx)
 #
 # Exit codes:
-# 0 — all gate thresholds met
-# 1 — setup or I/O error
-# 2 — one or more gate thresholds exceeded (see output for details)
+# 0 - all budget thresholds met
+# 1 - setup or I/O error
+# 2 - one or more budget thresholds exceeded (see output for details)
 
 set -euo pipefail
 
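The header comment mentions a per-cell precision/recall table and a per-cap Unsupported rate. The diff does not show how `report.py` computes them, so the sketch below assumes the textbook definitions and uses hypothetical function names; treat it as an illustration of the metrics, not of the script.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported findings that are real: TP / (TP + FP)."""
    reported = true_pos + false_pos
    return true_pos / reported if reported else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real findings that were reported: TP / (TP + FN)."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


def unsupported_rate(unsupported: int, total: int) -> float:
    """Share of verdicts for one cap that came back Unsupported (assumed definition)."""
    return unsupported / total if total else 0.0
```

Returning 0.0 for an empty cell is a choice made for this sketch; a real report might prefer to mark such cells as not applicable.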
@@ -173,9 +173,8 @@ python3 "${SCRIPT_DIR}/report.py" \
     ${DIFF_FILE:+--diff "$DIFF_FILE"}
 REPORT_RC=$?
 set -e
-# Propagate gate-fail (exit 2) and malformed-config (exit 3) so the
-# m7_ship_gate.sh Gate-1 dispatch can tell them apart. Treat other
-# non-zero as setup error (exit 1).
+# Propagate budget failures (exit 2) and malformed config (exit 3). Treat other
+# non-zero exits as setup errors.
 if [[ $REPORT_RC -eq 2 ]]; then
     exit 2
 elif [[ $REPORT_RC -eq 3 ]]; then
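The exit-code rule stated in the comment (pass 2 and 3 through, collapse any other non-zero status to 1) can be written as a small pure function. This is a Python restatement for clarity, assuming the truncated `elif` branch exits with 3 as the comment says; the function name is hypothetical and the shell script remains the source of truth.

```python
def propagate_exit_code(report_rc: int) -> int:
    """Map the report's exit status onto the runner's exit status."""
    if report_rc in (0, 2, 3):
        # 0 = success, 2 = budget failure, 3 = malformed config: pass through.
        return report_rc
    # Any other non-zero status is treated as a setup error.
    return 1
```

Keeping 2 and 3 distinct lets a caller tell "the numbers regressed" apart from "the budget file could not be parsed".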
@@ -1,12 +1,10 @@
 #!/usr/bin/env bash
-# Phase 31: full eval-corpus orchestrator.
+# Full eval-corpus orchestrator.
 #
 # Drives a complete pass against every corpus set the project knows about
-# (OWASP Benchmark v1.2, the NIST SARD subset, and the in-house bughunt
-# fixtures), then emits a stable `tests/eval_corpus/results.json` so
-# downstream consumers (M7 ship gate, monotonic-improvement diff, the
-# headline metrics table in `docs/dynamic.md`) can read a single
-# well-known path.
+# (OWASP Benchmark v1.2, the NIST SARD subset, and the Nyx benchmark
+# fixtures), then emits `tests/eval_corpus/results.json` for reports,
+# diffs, and docs.
 #
 # Usage:
 # tests/eval_corpus/run_full.sh [--nyx BIN] [--budget FILE] [--diff FILE]
@@ -15,11 +13,9 @@
 # Differences vs `run.sh`:
 # * Always runs every set (no `--sets` selector).
 # * Always passes `--budget tests/eval_corpus/budget.toml` so the
-# headline targets (Unsupported < 20%, FalseConfirmed < 2%, Repro
-# stability >= 95%) gate every pass.
+# configured per-cell limits are checked on every pass.
 # * Copies the timestamped results file to
-# `tests/eval_corpus/results.json` (canonical path consumed by
-# `scripts/m7_ship_gate.sh` and the published metrics doc).
+# `tests/eval_corpus/results.json`.
 #
 # Exit codes:
 # 0 every set ran and the merged result met the per-cell budget.
@@ -415,8 +415,8 @@ def main() -> int:
 elif status == "Confirmed":
     cells[key]["confirmed"] += 1
     # Repro-stability and false-Confirmed counts are optional
-    # fields tabulate.py reads off the verdict when callers
-    # (m7_ship_gate.sh / corpus_promote.yml) have stamped them.
+    # fields tabulate.py reads off the verdict when callers have
+    # stamped them.
     if dv.get("wrong") is True:
         cells[key]["wrong_confirmed"] += 1
     if dv.get("replay_stable") is True:
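The tabulation pattern in this hunk, counting optional fields only when a caller has stamped them, can be shown in a self-contained form. The sketch below is an assumption-laden illustration: the flat verdict layout (`cap`, `lang`, `status`) and the `replay_stable` counter name are hypothetical, since the hunk is truncated before the counter update; only `confirmed`, `wrong_confirmed`, and the `dv.get(...) is True` checks come from the diff.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tally(verdicts: Iterable[dict]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Count Confirmed verdicts per (cap, lang) cell, plus optional stamped fields."""
    cells = defaultdict(
        lambda: {"confirmed": 0, "wrong_confirmed": 0, "replay_stable": 0}
    )
    for dv in verdicts:
        key = (dv["cap"], dv["lang"])  # hypothetical verdict layout
        if dv["status"] == "Confirmed":
            cells[key]["confirmed"] += 1
            # `is True` ignores missing fields and truthy non-boolean values,
            # so unstamped verdicts never inflate the optional counters.
            if dv.get("wrong") is True:
                cells[key]["wrong_confirmed"] += 1
            if dv.get("replay_stable") is True:
                cells[key]["replay_stable"] += 1
    return dict(cells)
```

Comparing with `is True` rather than relying on truthiness is what makes the fields safely optional: a verdict with no `wrong` key, or with `wrong: "yes"`, is counted as Confirmed but not as wrong.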
@@ -1,14 +1,13 @@
-//! Phase 27 — Track H.1 integration test.
+//! Dynamic telemetry schema tests.
 //!
-//! Locks in the on-disk telemetry schema contract that `scripts/m7_ship_gate.sh`
-//! Gate 2 relies on:
+//! Locks in the on-disk telemetry schema contract:
 //!
 //! - Records produced today carry the `schema_version`, `nyx_version`, and
 //! `corpus_version` envelope fields, plus a `kind` discriminator.
 //! - `read_events(path)` accepts the current schema.
 //! - A hand-crafted record with `schema_version: 0` is rejected by
 //! `read_events` with a typed [`TelemetryReadError::SchemaMismatch`] (this
-//! is the explicit Phase 27 acceptance bullet).
+//! is the required failure mode for mixed-schema logs).
 //! - The sampling policy retains Confirmed and Inconclusive verdicts even at
 //! `sample_rate_other = 0.0`.
 
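The contract listed in the doc comment can be modelled in a few lines. The sketch below is a Python analogue for illustration only, not the Rust API under test: `CURRENT_SCHEMA_VERSION = 1` is an assumed value, `read_events` here takes an iterable of JSON lines rather than a path, and `should_retain` is a hypothetical name for the sampling decision.

```python
import json
import random
from typing import Callable, Iterable, List

CURRENT_SCHEMA_VERSION = 1  # assumed value; the real constant is not shown in the diff


class SchemaMismatch(Exception):
    """Raised when a record's schema_version differs from the current one."""


def read_events(lines: Iterable[str]) -> List[dict]:
    """Parse JSON-lines telemetry, rejecting any record with a foreign schema."""
    events = []
    for line in lines:
        record = json.loads(line)
        if record.get("schema_version") != CURRENT_SCHEMA_VERSION:
            raise SchemaMismatch(f"unexpected schema_version: {record.get('schema_version')!r}")
        events.append(record)
    return events


def should_retain(verdict: str, sample_rate_other: float,
                  rng: Callable[[], float] = random.random) -> bool:
    """Always keep Confirmed and Inconclusive; sample everything else."""
    if verdict in ("Confirmed", "Inconclusive"):
        return True
    # rng() is in [0.0, 1.0), so a rate of 0.0 never retains other verdicts.
    return rng() < sample_rate_other
```

Failing on the first mismatched record, rather than skipping it, mirrors the "required failure mode for mixed-schema logs" wording: a log that mixes schemas is rejected as a whole.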