# Dynamic verification Nyx verifies every `Confidence >= Medium` finding by default: it builds a minimal harness, runs your code's entry point against a curated payload corpus inside a sandbox, and records the verdict in each finding's evidence block. ## Headline metrics The dynamic-verification overhaul ships with four published acceptance targets, gated end-to-end by `scripts/m7_ship_gate.sh` (Phase 31) against the eval corpus (OWASP Benchmark v1.2 + NIST SARD subset + the in-house curated set from `tests/benchmark/corpus`): | Metric | Target | Gate | Source | | --- | --- | --- | --- | | Unsupported% per `(cap, lang)` cell | < 20% | M7 Gate 1 | `tests/eval_corpus/budget.toml` → `[default].unsupported_rate` | | False-Confirmed% per cap | < 2% | M7 Gate 2 | `~/.cache/nyx/dynamic/events.jsonl` (`kind: feedback`, `wrong: true`) | | Repro stability | ≥ 95% | M7 Gate 5 | `~/.cache/nyx/dynamic/repro/*/reproduce.sh` exit 0 | | Wall-clock cost | ≤ 2× static-only | M7 Gate 3 | `benches/fixtures/` (default vs `--no-verify`) | The corresponding orchestrator is `tests/eval_corpus/run_full.sh`; it bundles the three corpus sets, writes a canonical `tests/eval_corpus/results.json`, and propagates the per-cell budget through `tabulate.py` and `report.py`. A non-zero exit from `m7_ship_gate.sh` is a hard merge blocker for the default-on flip. Failures map back to the engine follow-ups recorded in `.pitboss/play/deferred.md` (per-language probe-shim splicing, composite chain reverifier wiring, telemetry-stability stamping, et al.). ## Default-on semantics ``` nyx scan # verifies Medium+ findings (default) nyx scan --no-verify # static analysis only, no harness execution nyx scan --verify # same as default; explicit for clarity in scripts ``` `--no-verify` is the escape hatch. It overrides the config default for a single run without changing `nyx.toml`. ### What "verified" means A finding with `dynamic_verdict.status: Confirmed` was successfully triggered by at least one payload in nyx's corpus. The corpus covers common patterns for each vulnerability class (SQL injection, XSS, command injection, SSRF, etc.) per language. A finding with `dynamic_verdict.status: NotConfirmed` was attempted but no payload fired. This is not a false-positive signal — it means the corpus did not have a payload that matched the specific sink variant or the execution path was not reachable in the test harness. A finding with `dynamic_verdict.status: Unsupported` could not be attempted. Common reasons: confidence below threshold, no flow steps, language or sink type not yet supported by the harness layer. ### Confidence gate Only `Confidence >= Medium` findings are verified by default (§5.1). To also verify low-confidence findings — for corpus building or backfill — pass `--verify-all-confidence`: ``` nyx scan --verify-all-confidence ``` This is not recommended for production scans because low-confidence findings have a higher false-positive rate and the harness may produce unreliable verdicts. ## nyx.toml opt-out If you want static-only scans permanently, set `verify = false` in `nyx.toml`: ```toml [scanner] verify = false ``` This survives upgrades — the M7 default flip only changes the inherited default for projects that have not explicitly set the field. ## Sandbox backends nyx uses docker when available, then falls back to an in-process runner: ``` nyx scan --backend docker # require docker; fail if unavailable nyx scan --backend process # in-process runner (no container; less isolation) nyx scan --unsafe-sandbox # alias for --backend process ``` The docker backend mounts only the entry file's directory and blocks all outbound network by default. When out-of-band detection is enabled (`oob_listener` in config), the container gets `--network bridge` with a host-gateway route. ## Repro artifacts When a finding is `Confirmed`, nyx writes a repro artifact to `~/.cache/nyx/repro//`. The artifact contains the harness spec and the triggering payload. You can regenerate the verdict with: ``` nyx scan --verify # re-scans and re-verifies ``` See `docs/output.md` for the `dynamic_verdict` field schema. ## Wall-clock cost Verification adds harness build + sandbox startup time per finding. On typical codebases with 10–50 Medium+ findings, end-to-end overhead is 2–5× static-only. If scan time is unacceptable for a given workflow (e.g. IDE integration, quick pre-commit check), use `--no-verify` for that workflow and rely on the full scan in CI. ## Event schema The dynamic layer writes one JSON record per verdict to `~/.cache/nyx/dynamic/events.jsonl`. Every record begins with a fixed envelope so older readers fail loudly instead of silently mixing incompatible shapes: ```json { "schema_version": 1, "nyx_version": "0.7.0", "corpus_version": "4", "kind": "verdict", "ts": "2026-05-15T18:42:09Z", "finding_id": "a3b1...", "spec_hash": "9f4e...", "lang": "python", "cap": "SQL_QUERY", "status": "Confirmed", "toolchain_id": "python-3.11", "toolchain_match": "exact", "duration_ms": 312, "build_attempts": 1 } ``` | Field | Type | Meaning | | --- | --- | --- | | `schema_version` | integer | Bumped on any breaking change. Readers reject mismatches. | | `nyx_version` | string | `CARGO_PKG_VERSION` of the writing binary. | | `corpus_version` | string | Payload-corpus version the verdict was scored against. | | `kind` | string | `"verdict"` (per-finding) or `"rank_delta"` (rank-score shift). | | `ts` | RFC-3339 string | Wall-clock at write time. | | `finding_id` | string | Stable finding identifier. | | `spec_hash` | string | Hash of the `HarnessSpec` that drove the run. | | `lang` | string | Language slug; `"unknown"` when spec derivation failed. | | `cap` | string | Sink capability (e.g. `SQL_QUERY`, `CODE_EXEC`). | | `status` | string | `Confirmed`, `NotConfirmed`, `Inconclusive`, or `Unsupported`. | | `inconclusive_reason` | string | Present iff `status == Inconclusive`. | A `rank_delta` record carries the envelope plus `finding_id`, `status`, and a signed `delta` applied to the rank score. ### Schema-version mismatch `scripts/m7_ship_gate.sh` Gate 2 walks every line of the log, requires `schema_version == EXPECTED_SCHEMA_VERSION`, and exits 3 if any record fails the check. Programmatic readers use `crate::dynamic::telemetry::read_events(path)`, which surfaces the same condition as `TelemetryReadError::SchemaMismatch { expected, found, .. }`. When schema bumps land, the canonical migration is to roll the log over (move or delete `events.jsonl`) so new and old records never coexist in a file. The gate refuses to skip silently on mismatch. ### Sampling `[telemetry]` in `nyx.toml` controls the on-disk sampling policy: ```toml [telemetry] keep_all_confirmed = true # default: retain every Confirmed verdict keep_all_inconclusive = true # default: retain every Inconclusive verdict sample_rate_other = 1.0 # 0.0–1.0 for NotConfirmed / Unsupported ``` `sample_rate_other < 1.0` downsamples NotConfirmed and Unsupported verdicts deterministically — the decision is seeded by the finding's `spec_hash`, so a given finding makes the same keep-or-drop call across reruns. Confirmed and Inconclusive verdicts ignore the rate and are always retained (they gate the false-Confirmed budget and drive the spec-derivation roadmap). Rank-delta records (emitted by `emit_rank_delta` when a verdict shifts a finding's position in the ranked output) are also retained unconditionally and do **not** consult `sample_rate_other`. They are calibration-critical and small in volume, so the carve-out is intentional; setting `sample_rate_other = 0.0` to throttle log growth will still produce rank-delta lines. `NYX_NO_TELEMETRY=1` disables every write regardless of the policy. ## Opting in to feedback False positives (nyx says `Confirmed` but you disagree) can be recorded: ``` nyx verify-feedback --wrong "reason" ``` This writes to the local telemetry log (`~/.cache/nyx/dynamic/events.jsonl`) and contributes to precision monitoring. Feedback is never uploaded automatically. ## nyx serve integration The browser UI shows `dynamic_verdict` in each finding's detail panel and uses the verdict in ranking (Confirmed findings surface first). The scan compare page has a **Verdict Diff** tab that shows which findings changed verification status between two scans.