nyx/tests/eval_corpus/ground_truth

Ground truth files

Place corpus ground truth JSON files here before running tests/eval_corpus/run.sh.

OWASP Benchmark v1.2

File: owasp_benchmark_v1.2.json (checked in; complete — one record per BenchmarkTest file, 2740 total).

Format:

[
  {"path": "src/main/java/org/owasp/.../BenchmarkTest00001.java", "line": 0, "cap": "sqli", "vuln": true},
  ...
]

path is relative to the corpus root (the BenchmarkJava clone), with POSIX separators. tabulate.py suffix-matches it against the absolute paths nyx emits, so the committed JSON is portable: it matches whether the corpus lives at ~/.cache/nyx/eval_corpus/owasp_benchmark_v1.2 on a laptop or at a CI checkout path. line is 0 (the expected-results CSV does not pin a line; matching falls back to file+cap).
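The suffix match described above can be sketched as follows. This is a minimal illustration, not the code in tabulate.py: the function name suffix_matches and the example paths are hypothetical, and the sketch assumes matching is done on whole path components.

```python
from pathlib import PurePosixPath


def suffix_matches(emitted_path: str, gt_path: str) -> bool:
    """Return True if gt_path equals the trailing components of emitted_path.

    emitted_path: absolute path as reported by the scanner (assumed).
    gt_path: corpus-relative POSIX path from the ground truth JSON.
    """
    # Normalise Windows separators so both sides are compared as POSIX paths.
    emitted = PurePosixPath(emitted_path.replace("\\", "/")).parts
    wanted = PurePosixPath(gt_path).parts
    if not wanted or len(wanted) > len(emitted):
        return False
    # Compare whole components, so "xExample.java" never matches "Example.java".
    return emitted[-len(wanted):] == wanted
```

Comparing components rather than raw strings avoids accidental matches on file names that merely end with the same characters. The same ground truth record then matches a laptop cache path and a CI checkout path alike.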

Regenerate from expectedresults-1.2beta.csv shipped with the benchmark repo:

python3 tests/eval_corpus/owasp_gt_convert.py \
    --corpus-dir ~/.cache/nyx/eval_corpus/owasp_benchmark_v1.2 \
    --output     tests/eval_corpus/ground_truth/owasp_benchmark_v1.2.json

NIST SARD subset

File: nist_sard.json

Same format as owasp_benchmark_v1.2.json. Source: the SARD manifest XML, converted with python3 tests/eval_corpus/sard_gt_convert.py.

OWASP NodeGoat / OWASP Juice Shop (JS/TS — Track R.1)

Files: nodegoat.json (Express, .js), juiceshop.json (TypeScript, .ts). Same four-field format as above; all records are vuln: true.

These two apps are intentionally vulnerable end to end, so, unlike OWASP Benchmark, they ship no machine-readable per-file vuln labels and have no benign-control files to pair against. The authoritative source is a curated TOML manifest committed here, with one [[entry]] per known-vulnerable handler and a note explaining why it is labelled:

  • nodegoat.manifest.toml
  • juiceshop.manifest.toml

manifest_gt_convert.py turns a manifest into the committed .json:

python3 tests/eval_corpus/manifest_gt_convert.py \
    --manifest tests/eval_corpus/ground_truth/nodegoat.manifest.toml \
    --output   tests/eval_corpus/ground_truth/nodegoat.json

Pass --corpus-dir <clone> to validate every labelled path against a real checkout. The converter exits non-zero if any path is missing, so a corpus bump that moves a handler fails loudly instead of silently dropping recall. CI (.github/workflows/eval.yml, jsts job) regenerates each .json against a fresh clone of the pinned ref and asserts it matches the committed file.

Because the manifests label canonical vulns only, recall (did nyx catch the known vulns?) is the meaningful metric; precision against this partial ground truth is informational only. Gate 7 publishes per-cap precision/recall/confirmed report-only by default (NYX_JSTS_FLOOR_CAPS empty), matching the OWASP gate.
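To see why precision is only informational against a partial ground truth, consider this minimal per-cap tally. It is a sketch under assumptions, not tabulate.py: per_cap_metrics is a hypothetical name, and findings are assumed to be already normalised to (path, cap) pairs.

```python
from collections import defaultdict


def per_cap_metrics(ground_truth, findings):
    """Compute per-cap precision and recall.

    ground_truth: list of records with path, cap and vuln fields.
    findings: set of (path, cap) pairs the scanner reported (assumed shape).
    """
    labelled = {(r["path"], r["cap"]) for r in ground_truth if r["vuln"]}
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for path, cap in findings:
        # A finding outside the manifest counts as "fp" here even though it
        # may be a real but unlabelled vuln; that is what skews precision.
        stats[cap]["tp" if (path, cap) in labelled else "fp"] += 1
    for path, cap in labelled - set(findings):
        stats[cap]["fn"] += 1
    metrics = {}
    for cap, s in stats.items():
        reported = s["tp"] + s["fp"]
        known = s["tp"] + s["fn"]
        metrics[cap] = {
            "precision": s["tp"] / reported if reported else None,
            "recall": s["tp"] / known if known else None,
        }
    return metrics
```

Recall only depends on the labelled entries, so it stays trustworthy when the manifest is incomplete. Precision drops for every unlabelled finding, whether or not that finding is a genuine vuln.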

Polyglot real corpora (Ruby/PHP/Python/Go/Rust — Track R.2)

Phase 29 wires the remaining language families into the same machinery, one corpus per family, each with a curated *.manifest.toml → committed *.json:

  • railsgoat.{manifest.toml,json} — OWASP RailsGoat (Rails, .rb).
  • dvwa.{manifest.toml,json} — Damn Vulnerable Web Application (PHP). DVWA ships graded source variants (source/{low,impossible}.php), so this is the one Track R corpus besides OWASP with real vuln/benign pairs (low.php = vuln, impossible.php = benign control) — precision is meaningful here, not just informational.
  • dvpwa.{manifest.toml,json} — Damn Vulnerable Python Web App (aiohttp, .py). Its parameterized DAO siblings are benign controls for the one %-formatted SQL sink.
  • gosec.{manifest.toml,json} — the gosec Go SAST tool repo; the scannable, // want-annotated sample under goanalysis/testdata is the curated ground truth (gosec's string-embedded rule samples are not scannable, so they are deliberately unlabelled).
  • rustsec.{manifest.toml,json} — RustSec advisory-db, a negative control. advisory-db ships advisory metadata, not vulnerable .rs source, so its committed ground truth is empty ([]) by construction. The manifest sets negative_control = true (mutually exclusive with [[entry]] tables); manifest_gt_convert.py emits the empty JSON and the row asserts the Rust scan/verify path runs at scale within wall-clock and Confirms nothing there (any Confirmed Rust finding is a false confirm).
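The negative-control rule for rustsec can be sketched as a small check on the parsed manifest. This is an illustration only: expected_records and false_confirms are hypothetical names, the manifest is assumed to be already parsed into a dict (for example by tomllib), and the status field on findings is an assumed shape.

```python
def expected_records(manifest: dict) -> list:
    """Return the ground truth a manifest implies.

    A negative-control manifest yields an empty list and must not
    also carry [[entry]] tables.
    """
    is_negative = manifest.get("negative_control", False)
    entries = manifest.get("entry", [])
    if is_negative and entries:
        raise ValueError("negative_control = true is mutually exclusive with [[entry]] tables")
    return [] if is_negative else list(entries)


def false_confirms(findings: list) -> list:
    """On a negative control, every Confirmed finding is a false confirm.

    The "status" key and the "Confirmed" value are assumptions for this sketch.
    """
    return [f for f in findings if f.get("status") == "Confirmed"]
```

Rejecting a manifest that sets both negative_control and entries keeps the empty JSON meaningful: an empty result always means "nothing should be confirmed here", never "labels were forgotten".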

These are converted, validated and asserted-in-sync exactly like NodeGoat / Juice Shop (the polyglot job in .github/workflows/eval.yml). Because each corpus targets a single language, Gate 8 scopes tabulation to that language (tabulate.py --lang) so the vendored third-party JavaScript these Ruby / Python apps bundle does not pollute their per-cap metrics. Gate 8 publishes per-cap precision/recall/confirmed report-only by default (NYX_POLYGLOT_FLOOR_CAPS empty), matching the OWASP and JS/TS gates. See tests/eval_corpus/budget.toml for the per-(cap,lang) gate policy.
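Language scoping of this kind can be approximated by filtering findings on file extension before tabulating. The sketch below is an assumption about how such a filter might look, not the behaviour of tabulate.py --lang: the LANG_EXTENSIONS table and scope_to_lang are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping; the real tool may identify languages differently.
LANG_EXTENSIONS = {
    "ruby": {".rb"},
    "php": {".php"},
    "python": {".py"},
    "go": {".go"},
    "rust": {".rs"},
}


def scope_to_lang(findings: list, lang: str) -> list:
    """Keep only findings whose path has an extension belonging to lang."""
    extensions = LANG_EXTENSIONS[lang]
    return [f for f in findings if PurePosixPath(f["path"]).suffix in extensions]
```

With this filter, a finding in a vendored .js file bundled with RailsGoat would be dropped before the Ruby per-cap metrics are computed, so it neither inflates nor deflates them.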