Recall validation runbook
The recall-validation harness freezes a finding-shape baseline against real-world OSS targets so that future engine work can demonstrate that it actually lifts recall on real code, not merely that the tests pass. This runbook covers re-running the validation against a fresh OSS release.
Targets
| Target | Clone URL | Recall items exercised |
|---|---|---|
| cal_com | https://github.com/calcom/cal.com | 1, 5, 6, 7 |
| vercel_commerce | https://github.com/vercel/commerce | 1, 4, 7 |
| shadcn_examples | https://github.com/shadcn-ui/ui | 4, 7 |
| blitz_apps | https://github.com/blitz-js/blitz | 1, 3, 6 |
Item numbering is from .pitboss/RECALL_GAPS.md.
Files
| File | Role |
|---|---|
| scripts/validate_recall.sh | runner (capture + diff modes) |
| tests/recall_targets/<target>.json | per-target baseline |
| tests/recall_gaps.rs::validate_real_world_targets | schema-validity test (#[ignore]) |
| tests/recall_gaps_baseline.json | corpus regression baseline |
Baselines live next to the harness rather than under .pitboss/:
pitboss implementer agents are forbidden to write under .pitboss/,
so the baseline files were placed beside the test that consumes them.
Baseline schema
{
  "_doc": "...",
  "target": "cal_com",
  "clone_url": "https://github.com/calcom/cal.com",
  "exercises_recall_items": [1, 5, 6, 7],
  "captured_against": "real-scan @ <sha>",
  "captured_on": "YYYY-MM-DD",
  "pinned_commit": "<sha>",
  "findings": [
    {
      "rule_id": "taint-unsanitised-flow",
      "path_suffix": "packages/...",
      "line": 130,
      "severity": "High",
      "verdict": "TP" | "FP" | "needs_review",
      "note": "..."
    }
  ]
}
The diff key is (rule_id, path_suffix, line). The verdict field
must be one of TP, FP, or needs_review; unknown verdicts are
rejected by the schema test.
Usage
Diff a fresh scan against the frozen baseline
scripts/validate_recall.sh cal_com /path/to/cal.com
Output is a JSON object { added, removed, unchanged, *_total }
keyed by rule_id. Use this to spot intentional recall lift
(added) and regressions (removed).
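
The diff can be pictured as set arithmetic over the diff key. The following is a minimal Python sketch of that idea, not the logic actually used by scripts/validate_recall.sh. The helper names (diff_key, diff_findings) and the exact report layout (per-rule_id counts plus *_total fields) are assumptions inferred from the description above.

```python
from collections import Counter


def diff_key(finding):
    """Return the (rule_id, path_suffix, line) tuple used to match findings."""
    return (finding["rule_id"], finding["path_suffix"], finding["line"])


def diff_findings(baseline, fresh):
    """Count added/removed/unchanged findings per rule_id.

    Hypothetical helper: the real runner's output layout may differ.
    """
    old = {diff_key(f) for f in baseline}
    new = {diff_key(f) for f in fresh}
    buckets = {
        "added": new - old,      # only in the fresh scan: candidate recall lift
        "removed": old - new,    # only in the baseline: candidate regression
        "unchanged": old & new,  # present in both
    }
    report = {}
    for name, keys in buckets.items():
        # Group by rule_id (first element of the key) and count.
        report[name] = dict(Counter(rule_id for rule_id, _, _ in keys))
        report[f"{name}_total"] = len(keys)
    return report
```

Because severity, verdict, and note are not part of the key, relabelling a finding does not show up as a change in this sketch.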
Refresh the baseline after an intentional recall lift
scripts/validate_recall.sh cal_com /path/to/cal.com --capture
This overwrites tests/recall_targets/cal_com.json with the current
scan output. Every finding is re-marked verdict: "needs_review";
hand-label TP/FP afterwards as you triage.
Schema-validity check
cargo test --release --test recall_gaps -- --ignored validate_real_world_targets
Loads each per-target JSON, asserts the required keys exist, and asserts every finding carries a valid verdict label.
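
For readers who want to run the same kind of check outside cargo, here is a rough Python equivalent. It is a sketch only: the set of required keys is an assumption taken from the schema example above (with _doc treated as optional), and validate_baseline is a hypothetical name, not something defined in tests/recall_gaps.rs.

```python
# Assumed required keys, copied from the baseline schema example.
REQUIRED_KEYS = {
    "target", "clone_url", "exercises_recall_items",
    "captured_against", "captured_on", "pinned_commit", "findings",
}
VALID_VERDICTS = {"TP", "FP", "needs_review"}


def validate_baseline(doc):
    """Return a list of problems found in one per-target baseline dict."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(doc))]
    for i, finding in enumerate(doc.get("findings", [])):
        verdict = finding.get("verdict")
        if verdict not in VALID_VERDICTS:
            problems.append(f"findings[{i}]: invalid verdict {verdict!r}")
    return problems
```

An empty return value means the baseline passed. A placeholder with an empty findings list passes trivially, since the loop has nothing to check.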
Refresh procedure
- Clone or pull the target repo into ~/oss/<target> (or wherever).
- Build nyx: cargo build --release.
- Run the diff in plain mode to see what changed: scripts/validate_recall.sh <target> ~/oss/<target>.
- If the lift is intentional, recapture: scripts/validate_recall.sh <target> ~/oss/<target> --capture.
- Spot-check a handful of new findings. Open the file at path_suffix:line and confirm the source-to-sink flow is real. Hand-label them TP/FP.
- Commit the updated tests/recall_targets/<target>.json.
Known captured baselines (2026-05-08)
| Target | Pinned commit | Findings | TP | FP | needs_review |
|---|---|---|---|---|---|
| cal_com | d278d6c9 | 662 | 0 | 4 | 658 |
| vercel_commerce | unknown | 0 (placeholder) | | | |
| shadcn_examples | unknown | 0 (placeholder) | | | |
| blitz_apps | unknown | 0 (placeholder) | | | |
The cal_com capture used commit d278d6c9bc535bf3f2c6ba0607654f78dd74d6ee
(refactor: remove dead insights references (#29029)). The 4 FP
labels are ts.crypto.math_random hits inside apps/web/playwright/
test fixtures, which are not a security context.
The other three targets ship as placeholders (empty findings).
Nobody has cloned them locally yet. Run validate_recall.sh <target> <clone> --capture to populate. The schema test still passes
because [] is a valid findings array with zero entries to check.
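
The per-verdict columns in the table above can be recomputed from a baseline file after a triage pass. A minimal sketch, assuming the baseline has already been loaded into a dict with the schema shown earlier (verdict_summary is a hypothetical helper, not part of the harness):

```python
from collections import Counter


def verdict_summary(doc):
    """Count findings per verdict for one loaded baseline dict."""
    counts = Counter(f["verdict"] for f in doc["findings"])
    return {
        "findings": len(doc["findings"]),
        "TP": counts["TP"],            # Counter returns 0 for absent labels
        "FP": counts["FP"],
        "needs_review": counts["needs_review"],
    }
```

Note that this counts raw entries in the findings array, so it does not collapse duplicates that share a diff key.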
Perf baseline
The frozen JS-target perf snapshot lives in
tests/recall_targets/perf_after.txt. Compare against the
captured_against snapshot in tests/recall_gaps_baseline.json
(corpus_finding_lines.findings_total = 1121, captured at master
ea82ea98). The acceptance bar: scanner throughput on the existing
tests/fixtures/ corpus must regress by no more than 15%. Future
recall work uses the same corpus and the same record file to measure
its own perf delta.
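
The 15% acceptance bar amounts to a single ratio check. The sketch below assumes throughput is expressed as a "higher is better" number (for example files per second); the unit and the function names are assumptions, and the format of perf_after.txt is not modelled here.

```python
MAX_REGRESSION = 0.15  # acceptance bar from this runbook: at most a 15% drop


def throughput_regression(baseline, current):
    """Fractional drop in throughput; a negative value means the scan got faster."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline - current) / baseline


def within_perf_bar(baseline, current, limit=MAX_REGRESSION):
    """True when the throughput drop does not exceed the limit."""
    return throughput_regression(baseline, current) <= limit
```

For instance, going from 100 to 90 units is a 10% drop and passes, while going from 100 to 80 is a 20% drop and fails.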
Cross-language runbook
The JS-target baselines above only cover JS/TS. Cross-language
baselines mirror that work against real-world non-JS targets so
multi-language engine changes can be measured against actual code,
not just synthetic fixtures. Per-lang baselines live under
tests/recall_targets/xlang/<lang>/<target>.json and the runner
accepts a --lang flag to select the target set.
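
The two baseline layouts can be summarised in a small path helper. This is an illustrative sketch; baseline_path is a hypothetical function and only mirrors the directory conventions stated in this runbook.

```python
from pathlib import Path

BASELINE_ROOT = Path("tests/recall_targets")


def baseline_path(target, lang=None):
    """Return the baseline file for a target; passing lang selects the xlang tree."""
    if lang is None:
        # JS-target layout: tests/recall_targets/<target>.json
        return BASELINE_ROOT / f"{target}.json"
    # Cross-language layout: tests/recall_targets/xlang/<lang>/<target>.json
    return BASELINE_ROOT / "xlang" / lang / f"{target}.json"
```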
Cross-language targets
| Lang | Target | Clone URL | Pinned commit (capture) | Findings | Notes |
|---|---|---|---|---|---|
| php | phpmyadmin | https://github.com/phpmyadmin/phpmyadmin | ddf4e993 | 119 | DBA UI; XSS / php.deser / cfg-unguarded-sink heavy. |
| php | joomla | https://github.com/joomla/joomla-cms | 7e8527d0 | 83 | CMS; php.deser.unserialize and php.path.include_variable clusters. |
| php | drupal | https://github.com/drupal/drupal | 92aa759e | 635 | CMS / DI container; cfg-unguarded-sink (198) and taint-prototype-pollution (121) dominant. |
| php | nextcloud | https://github.com/nextcloud/server | 5c0fe4c3 | 262 | File-sync platform; cfg-resource-leak / state-resource-leak heavy. |
| java | openmrs | https://github.com/openmrs/openmrs-core | f9c76db2 | 273 | Hibernate-heavy; JPA Criteria fix from project_realrepo_openmrs.md already applied. |
| python | airflow | https://github.com/apache/airflow | 3d42610a | 892 | Scheduler / DAG runner; cfg-unguarded-sink (252) and taint-unsanitised-flow (179) lead. |
| python | flask | https://github.com/pallets/flask | placeholder | 0 | Smaller-surface Python framework; capture deferred. |
| go | gin | https://github.com/gin-gonic/gin | d3ffc998 | 20 | HTTP framework test corpus; taint-header-injection and TLS skip-verify in tests. |
| rust | axum | https://github.com/tokio-rs/axum | placeholder | 0 | Not cloned in pitboss sandbox at capture time; populate locally. |
| ruby | rails | https://github.com/rails/rails | placeholder | 0 | Capture against the actionpack/ subtree once cloned. |
Captures dated 2026-05-09 (UTC). Counts are deduplicated tuples
(rule_id, path_suffix, line). Duplicate raw findings collapse on
the diff key, so the schema-test count and diff-mode unchanged_total
may differ from the findings | length total by a handful of
duplicate sites. The diff key is what matters for regression
detection.
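
The gap between the raw and deduplicated counts can be checked directly. A short sketch, assuming a findings list in the baseline schema (dedup_count is a hypothetical helper):

```python
def dedup_count(findings):
    """Return (raw_count, deduplicated_count) for a findings list."""
    # Findings that share rule_id, path_suffix, and line collapse to one key.
    keys = {(f["rule_id"], f["path_suffix"], f["line"]) for f in findings}
    return len(findings), len(keys)
```

When the two numbers differ, the difference is the number of duplicate sites that collapse on the diff key.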
Per-lang TP/FP splits
Every captured finding ships with verdict: "needs_review" from
--capture. Hand-triage is bounded but pending; none of the
cross-language captures are sweep-labelled yet. Use the per-lang
dominant rule_id clusters above as the priority queue:
- PHP: cfg-unguarded-sink and taint-prototype-pollution are the FP-dominant clusters across drupal / nextcloud / phpmyadmin (CMS routing + JS object construction). php.deser.unserialize is the highest-value TP cluster on joomla (17) and drupal (83). See project_realrepo_joomla.md 2026-05-03 for the magic-method passthrough fix that already filters one shape.
- Java: taint-unsanitised-flow (61) and state-resource-leak (60) are openmrs's leading clusters. The JPA Criteria-API fix already absorbed the cfg-unguarded-sink cluster (216 to 24); remaining Hibernate / Spring resource-management FPs are the next triage target.
- Python: cfg-unguarded-sink (252) on airflow is dominated by Airflow's scheduler / DB plumbing; py.auth.token_override_* (83) and py.auth.missing_ownership_check (61) are the auth-rule noise typical of an admin/operator codebase.
- Go: gin's 20 findings are mostly test-corpus artifacts (gin_test.go, routes_test.go); 4 of 4 go.transport.insecure_skip_verify hits are inside gin*_test.go and are legitimate test setup.
- Rust / Ruby: placeholder. Capture once a local clone exists.
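
The cluster sizes quoted in the list above can be regenerated from any captured baseline to build a triage queue. A minimal sketch, assuming a findings list in the baseline schema (review_queue and its top parameter are hypothetical):

```python
from collections import Counter


def review_queue(findings, top=5):
    """Rank rule_ids by how many of their findings still need review."""
    pending = Counter(
        f["rule_id"] for f in findings if f["verdict"] == "needs_review"
    )
    # most_common returns (rule_id, count) pairs, largest cluster first.
    return pending.most_common(top)
```

Findings already labelled TP or FP are excluded, so the queue shrinks as triage progresses.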
--lang runner usage
# diff mode (default)
scripts/validate_recall.sh --lang php drupal /Users/me/oss/drupal
scripts/validate_recall.sh --lang java openmrs /Users/me/oss/openmrs
# capture / refresh
scripts/validate_recall.sh --lang go gin /Users/me/oss/gin --capture
Output is the same { added, removed, unchanged, *_total } JSON shape
as the JS-target diff. The diff key is (rule_id, path_suffix, line).
Cross-language refresh procedure
- Clone or update the target into ~/oss/<target> (or wherever).
- Build nyx: cargo build --release.
- Diff vs the frozen baseline: scripts/validate_recall.sh --lang <lang> <target> ~/oss/<target>.
- If the lift is intentional, recapture with --capture.
- Spot-check new findings; hand-label TP/FP.
- Commit the updated tests/recall_targets/xlang/<lang>/<target>.json.
Sandbox-capture caveat
Pitboss implementer agents run sandboxed without network egress, so
target repos that are not already present under ~/oss/ ship as
placeholders (pinned_commit: "unknown", findings: []). The
current cross-language baselines cover php / java / python / go
(every target whose repo was already cloned locally) and ship
placeholders for rust/axum, ruby/rails, and python/flask. The
schema test in validate_real_world_targets passes against
placeholders because [] is a valid findings array.
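
To see which baselines are still placeholders, one option is to scan the baseline tree for files matching the placeholder shape described above. This is a sketch under that assumption; is_placeholder and list_placeholders are hypothetical helpers, and the root directory is passed in by the caller.

```python
import json
from pathlib import Path


def is_placeholder(doc):
    """True when a baseline has no findings and pinned_commit is "unknown"."""
    return (
        isinstance(doc, dict)
        and doc.get("pinned_commit") == "unknown"
        and doc.get("findings") == []
    )


def list_placeholders(root):
    """Yield JSON baseline paths under root (searched recursively) that are placeholders."""
    for path in sorted(Path(root).rglob("*.json")):
        if is_placeholder(json.loads(path.read_text())):
            yield path
```

Calling list(list_placeholders("tests/recall_targets")) from the repository root would cover both the JS-target files and the xlang subtree.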
What lives where (quick reference)
- Targets list and recall-item mapping in this file.
- Per-target JS findings under tests/recall_targets/<target>.json.
- Per-target cross-lang findings under tests/recall_targets/xlang/<lang>/<target>.json.
- Diff/capture runner at scripts/validate_recall.sh (accepts --lang).
- Schema-validity test at tests/recall_gaps.rs::validate_real_world_targets.
- Corpus regression baseline at tests/recall_gaps_baseline.json.
- Perf records at tests/recall_targets/perf_after.txt (JS-target snapshot) and tests/recall_targets/perf_after_xlang.txt (cross-language delta).