test: parity harness for rust vs python signals analyzer

Validates the brightstaff signals port against the katanemo/signals Python reference on lmsys/lmsys-chat-1m. Adds a signals_replay bin emitting python-compatible JSON, a pyarrow-based driver (bypasses the datasets loader pickle bug on python 3.14), a 3-tier comparator, and an on-demand workflow_dispatch CI job.

Made-with: Cursor
2026-06-05 14:45:15 +02:00 · 2026-04-22 12:28:22 -07:00 · 2026-04-22 12:28:22 -07:00 · d32ffb0450
commit d32ffb0450
parent bb4ddaa7f2
9 changed files with 1118 additions and 0 deletions
--- a/tests/parity/signals/README.md
+++ b/tests/parity/signals/README.md
@@ -0,0 +1,98 @@
+# Signals Parity Harness
+
+Validates that `crates/brightstaff/src/signals/` (Rust port) produces the same
+`SignalReport` as the Python reference at <https://github.com/katanemo/signals>
+on a fixed sample of `lmsys/lmsys-chat-1m` conversations.
+
+This harness is **not** part of normal CI. It downloads several GB and is run
+on demand to gate releases of the signals subsystem (or to investigate
+regressions reported in production).
+
+## What gets compared
+
+For each conversation, both analyzers emit a `SignalReport`. The comparator
+classifies any divergence into three tiers:
+
+| Tier | Fields compared                                                 | Action on divergence |
+|------|-----------------------------------------------------------------|----------------------|
+| A    | set of `SignalType` present, per-type counts, `overall_quality` | Fail the run |
+| B    | per-instance `message_index`, instance counts per type          | Log + collect, do not fail |
+| C    | metadata, snippet text, summary                                  | Information only |
+
+Quality buckets are compared by string (`excellent` / `good` / ...).
+
+## What this harness does *not* cover
+
+`lmsys-chat-1m` is plain user/assistant chat. It exercises the **interaction**
+layer well (misalignment, stagnation, disengagement, satisfaction) but does
+**not** exercise:
+
+- `execution.failure.*`
+- `execution.loops.*`
+- `environment.exhaustion.*`
+
+Those signals require `function_call` / `observation` ShareGPT roles. They are
+covered by the Rust unit tests and the Python repo's own test fixtures, both
+of which run on every PR. A synthetic tool-trace dataset for full coverage is
+deferred to a follow-up.
+
+## One-time setup
+
+```bash
+# 1. Build the Rust replay binary.
+cd ../../../crates && cargo build --release -p brightstaff --bin signals_replay
+
+# 2. Set up the Python environment for the harness driver.
+cd ../tests/parity/signals
+python3 -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+
+# 3. Install the Python signals reference.
+#    Either point at a local checkout:
+pip install -e /path/to/signals
+#    or pull from git:
+pip install 'signals @ git+https://github.com/katanemo/signals@<sha>'
+```
+
+## Running
+
+```bash
+source .venv/bin/activate
+
+python run_parity.py \
+    --num-samples 2000 \
+    --seed 42 \
+    --dataset-revision <hf-dataset-revision-sha> \
+    --rust-binary ../../../crates/target/release/signals_replay \
+    --output-dir out/
+
+python compare.py --output-dir out/
+```
+
+`run_parity.py` will:
+
+1. Download `lmsys/lmsys-chat-1m` (cached in `~/.cache/huggingface`).
+2. Sample `--num-samples` rows, using `--seed` as the RNG seed.
+3. Convert each to ShareGPT, write `out/conversations.jsonl`.
+4. Run the Rust binary as a subprocess → `out/rust_reports.jsonl`.
+5. Run the Python analyzer in-process → `out/python_reports.jsonl`.
+
+`compare.py` reads both report files and writes:
+
+- `out/diffs.jsonl`     — one record per mismatched conversation, with tier + structural diff
+- `out/metrics.json`    — agreement %, per-`SignalType` confusion matrix, quality-bucket confusion matrix
+- `out/summary.md`      — human-readable PR-ready report
+
+The exit code is non-zero if and only if at least one Tier-A divergence is observed.
+
+## Reproducibility
+
+Every run pins:
+
+- `dataset_revision` — the HF dataset commit
+- `seed` — RNG seed for sampling
+- `signals_python_version` — `pip show signals` version
+- `plano_git_sha` — `git rev-parse HEAD` of this repo
+- `signals_replay_binary_sha256` — the hash of the Rust bin
+
+All are stamped into `metrics.json`.
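
The provenance fields listed under "Reproducibility" could be gathered roughly as follows. This is a minimal sketch, not the harness's actual code: the helper names (`sha256_of_file`, `git_head_sha`, `package_version`, `build_provenance`) are hypothetical, and it assumes the Python reference is installed under the distribution name `signals`, as the `pip show signals` entry suggests.

```python
import hashlib
import subprocess
from importlib import metadata
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large binary is never held fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_head_sha(repo_dir: Path) -> str:
    """Return `git rev-parse HEAD` for the checkout at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def package_version(name: str) -> str:
    """Installed version of a distribution, or "unknown" if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"


def build_provenance(
    dataset_revision: str, seed: int, rust_binary: Path, repo_dir: Path
) -> dict:
    """Assemble the fields the README says are stamped into metrics.json."""
    return {
        "dataset_revision": dataset_revision,
        "seed": seed,
        # Assumption: the reference is installed as the "signals" distribution.
        "signals_python_version": package_version("signals"),
        "plano_git_sha": git_head_sha(repo_dir),
        "signals_replay_binary_sha256": sha256_of_file(rust_binary),
    }
```

A driver could merge the returned dictionary into the metrics object before serializing it, so that two runs are comparable only when all five fields match.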