Syed A. Hashmi c8079ac971
signals: feature parity with the latest Signals paper. Porting logic from python repo (#903)
* signals: port to layered taxonomy with dual-emit OTel

Made-with: Cursor

* fix: silence collapsible_match clippy lint (rustc 1.95)

Made-with: Cursor

* test: parity harness for rust vs python signals analyzer

Validates the brightstaff signals port against the katanemo/signals Python
reference on lmsys/lmsys-chat-1m. Adds a signals_replay bin emitting python-
compatible JSON, a pyarrow-based driver (bypasses the datasets loader pickle
bug on python 3.14), a 3-tier comparator, and an on-demand workflow_dispatch
CI job.

Made-with: Cursor

* Remove signals test from the gitops flow

* style: format parity harness with black

Made-with: Cursor

* signals: group summary by taxonomy, factor misalignment_ratio

Addresses #903 review feedback from @nehcgs:

- generate_summary() now renders explicit Interaction / Execution /
  Environment headers so the paper taxonomy is visible at a glance,
  even when no signals fired in a given layer. Quality-driving callouts
  (high misalignment rate, looping detected, escalation requested) are
  appended after the layer summary as an alerts tail.

- repair_ratio (legacy taxonomy name) renamed to misalignment_ratio
  and factored into a single InteractionSignals::misalignment_ratio()
  helper so assess_quality and generate_summary share one source of
  truth instead of recomputing the same divide twice.

Two new unit tests pin the layer headers and the (sev N) severity
suffix. Parity with the python reference is preserved at the Tier-A
level (per-type counts + overall_quality); only the human-readable
summary string diverges, which the parity comparator already classifies
as Tier-C.

Made-with: Cursor
2026-04-23 12:02:30 -07:00

Signals Parity Harness

Validates that crates/brightstaff/src/signals/ (the Rust port) produces the same SignalReport as the Python reference at https://github.com/katanemo/signals on a fixed sample of lmsys/lmsys-chat-1m conversations.

This harness is not part of normal CI. It downloads several GB and is run on demand to gate releases of the signals subsystem (or to investigate regressions reported in production).

What gets compared

For each conversation, both analyzers emit a SignalReport. The comparator classifies any divergence into three tiers:

  • Tier A — set of SignalType present, per-type counts, overall_quality. Any divergence fails the run.
  • Tier B — per-instance message_index, instance counts per type. Divergence is logged and collected but does not fail the run.
  • Tier C — metadata, snippet text, summary. Information only.

Quality buckets are compared by string (excellent / good / ...).

What this harness does not cover

lmsys-chat-1m is plain user/assistant chat. It exercises the interaction layer well (misalignment, stagnation, disengagement, satisfaction) but does not exercise:

  • execution.failure.*
  • execution.loops.*
  • environment.exhaustion.*

Those signals require function_call / observation ShareGPT roles. They are covered by the Rust unit tests and the Python repo's own test fixtures, both of which run on every PR. A synthetic tool-trace dataset for full coverage is deferred to a follow-up.
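For orientation, a synthetic conversation that would exercise the execution layer might look like the sketch below. The from/value keys follow the common ShareGPT convention, and the repeated call/observation shape is illustrative; the exact trace format the analyzers expect is not specified here.

```python
import json

# Hypothetical ShareGPT tool trace. The "function_call" and "observation"
# roles are the ones this README says the execution/environment signals need.
tool_trace = {
    "conversations": [
        {"from": "human", "value": "What's the weather in Lisbon?"},
        {"from": "function_call", "value": json.dumps({"name": "get_weather", "arguments": {"city": "Lisbon"}})},
        {"from": "observation", "value": json.dumps({"error": "timeout"})},
        {"from": "function_call", "value": json.dumps({"name": "get_weather", "arguments": {"city": "Lisbon"}})},
        {"from": "observation", "value": json.dumps({"error": "timeout"})},
        {"from": "gpt", "value": "I couldn't reach the weather service."},
    ]
}
# The identical call/observation pair repeated back-to-back is the shape a
# loop or failure detector would need to see; plain chat data never has it.
```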

One-time setup

# 1. Build the Rust replay binary.
cd ../../../crates && cargo build --release -p brightstaff --bin signals_replay

# 2. Set up the Python environment for the harness driver.
cd ../tests/parity/signals
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Install the Python signals reference.
#    Either point at a local checkout:
pip install -e /path/to/signals
#    or pull from git:
pip install 'signals @ git+https://github.com/katanemo/signals@<sha>'

Running

source .venv/bin/activate

python run_parity.py \
    --num-samples 2000 \
    --seed 42 \
    --dataset-revision <hf-dataset-revision-sha> \
    --rust-binary ../../../crates/target/release/signals_replay \
    --output-dir out/

python compare.py --output-dir out/

run_parity.py will:

  1. Download lmsys/lmsys-chat-1m (cached in ~/.cache/huggingface).
  2. Pick --num-samples rows under --seed.
  3. Convert each to ShareGPT, write out/conversations.jsonl.
  4. Run the Rust binary as a subprocess → out/rust_reports.jsonl.
  5. Run the Python analyzer in-process → out/python_reports.jsonl.
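Steps 2 and 3 amount to seeded sampling plus a JSONL dump. A minimal sketch, assuming the lmsys/lmsys-chat-1m column names (conversation_id, and a conversation list of role/content turns); the function name is hypothetical:

```python
import json
import random


def sample_and_write(rows: list[dict], num_samples: int, seed: int, out_path: str) -> None:
    """Deterministically sample rows and write them as ShareGPT JSONL (sketch)."""
    rng = random.Random(seed)  # fixed seed -> the same sample on every run
    picked = rng.sample(range(len(rows)), num_samples)
    with open(out_path, "w") as f:
        for i in sorted(picked):
            # Map the HF role/content turns onto ShareGPT from/value turns.
            conv = [{"from": t["role"], "value": t["content"]} for t in rows[i]["conversation"]]
            f.write(json.dumps({"id": rows[i]["conversation_id"], "conversations": conv}) + "\n")
```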

compare.py reads both report files and writes:

  • out/diffs.jsonl — one record per mismatched conversation, with tier + structural diff
  • out/metrics.json — agreement %, per-SignalType confusion matrix, quality-bucket confusion matrix
  • out/summary.md — human-readable PR-ready report

Exit code is non-zero iff any Tier-A divergence is observed.
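That exit-code contract is simple enough to pin down in a sketch. This is not the real compare.py, and the tier_counts key in metrics.json is an assumption:

```python
import json


def tier_a_exit_code(metrics_path: str) -> int:
    """Non-zero iff any Tier-A divergence was recorded (sketch).

    Assumes metrics.json carries a per-tier count under "tier_counts".
    """
    with open(metrics_path) as f:
        metrics = json.load(f)
    return 1 if metrics.get("tier_counts", {}).get("A", 0) > 0 else 0
```

Tier-B and Tier-C divergences leave the exit code at zero, so CI stays green while the diffs are still collected for review.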

Reproducibility

Every run pins:

  • dataset_revision — the HF dataset commit
  • seed — RNG seed for sampling
  • signals_python_version — the version reported by pip show signals
  • plano_git_sha — the output of git rev-parse HEAD in this repo
  • signals_replay_binary_sha256 — the hash of the Rust bin

All are stamped into metrics.json.
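Collecting those stamps is mostly stdlib work. A minimal sketch, assuming the key names listed above; the helper names are hypothetical:

```python
import hashlib
import subprocess


def file_sha256(path: str) -> str:
    """Hash the replay binary so the exact build is pinned."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def git_head_sha() -> str:
    """Record which plano commit produced the Rust analyzer."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()


def stamp(metrics: dict, *, dataset_revision: str, seed: int, rust_binary: str) -> dict:
    """Attach the pinned inputs to the metrics dict before it is written out."""
    metrics["dataset_revision"] = dataset_revision
    metrics["seed"] = seed
    metrics["plano_git_sha"] = git_head_sha()
    metrics["signals_replay_binary_sha256"] = file_sha256(rust_binary)
    return metrics
```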