# Signals Parity Harness

Validates that `crates/brightstaff/src/signals/` (Rust port) produces the same
`SignalReport` as the Python reference at <https://github.com/katanemo/signals>
on a fixed sample of `lmsys/lmsys-chat-1m` conversations.

This harness is **not** part of normal CI. It downloads several GB and is run
on demand to gate releases of the signals subsystem (or to investigate
regressions reported in production).

## What gets compared

For each conversation, both analyzers emit a `SignalReport`. The comparator
classifies any divergence into three tiers:

| Tier | Field                                                            | Action on divergence       |
|------|------------------------------------------------------------------|----------------------------|
| A    | set of `SignalType` present, per-type counts, `overall_quality`  | Fail the run               |
| B    | per-instance `message_index`, instance counts per type           | Log + collect, do not fail |
| C    | metadata, snippet text, summary                                  | Information only           |

Quality buckets are compared by string (`excellent` / `good` / ...).
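
The tier checks are small enough to sketch. A minimal illustration, assuming
each report is parsed as a JSON dict with a `signals` list of
`{"type": ..., "message_index": ...}` records and an `overall_quality` string
(these field shapes are assumptions for illustration; the real schema lives in
`compare.py`):

```python
from collections import Counter


def tier_a_mismatch(rust: dict, py: dict) -> bool:
    """True on any release-gating (Tier-A) divergence."""
    rust_counts = Counter(s["type"] for s in rust["signals"])
    py_counts = Counter(s["type"] for s in py["signals"])
    # Tier A: set of SignalTypes present, per-type counts, overall_quality.
    # Comparing the Counters covers both the set and the per-type counts.
    return rust_counts != py_counts or rust["overall_quality"] != py["overall_quality"]


def tier_b_mismatch(rust: dict, py: dict) -> bool:
    # Tier B: per-instance message_index values, compared order-insensitively.
    def key(s):
        return (s["type"], s["message_index"])

    return sorted(map(key, rust["signals"])) != sorted(map(key, py["signals"]))
```
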
## What this harness does *not* cover

`lmsys-chat-1m` is plain user/assistant chat. It exercises the **interaction**
layer well (misalignment, stagnation, disengagement, satisfaction) but does
**not** exercise:

- `execution.failure.*`
- `execution.loops.*`
- `environment.exhaustion.*`

Those signals require `function_call` / `observation` ShareGPT roles. They are
covered by the Rust unit tests and the Python repo's own test fixtures, both
of which run on every PR. A synthetic tool-trace dataset for full coverage is
deferred to a follow-up.
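
To make the gap concrete, a tool-trace conversation in ShareGPT form looks
roughly like this. The `function_call` / `observation` role names come from
the paragraph above; everything else is an invented example:

```python
# Hypothetical ShareGPT-style tool trace. The function_call / observation
# turns are what the execution-layer detectors key on; lmsys-chat-1m rows
# never contain them. Content values are made up for illustration.
tool_trace = {
    "conversations": [
        {"from": "human", "value": "What's the weather in Oslo?"},
        {"from": "function_call", "value": '{"name": "get_weather", "arguments": {"city": "Oslo"}}'},
        {"from": "observation", "value": '{"error": "upstream timeout"}'},
        {"from": "gpt", "value": "I couldn't reach the weather service."},
    ]
}
```
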
## One-time setup

```bash
# 1. Build the Rust replay binary.
cd ../../../crates && cargo build --release -p brightstaff --bin signals_replay

# 2. Set up the Python environment for the harness driver.
cd ../tests/parity/signals
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Install the Python signals reference.
# Either point at a local checkout:
pip install -e /path/to/signals
# or pull from git:
pip install 'signals @ git+https://github.com/katanemo/signals@<sha>'
```

## Running

```bash
source .venv/bin/activate

python run_parity.py \
  --num-samples 2000 \
  --seed 42 \
  --dataset-revision <hf-dataset-revision-sha> \
  --rust-binary ../../../crates/target/release/signals_replay \
  --output-dir out/

python compare.py --output-dir out/
```

`run_parity.py` will:

1. Download `lmsys/lmsys-chat-1m` (cached in `~/.cache/huggingface`).
2. Pick `--num-samples` rows using `--seed`.
3. Convert each row to ShareGPT form and write `out/conversations.jsonl`.
4. Run the Rust binary as a subprocess → `out/rust_reports.jsonl`.
5. Run the Python analyzer in-process → `out/python_reports.jsonl`.

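Steps 4 and 5 are the two replay legs. A minimal sketch, assuming
`signals_replay` reads one ShareGPT conversation JSON per stdin line and
writes one python-compatible `SignalReport` JSON per stdout line, and that the
reference package exposes an `analyze` entry point (both are assumptions; the
actual contract lives in `run_parity.py`):

```python
import json
import subprocess
from pathlib import Path


def replay_rust(binary: Path, conversations: Path, reports: Path) -> None:
    # Subprocess leg: assumed line-oriented stdin/stdout contract.
    with conversations.open() as src, reports.open("w") as dst:
        subprocess.run([str(binary)], stdin=src, stdout=dst, check=True)


def replay_python(conversations: Path, reports: Path) -> None:
    # In-process leg: the reference analyzer is imported, not shelled out.
    from signals import analyze  # hypothetical entry point of the reference package

    with conversations.open() as src, reports.open("w") as dst:
        for line in src:
            report = analyze(json.loads(line))
            dst.write(json.dumps(report) + "\n")
```
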
`compare.py` reads both report files and writes:

- `out/diffs.jsonl` — one record per mismatched conversation, with tier + structural diff
- `out/metrics.json` — agreement %, per-`SignalType` confusion matrix, quality-bucket confusion matrix
- `out/summary.md` — human-readable, PR-ready report

Exit code is non-zero iff any Tier-A divergence is observed.

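For orientation, one `out/diffs.jsonl` record might look like the following.
The `tier` values match the table in "What gets compared"; every other field
name and value here is an invented illustration:

```python
# Hypothetical shape of a single diffs.jsonl record.
diff_record = {
    "conversation_id": "lmsys-000123",
    "tier": "B",
    "diff": {
        "signal_type": "interaction.misalignment",
        "rust_message_indices": [3, 7],
        "python_message_indices": [3, 8],
    },
}
```
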
## Reproducibility

Every run pins:

- `dataset_revision` — the HF dataset commit
- `seed` — RNG seed for sampling
- `signals_python_version` — `pip show signals` version
- `plano_git_sha` — `git rev-parse HEAD` of this repo
- `signals_replay_binary_sha256` — SHA-256 of the Rust binary

All are stamped into `metrics.json`.
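
A sketch of the resulting stamp as it might appear inside `metrics.json`. The
key names are the five listed above; the values are placeholders, not real
hashes or versions:

```python
# Illustrative only: the reproducibility stamp with placeholder values.
reproducibility_stamp = {
    "dataset_revision": "<hf-dataset-revision-sha>",
    "seed": 42,
    "signals_python_version": "<pip show signals>",
    "plano_git_sha": "<git rev-parse HEAD>",
    "signals_replay_binary_sha256": "<sha256 of signals_replay>",
}
```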