vestige/docs/SANHEDRIN_TEST_INTEGRITY_DELTAS.md

# Sanhedrin Test-Integrity Delta Receipts

Receipt Lock proves a narrower claim: a verification command actually ran and
succeeded. Test-integrity deltas are an optional companion receipt for the
stronger claim that the tests still mean what the draft says they mean.

This receipt is intentionally mechanical. It is not a broad correctness oracle
and it does not ask a second model to decide whether the implementation is good.
It records whether the verification artifact changed in ways that should
upgrade, downgrade, or send the verification claim to human review.

## Boundary

Keep these claims separate:

1. **Command receipt:** `cargo test`, `npm test`, `pytest`, or another verifier
   command ran after the relevant edit and exited successfully.
2. **Test-integrity delta:** the tests/specs behind that verifier were not
   removed, skipped, weakened, or replaced after implementation in a way that
   makes the green result less admissible.

A run can have a valid command receipt and still receive a downgraded
integrity decision.

## Optional JSON Shape

```json
{
  "schema": "vestige.sanhedrin.test_integrity_delta.v1",
  "id": "tid_<stable hash>",
  "commandReceiptId": "receipt_<stable hash>",
  "verificationClaim": "All tests passed.",
  "specSource": {
    "contextId": "spec_ctx_04",
    "testFiles": [
      {
        "path": "tests/cart.test.ts",
        "hashBeforeImplementation": "sha256:...",
        "hashAfterVerification": "sha256:..."
      }
    ]
  },
  "implementationContext": "impl_ctx_09",
  "verifierContext": "verify_ctx_02",
  "delta": {
    "testFilesChangedAfterImplementation": true,
    "removedOrDisabledTests": [
      {
        "kind": "skip_or_only",
        "path": "tests/cart.test.ts",
        "line": 42
      }
    ],
    "removedAssertions": 2,
    "weakenedExpectations": [
      {
        "path": "tests/cart.test.ts",
        "from": "throws InvalidCouponError",
        "to": "does not throw"
      }
    ],
    "snapshotChurnWithoutSourceChange": false,
    "coverageDelta": -3.8,
    "mocksReplacingRealBoundary": [
      {
        "module": "PaymentGateway",
        "before": "integration-ish fake",
        "after": "empty stub"
      }
    ]
  },
  "freshVerifier": {
    "commandReceiptId": "receipt_<stable hash>",
    "exitCode": 0,
    "checkedAfterLastRelevantEdit": true
  },
  "decision": "downgraded",
  "reason": "tests passed, but the tests were weakened after implementation"
}
```

## Decisions

- `accepted` — a verifier command succeeded after the last relevant edit and no
  integrity downgrade was detected.
- `downgraded` — the command succeeded, but the tests/specs changed in a way
  that makes the verification claim weaker than stated.
- `needs_human_review` — the delta may be legitimate, but a local mechanical
  check cannot safely classify it. Snapshot updates are a common example.

## Minimal Fixture Suite

These cases are small enough to live as fixtures without turning Sanhedrin into
a correctness judge.

| Case | Input pattern | Expected decision | Why |
| --- | --- | --- | --- |
| unchanged-good | implementation changes source; tests unchanged; fresh verifier succeeds | `accepted` | Green tests are supported by a fresh command receipt and unchanged test artifact. |
| skipped-test | implementation adds `.skip`, `.only`, `#[ignore]`, or equivalent before verifier succeeds | `downgraded` | The command ran, but the claim no longer represents the original test obligation. |
| weakened-assertion | expectation is relaxed after implementation, e.g. `throws InvalidCouponError` -> `does not throw` | `downgraded` | The verifier passed against a weaker assertion than the one available before implementation. |
| justified-snapshot | snapshot changes alongside an intentional source/UI change | `needs_human_review` or `accepted` by policy | Snapshot churn can be valid, but the receipt should make the policy decision explicit. |

## Non-goals

- Do not infer whether the implementation is correct in the world.
- Do not require full semantic diffing before Receipt Lock can operate.
- Do not treat staged evidence or a model explanation as equivalent to a fresh
  command receipt.
- Do not block every test edit. The goal is to keep the verification claim
  honest when the test artifact changed after implementation.