mirror of
https://github.com/samvallad33/vestige.git
synced 2026-06-20 21:18:08 +02:00
Merge remote-tracking branch 'origin/main' into codex/opencode-sigill-salvage
commit
ea5ed28081
26 changed files with 6997 additions and 91 deletions
159
docs/COMPOSED_GRAPH.md
Normal file
@@ -0,0 +1,159 @@

# ComposedGraph

ComposedGraph records memory combinations as durable reasoning events.

Most memory systems store facts, entities, or relationships. ComposedGraph stores a
different object: which memories were used together, why they were used, and what
happened afterward.

## Model

`composition_events` stores the reasoning envelope:

- tool and mode, such as `deep_reference` or `bounty`
- query and query hash
- confidence, status, and output preview
- metadata for intent, analyzed memory count, activation expansion, and reasoning preview

`composition_members` stores the participating memories:

- memory id
- role, such as `primary`, `supporting`, `contradicting`, or `superseded`
- rank, trust, relevance score, preview, and metadata

`composition_outcomes` stores later labels:

- `helpful`
- `dead_end`
- `submitted`
- `accepted`
- `rejected`
- `duplicate_risk`
- `needs_poc`
- `bad_severity`
- `user_promoted`
- `user_demoted`
- `closed_by_scope`
- `closed_by_duplicate`
- `closed_by_false_assumption`
- `closed_by_user`
- `expired_lane`

Member memory ids are intentionally historical references, not foreign keys into
`knowledge_nodes`. Purging or superseding a memory should not erase the fact that
it once participated in a reasoning path.
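
The three record kinds described in the Model section can be pictured with a minimal in-memory sketch. This is not the project's storage code: the class names, the field subset, and the `purge_memory` helper are hypothetical and only illustrate why member ids are kept as plain historical references rather than foreign keys.

```python
from dataclasses import dataclass, field

# Outcome labels as listed in the Model section.
OUTCOME_TYPES = frozenset({
    "helpful", "dead_end", "submitted", "accepted", "rejected",
    "duplicate_risk", "needs_poc", "bad_severity", "user_promoted",
    "user_demoted", "closed_by_scope", "closed_by_duplicate",
    "closed_by_false_assumption", "closed_by_user", "expired_lane",
})


@dataclass
class CompositionMember:
    memory_id: str  # plain string: a historical reference, not a foreign key
    role: str       # e.g. "primary", "supporting", "contradicting", "superseded"
    rank: int
    trust: float
    relevance_score: float


@dataclass
class CompositionOutcome:
    outcome_type: str
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject labels that are not in the documented list.
        if self.outcome_type not in OUTCOME_TYPES:
            raise ValueError(f"unknown outcome type: {self.outcome_type}")


@dataclass
class CompositionEvent:
    event_id: str
    tool: str
    query: str
    confidence: float
    members: list[CompositionMember] = field(default_factory=list)
    outcomes: list[CompositionOutcome] = field(default_factory=list)


def purge_memory(knowledge_nodes: dict[str, str], memory_id: str) -> None:
    """Hypothetical purge: drops the live memory, never touches ledger events."""
    knowledge_nodes.pop(memory_id, None)
```

Because the member only holds a string id, purging `m1` from the live store leaves every event that once used `m1` readable.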

## MCP Tool

Use `composed_graph` for read/write access to the composition ledger.

```json
{ "action": "recent", "limit": 10 }
```

```json
{ "action": "get", "event_id": "<composition-event-id>" }
```

```json
{ "action": "memory", "memory_id": "<memory-id>", "limit": 10 }
```

```json
{ "action": "neighbors", "memory_id": "<memory-id>", "limit": 10 }
```

```json
{ "action": "never_composed", "tags": ["project:vestige"], "limit": 10 }
```

```json
{
  "action": "label",
  "event_id": "<composition-event-id>",
  "outcome_type": "helpful",
  "notes": "This combination led to the accepted fix."
}
```
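
A client can build the request payloads above with a small helper. The sketch below is an assumption-laden illustration: `build_request` and the `REQUIRED_FIELDS` table are hypothetical, and the required fields are inferred from the examples rather than taken from the tool's actual schema.

```python
import json

# Required fields per action, inferred from the examples above (an assumption,
# not the tool's published schema).
REQUIRED_FIELDS = {
    "recent": set(),
    "get": {"event_id"},
    "memory": {"memory_id"},
    "neighbors": {"memory_id"},
    "never_composed": set(),
    "label": {"event_id", "outcome_type"},
}


def build_request(action: str, **params) -> str:
    """Return a JSON payload for a hypothetical `composed_graph` call."""
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action}")
    missing = REQUIRED_FIELDS[action] - set(params)
    if missing:
        raise ValueError(f"{action} is missing fields: {sorted(missing)}")
    return json.dumps({"action": action, **params})
```

For example, `build_request("recent", limit=10)` produces the first payload shown in this section, while `build_request("label", event_id="e1")` raises because `outcome_type` is absent.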

## Never-Composed Frontier

`never_composed` returns pairs that have not yet appeared together in a
composition event.

The ranking is intentionally not just shared-tag matching. It combines:

- exact shared tags
- shared meaningful content terms
- boundary tags such as `boundary-*`, `oracle`, `queue`, `settlement`, `upgrade`,
  `pause`, `accounting`, or `scope`
- node-type diversity
- FSRS retention strength
- composition novelty, so memories that have not already been heavily composed
  still get surfaced
- prior composition outcomes from either member, so previously accepted,
  duplicate-risk, or dead-end lanes shape the frontier without hiding it

Each candidate includes:

- `score`
- `noveltyScore`
- `bridgeScore`
- `trustScore`
- `outcomeScoreAdjustment`
- `sharedTags`
- `boundaryTags`
- `sharedTerms`
- `priorOutcomes`
- `outcomeSignal`, such as `clean`, `prior_success`, `prior_duplicate_risk`,
  `prior_closed_door`, or `mixed_prior_outcomes`
- node types
- previews
- a short reason
- a `compositionQuestion` that an agent can answer before taking action

The output is a frontier queue, not a finding. A never-composed pair means
"worth investigating," not "true," "novel," or "reportable."
Prior outcomes are also guardrails, not verdicts: a duplicate-risk signal should
make the agent check duplicate families first, while a success signal should make
it inspect why the older composition worked.

Closed-door labels should be specific when possible. Prefer `closed_by_scope`,
`closed_by_duplicate`, `closed_by_false_assumption`, `closed_by_user`, or
`expired_lane` over a generic `dead_end` when the reason is known.
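
The frontier idea described in this section can be sketched in two steps: enumerate pairs that never shared a composition event, then blend the ranking signals into one score. Everything below is illustrative. The function names, the saturation caps, and the weights are assumptions chosen for readability, not the values the project uses.

```python
from itertools import combinations


def never_composed_pairs(memory_ids, events):
    """Return sorted id pairs that never appeared together in any event.

    `events` is an iterable of member-id collections, one per composition event.
    """
    seen = set()
    for members in events:
        for pair in combinations(sorted(set(members)), 2):
            seen.add(pair)
    all_pairs = combinations(sorted(set(memory_ids)), 2)
    return [pair for pair in all_pairs if pair not in seen]


def candidate_score(shared_tags, shared_terms, boundary_tags, node_types,
                    retention, prior_compositions, outcome_adjustment=0.0):
    """Blend the listed signals; weights are hypothetical."""
    # Saturating counts keep any single signal from dominating.
    tag_signal = min(len(shared_tags), 3) / 3
    term_signal = min(len(shared_terms), 5) / 5
    bridge_signal = min(len(boundary_tags), 2) / 2
    diversity = 1.0 if len(set(node_types)) > 1 else 0.0
    # Novelty decays as the members accumulate earlier compositions.
    novelty = 1.0 / (1.0 + prior_compositions)
    base = (0.2 * tag_signal + 0.2 * term_signal + 0.2 * bridge_signal
            + 0.1 * diversity + 0.15 * retention + 0.15 * novelty)
    # Prior outcomes shift the score but never remove the candidate.
    return base + outcome_adjustment
```

Keeping `outcome_adjustment` additive mirrors the stated intent that prior outcomes shape the frontier without hiding it: a negative adjustment lowers a lane's rank but the pair is still returned.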

## Bounty / Research Mode

`bounty_mode` is a higher-level read shape for investigative workflows. It returns:

- recent already-composed lanes
- never-composed lanes
- closed doors
- duplicate-risk lanes
- lanes that need proof-of-concept work
- top weird combinations

This is useful for security research, bug triage, architecture work, and product
strategy because failed or duplicate compositions are preserved instead of being
rediscovered repeatedly.
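
One way to picture part of this read shape is a function that buckets already-labeled events by their outcome labels. The sketch is hypothetical: `bounty_view`, the bucket names, the input dict shape, and the precedence between buckets are assumptions, and the never-composed and "weird combination" lanes are left out.

```python
# Labels treated as closed doors, following the closed-door guidance in this doc.
CLOSED_DOOR_LABELS = {
    "dead_end", "closed_by_scope", "closed_by_duplicate",
    "closed_by_false_assumption", "closed_by_user", "expired_lane",
}


def bounty_view(events):
    """Group event ids into lanes; `events` are dicts with `event_id` and `outcomes`."""
    view = {"closed_doors": [], "duplicate_risk": [], "needs_poc": [], "open": []}
    for event in events:
        labels = set(event["outcomes"])
        # Assumed precedence: a closed door wins over softer signals.
        if labels & CLOSED_DOOR_LABELS:
            view["closed_doors"].append(event["event_id"])
        elif "duplicate_risk" in labels:
            view["duplicate_risk"].append(event["event_id"])
        elif "needs_poc" in labels:
            view["needs_poc"].append(event["event_id"])
        else:
            view["open"].append(event["event_id"])
    return view
```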

## Deep Reference Integration

`deep_reference` persists composition events automatically when it has evidence
members. Empty evidence does not create a ledger event.

The response includes:

- `composition_event_id` when persisted
- `compositionWriteStatus`, usually `persisted` or `skipped_empty`
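
The persistence rule above amounts to a guard on the evidence list. A minimal sketch, assuming an in-memory list as the ledger and a hypothetical `persist_composition` helper; only the two response field names and the two status values come from this document.

```python
import uuid


def persist_composition(ledger: list, query: str, members: list) -> dict:
    """Write an event only when there is evidence; report what happened."""
    if not members:
        # Empty evidence: no ledger event, and no event id in the response.
        return {"compositionWriteStatus": "skipped_empty"}
    event_id = str(uuid.uuid4())
    ledger.append({"event_id": event_id, "query": query, "members": list(members)})
    return {"composition_event_id": event_id, "compositionWriteStatus": "persisted"}
```

Callers can branch on `compositionWriteStatus` and should not assume `composition_event_id` is present.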

## Design Direction

The next useful upgrades are:

- triple or n-ary candidate mining, not only pairs
- structural-fit scoring for analogies, separate from surface similarity
- trust-zone scoring so a composition is limited by its weakest provenance
- temporal replay: "what combinations were available when this decision was made?"
- evaluation tasks where success requires combining memories that were never
  previously co-composed
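
The trust-zone item can be read as a weakest-link rule. The sketch below is one possible interpretation of a proposed upgrade, not existing behavior; `composition_trust` is a hypothetical name and trust values are assumed to lie in the range 0 to 1.

```python
def composition_trust(member_trust: list[float]) -> float:
    """Weakest-link trust: a composition is no more trusted than its least trusted member."""
    if not member_trust:
        raise ValueError("a composition needs at least one member")
    return min(member_trust)
```

With member trust values `[0.9, 0.8, 0.2]` the result is `0.2`, whereas an average would report about `0.63` and mask the weak provenance.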

@@ -12,6 +12,8 @@ instead of opaque. The current schema is `vestige.sanhedrin.receipt.v1`.

- Appeals: `~/.vestige/sanhedrin/appeals.jsonl`
- Fail-open events: `~/.vestige/sanhedrin/fail-open.jsonl`

Optional companion schema: [`SANHEDRIN_TEST_INTEGRITY_DELTAS.md`](SANHEDRIN_TEST_INTEGRITY_DELTAS.md) describes mechanical deltas for cases where a verifier command passed but the test artifact changed after implementation.

## v1 JSON Shape

```json
```

110
docs/SANHEDRIN_TEST_INTEGRITY_DELTAS.md
Normal file
@@ -0,0 +1,110 @@

# Sanhedrin Test-Integrity Delta Receipts

Receipt Lock proves a narrower claim: a verification command actually ran and
succeeded. Test-integrity deltas are an optional companion receipt for the
stronger claim that the tests still mean what the draft says they mean.

This receipt is intentionally mechanical. It is not a broad correctness oracle
and it does not ask a second model to decide whether the implementation is good.
It records whether the verification artifact changed in ways that should
upgrade, downgrade, or send the verification claim to human review.

## Boundary

Keep these claims separate:

1. **Command receipt:** `cargo test`, `npm test`, `pytest`, or another verifier
   command ran after the relevant edit and exited successfully.
2. **Test-integrity delta:** the tests/specs behind that verifier were not
   removed, skipped, weakened, or replaced after implementation in a way that
   makes the green result less admissible.

A run can have a valid command receipt and still receive a downgraded
integrity decision.

## Optional JSON Shape

```json
{
  "schema": "vestige.sanhedrin.test_integrity_delta.v1",
  "id": "tid_<stable hash>",
  "commandReceiptId": "receipt_<stable hash>",
  "verificationClaim": "All tests passed.",
  "specSource": {
    "contextId": "spec_ctx_04",
    "testFiles": [
      {
        "path": "tests/cart.test.ts",
        "hashBeforeImplementation": "sha256:...",
        "hashAfterVerification": "sha256:..."
      }
    ]
  },
  "implementationContext": "impl_ctx_09",
  "verifierContext": "verify_ctx_02",
  "delta": {
    "testFilesChangedAfterImplementation": true,
    "removedOrDisabledTests": [
      {
        "kind": "skip_or_only",
        "path": "tests/cart.test.ts",
        "line": 42
      }
    ],
    "removedAssertions": 2,
    "weakenedExpectations": [
      {
        "path": "tests/cart.test.ts",
        "from": "throws InvalidCouponError",
        "to": "does not throw"
      }
    ],
    "snapshotChurnWithoutSourceChange": false,
    "coverageDelta": -3.8,
    "mocksReplacingRealBoundary": [
      {
        "module": "PaymentGateway",
        "before": "integration-ish fake",
        "after": "empty stub"
      }
    ]
  },
  "freshVerifier": {
    "commandReceiptId": "receipt_<stable hash>",
    "exitCode": 0,
    "checkedAfterLastRelevantEdit": true
  },
  "decision": "downgraded",
  "reason": "tests passed, but the tests were weakened after implementation"
}
```
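
The `testFiles` entries and the `testFilesChangedAfterImplementation` flag in the shape above can be derived from two content snapshots. This is a minimal sketch: the helper names are hypothetical, and hashing the raw file bytes with SHA-256 is an assumption suggested by the `sha256:` prefix rather than a documented rule.

```python
import hashlib


def sha256_tag(data: bytes) -> str:
    """Format a digest the way the example receipt writes hashes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def make_test_file_entry(path: str, before: bytes, after: bytes) -> dict:
    """Build one `specSource.testFiles` entry from two snapshots of a test file."""
    return {
        "path": path,
        "hashBeforeImplementation": sha256_tag(before),
        "hashAfterVerification": sha256_tag(after),
    }


def changed_after_implementation(entries: list[dict]) -> bool:
    """True when any test file differs between the two snapshots."""
    return any(
        entry["hashBeforeImplementation"] != entry["hashAfterVerification"]
        for entry in entries
    )
```

A hash mismatch only says that the artifact changed. Classifying the change as a skip, a weakened assertion, or snapshot churn needs the finer-grained `delta` fields.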

## Decisions

- `accepted`: a verifier command succeeded after the last relevant edit and no
  integrity downgrade was detected.
- `downgraded`: the command succeeded, but the tests/specs changed in a way
  that makes the verification claim weaker than stated.
- `needs_human_review`: the delta may be legitimate, but a local mechanical
  check cannot safely classify it. Snapshot updates are a common example.
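
The three decisions can be expressed as a small mechanical classifier over the `delta` and `freshVerifier` objects from the JSON shape. The sketch below is one possible policy, not the project's rule set: `decide` is a hypothetical name, `coverageDelta` is ignored, any unexplained test-file change is sent to review, and a failed or stale verifier raises because this document defines no decision for that case.

```python
def decide(delta: dict, fresh_verifier: dict) -> str:
    """Classify a test-integrity delta as accepted, downgraded, or needs_human_review."""
    verifier_ok = (
        fresh_verifier.get("exitCode") == 0
        and fresh_verifier.get("checkedAfterLastRelevantEdit") is True
    )
    if not verifier_ok:
        # Assumption: without a fresh command receipt there is nothing to classify.
        raise ValueError("no fresh successful command receipt")

    weakened = any([
        bool(delta.get("removedOrDisabledTests")),
        delta.get("removedAssertions", 0) > 0,
        bool(delta.get("weakenedExpectations")),
        bool(delta.get("mocksReplacingRealBoundary")),
    ])
    if weakened:
        return "downgraded"

    # Changes that are not clearly weakening cannot be classified mechanically.
    if delta.get("snapshotChurnWithoutSourceChange") or delta.get(
        "testFilesChangedAfterImplementation"
    ):
        return "needs_human_review"
    return "accepted"
```

Applied to the example receipt, the skipped test, removed assertions, weakened expectation, and stubbed `PaymentGateway` each trigger `downgraded` on their own.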

## Minimal Fixture Suite

These cases are small enough to live as fixtures without turning Sanhedrin into
a correctness judge.

| Case | Input pattern | Expected decision | Why |
| --- | --- | --- | --- |
| unchanged-good | implementation changes source; tests unchanged; fresh verifier succeeds | `accepted` | Green tests are supported by a fresh command receipt and unchanged test artifact. |
| skipped-test | implementation adds `.skip`, `.only`, `#[ignore]`, or equivalent before verifier succeeds | `downgraded` | The command ran, but the claim no longer represents the original test obligation. |
| weakened-assertion | expectation is relaxed after implementation, e.g. `throws InvalidCouponError` -> `does not throw` | `downgraded` | The verifier passed against a weaker assertion than the one available before implementation. |
| justified-snapshot | snapshot changes alongside an intentional source/UI change | `needs_human_review` or `accepted` by policy | Snapshot churn can be valid, but the receipt should make the policy decision explicit. |
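
The skipped-test case in the table above lends itself to a purely textual check. The following sketch assumes a crude heuristic: a line counts as newly added if its exact text did not occur anywhere in the earlier snapshot, and only the three markers named in the table are recognized. `added_skip_markers` is a hypothetical helper, not part of Sanhedrin.

```python
import re

# Markers named in the skipped-test fixture; "or equivalent" is not covered here.
SKIP_MARKERS = re.compile(r"\.skip\b|\.only\b|#\[ignore\]")


def added_skip_markers(before: str, after: str) -> list[dict]:
    """Report skip/only/ignore markers on lines that did not exist before."""
    before_lines = set(before.splitlines())
    hits = []
    for lineno, line in enumerate(after.splitlines(), start=1):
        if line not in before_lines and SKIP_MARKERS.search(line):
            hits.append({"kind": "skip_or_only", "line": lineno})
    return hits
```

Each hit has the same `kind` and `line` keys as a `removedOrDisabledTests` entry, so a caller only needs to attach the file path.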

## Non-goals

- Do not infer whether the implementation is correct in the world.
- Do not require full semantic diffing before Receipt Lock can operate.
- Do not treat staged evidence or a model explanation as equivalent to a fresh
  command receipt.
- Do not block every test edit. The goal is to keep the verification claim
  honest when the test artifact changed after implementation.