mirror of
https://github.com/samvallad33/vestige.git
synced 2026-07-02 22:01:01 +02:00
docs(demo): full run-it-yourself README + unify failure detection
demo/README.md: the complete self-serve demo artifact — one-command run, the seeded scenario explained, a "build your own scenario" section, the honest boundary (won't invent a cause; can't reach a cause that was never recorded), the Nature citation + the "field admits this is unsolved" sources, and the recording playbook + paste-ready caption. Writing/testing the README surfaced a real inconsistency, now fixed: - The CLI's failure-finder used a hardcoded content-only marker subset and ignored tags, so a "Checkout latency spiked" memory (regression tag, no crash word in content) was never picked as the failure. The CLI now calls the SAME public `looks_like_failure` (content + tags, full list) the backfill tool uses — one definition, no drift. - Extended FAILURE_MARKERS with performance/degradation failures (spiked, latency, degraded, slow, hang, throttled, oom, 502/503/504, flaky, ...) so the feature backfills from perf regressions, not just hard crashes. clippy clean; 527 core + 453 mcp tests; both the main demo and the README's custom scenario verified end-to-end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
988a31c207
commit
561b2301db
4 changed files with 171 additions and 8 deletions
|
|
@ -53,6 +53,9 @@ pub const FAILURE_MARKERS: &[&str] = &[
|
|||
"error", "bug", "crash", "crashed", "regression", "broke", "broken",
|
||||
"failure", "failed", "panic", "exception", "fault", "outage", "incident",
|
||||
"500", "timeout", "deadlock", "leak", "corrupt", "stack overflow",
|
||||
// performance/degradation failures (an agent should backfill from these too)
|
||||
"spiked", "latency", "degraded", "slow", "hang", "hung", "throttled",
|
||||
"oom", "502", "503", "504", "rejected", "denied", "flaky",
|
||||
];
|
||||
|
||||
/// How strongly to promote the backfilled cause: multiply its stability by this
|
||||
|
|
|
|||
|
|
@ -2610,18 +2610,17 @@ fn run_backfill(
|
|||
}
|
||||
|
||||
// Resolve the failure text up front (used by the contrast baseline).
|
||||
// Use the SAME failure detector the backfill tool uses (content + tags, full
|
||||
// marker list) so the CLI's pick and the tool's pick never diverge.
|
||||
let failure_text: Option<String> = match &failure_id {
|
||||
Some(id) => storage.get_node(id).ok().flatten().map(|n| n.content),
|
||||
None => storage
|
||||
.get_all_nodes(500, 0)
|
||||
.ok()
|
||||
.and_then(|nodes| {
|
||||
nodes.into_iter().find(|n| {
|
||||
let hay = n.content.to_lowercase();
|
||||
["error", "crash", "500", "failed", "panic", "regression", "bug"]
|
||||
.iter()
|
||||
.any(|m| hay.contains(m))
|
||||
})
|
||||
nodes
|
||||
.into_iter()
|
||||
.find(vestige_mcp::tools::backfill::looks_like_failure)
|
||||
})
|
||||
.map(|n| n.content),
|
||||
};
|
||||
|
|
|
|||
|
|
@ -91,8 +91,10 @@ fn extract_entities(node: &KnowledgeNode) -> Vec<String> {
|
|||
set.into_iter().collect()
|
||||
}
|
||||
|
||||
/// Heuristic: does this memory read like a failure/"aversive event"?
|
||||
fn looks_like_failure(node: &KnowledgeNode) -> bool {
|
||||
/// Heuristic: does this memory read like a failure/"aversive event"? Checks both
|
||||
/// content AND tags against the full FAILURE_MARKERS list. Public so the CLI and
|
||||
/// any caller share ONE failure-detection definition (no drifting subsets).
|
||||
pub fn looks_like_failure(node: &KnowledgeNode) -> bool {
|
||||
let hay = node.content.to_lowercase();
|
||||
vestige_core::advanced::retroactive_backfill::FAILURE_MARKERS
|
||||
.iter()
|
||||
|
|
|
|||
159
demo/README.md
Normal file
159
demo/README.md
Normal file
|
|
@ -0,0 +1,159 @@
|
|||
# Postdict — memory with hindsight
|
||||
|
||||
> Every other AI memory finds what your bug **looks like**.
|
||||
> This one finds what **caused** it — even when the cause was days ago
|
||||
> and looks nothing like the crash.
|
||||
|
||||
When your agent hits a failure **today**, Postdict reaches **backward in time**
|
||||
and surfaces the quiet earlier change that caused it — the root cause a vector /
|
||||
semantic search **structurally cannot** find, because it isn't *similar* to the
|
||||
error, only *causally upstream* of it.
|
||||
|
||||
It's a faithful port of a 2024 *Nature* neuroscience result: the brain links a
|
||||
later salient event to an earlier quiet memory **backward in time** ("fear links
|
||||
retrospectively, not prospectively"). Root cause is always in the past — so the
|
||||
backward-only direction isn't a metaphor, it's the correct behavior.
|
||||
|
||||
---
|
||||
|
||||
## ▶️ Run the demo (30 seconds, one command)
|
||||
|
||||
```sh
|
||||
# from the repo root
|
||||
cargo build -p vestige-mcp --bin vestige --release # first time only (~1 min)
|
||||
./demo/postdict-demo.sh
|
||||
```
|
||||
|
||||
That's it. It uses a **fresh throwaway database** (a temp dir, deleted on exit) —
|
||||
it touches nothing else on your machine. No API keys, no network, fully local.
|
||||
|
||||
**Pacing:** the script pauses ~1.4s between beats for a clean screen-recording.
|
||||
- `PAUSE=0 ./demo/postdict-demo.sh` — instant (no pauses)
|
||||
- `PAUSE=2 ./demo/postdict-demo.sh` — slower, more dramatic
|
||||
|
||||
---
|
||||
|
||||
## What you'll see
|
||||
|
||||
The demo plants three memories into an empty store, then asks one question:
|
||||
|
||||
| When | Memory | Note |
|
||||
|---|---|---|
|
||||
| **3 days ago** | `Set API_TIMEOUT=2 in the deploy env to speed up cold starts` | the quiet cause. boring. forgotten. |
|
||||
| **20 days ago** | `A 500 Internal Server Error happened in the billing service` | a lookalike — *resembles* today's crash |
|
||||
| **today** 💥 | `Service crashed: 500 Internal Server Error on the auth endpoint` | the failure |
|
||||
|
||||
Then it runs the same question through both engines, side by side:
|
||||
|
||||
```
|
||||
── 1. SIMILARITY SEARCH · keyword (BM25) ──
|
||||
1. A 500 Internal Server Error happened in the billing service ← top match
|
||||
→ ranked by RESEMBLANCE. its top hit is a lookalike, not the cause.
|
||||
|
||||
── 2. POSTDICT (reach backward for the CAUSE) ──
|
||||
#1 Set API_TIMEOUT=2 in the deploy env to speed up cold starts
|
||||
↩ reached back 3.0 days before the failure
|
||||
🔗 causal join: api_timeout
|
||||
✅ promoted — it will resurface next time
|
||||
```
|
||||
|
||||
**Similarity search confidently returns the lookalike.** It's wrong.
|
||||
**Postdict reaches back 3 days and finds the real cause** — by the shared
|
||||
`API_TIMEOUT` entity, backward in time. Then it promotes that memory so it
|
||||
stops decaying and surfaces next time.
|
||||
|
||||
> The label says exactly which engine ran (`keyword (BM25)` here; it becomes
|
||||
> `semantic (vector + BM25 hybrid)` once embeddings are generated). No
|
||||
> sleight of hand — it's the real search every other memory tool does.
|
||||
|
||||
---
|
||||
|
||||
## Try your own scenario (the "it's not staged" proof)
|
||||
|
||||
Nothing here is hardcoded. Build any history and ask:
|
||||
|
||||
```sh
|
||||
DB=$(mktemp -d)/db
|
||||
BIN=./target/release/vestige
|
||||
|
||||
# plant a quiet cause N days ago (--ago-days backdates it)
|
||||
$BIN --data-dir "$DB" ingest "Disabled the checkout cache while debugging" \
|
||||
--tags checkout --node-type decision --ago-days 4
|
||||
|
||||
# record a failure today that shares an entity (here: 'checkout')
|
||||
$BIN --data-dir "$DB" ingest "Checkout latency spiked to 9s after deploy" \
|
||||
--tags checkout,latency,regression --node-type event
|
||||
|
||||
# reach backward for the cause, with the similarity contrast
|
||||
$BIN --data-dir "$DB" backfill --contrast
|
||||
```
|
||||
|
||||
The cause and the failure share the `checkout` entity but are **not textually
|
||||
similar** — so semantic search misses it and Postdict finds it.
|
||||
|
||||
`vestige backfill --help` for all flags (`--manual`, `--lookback-days`,
|
||||
`--failure-id`, `--no-promote`).
|
||||
|
||||
---
|
||||
|
||||
## The honest boundary (read this — it's the point)
|
||||
|
||||
- **If the upstream change was never recorded, nothing can reach it.** Postdict
|
||||
reaches back through *memory*, not magic. No memory of the cause → no backfill.
|
||||
- **It links by shared entities** (same file / env var / service / symbol),
|
||||
backward in time — *not* by semantic similarity. That's deliberate: similarity
|
||||
is exactly the blind spot every other memory already has.
|
||||
- **It won't invent a cause.** No shared entity between the failure and an
|
||||
earlier memory → no link. It would rather say nothing than fabricate.
|
||||
- The "promote" step boosts the cause's retention so it resurfaces; it never
|
||||
deletes or rewrites anything (bi-temporal — old memories stay queryable).
|
||||
|
||||
---
|
||||
|
||||
## How it works (for the skeptics)
|
||||
|
||||
1. **Trigger.** A memory lands that reads like a failure (high surprise +
|
||||
failure markers like `error`/`crash`/`500`), or you mark one manually.
|
||||
2. **Backward reach.** Postdict scans memories *older* than the failure that
|
||||
share an entity with it (the causal join), within a lookback window.
|
||||
3. **Rank by cause, not resemblance.** Candidates are scored by shared-entity
|
||||
strength and *dissimilarity* — the less a candidate resembles the failure,
|
||||
the more valuable it is, because that's precisely what a vector search can't
|
||||
surface.
|
||||
4. **Promote.** The surfaced cause's FSRS retention is boosted so it stops
|
||||
decaying and is there next time.
|
||||
|
||||
The mechanism, tests, and the *Nature* citation live in
|
||||
[`crates/vestige-core/src/advanced/retroactive_backfill.rs`](../crates/vestige-core/src/advanced/retroactive_backfill.rs).
|
||||
The field itself admits this is unsolved: causal + temporal retrieval is
|
||||
"largely unexplored" (mem0, *State of AI Agent Memory 2026*), and frontier
|
||||
models fail at cloud root-cause analysis ([arXiv:2602.09937](https://arxiv.org/abs/2602.09937)).
|
||||
This is the first memory that does it.
|
||||
|
||||
---
|
||||
|
||||
## Recording the demo (for a clean clip)
|
||||
|
||||
1. Make your terminal large, dark theme, ~16pt mono font. Clear scrollback.
|
||||
2. macOS: `Cmd-Shift-5` → record a tight region around the terminal (or the
|
||||
whole window). QuickTime works too.
|
||||
3. Run `./demo/postdict-demo.sh` (default pacing) — or `PAUSE=2` for a slower,
|
||||
more cinematic take.
|
||||
4. The single "hold here" frame: when **POSTDICT** resolves the `#1` cause with
|
||||
`↩ reached back 3.0 days` — that's the money shot. Let it sit.
|
||||
5. Trim to ~30s. Muted + captions plays best in feeds.
|
||||
|
||||
**Caption / hook (ready to paste):**
|
||||
|
||||
> Your crash is today. The cause was an env-var edit 3 days ago — it looks
|
||||
> *nothing* like the error, so vector search will never surface it. Same query,
|
||||
> split screen: similarity search returns the lookalike, Postdict reaches back
|
||||
> and finds the cause. Seed's in the repo — run it yourself.
|
||||
|
||||
**Pinned honest-boundary reply:** *"Where it doesn't work: if the upstream change
|
||||
was never recorded, nothing can reach it. Everything in the clip is in the
|
||||
seeded repo."*
|
||||
|
||||
---
|
||||
|
||||
*Local-first. No API keys. No data leaves your machine.*
|
||||
Loading…
Add table
Add a link
Reference in a new issue