merge(#99): v2.2 12-tool consolidation + flagship backfill (13 advertised)

Tool Consolidation v2.2.0 (34->12 advertised surface: recall/memory/codebase/ intention/smart_ingest/source_sync/memory_status/maintain/dedup/graph/ session_start/suppress) onto the retroactive-salience-backfill base. Conflict resolution: - server.rs: took #99's canonical 12-tool surface, then re-added as a 13th ADVERTISED tool (ToolDescription + dispatch arm + test). Backfill is the flagship 'Memory with hindsight' (Cai 2024 Nature) cognitive primitive — a distinct capability, not a maintenance op to fold into `maintain` or hide. - Cargo.toml: kept usearch features=["fp16lib"] with the fuller MSVC-C1021 explanation comment (Windows build fix, #71). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-02 22:01:01 +02:00 · 2026-06-29 15:03:47 -05:00 · 2026-06-29 15:03:47 -05:00 · 29658d24f6
commit 29658d24f6
parent 459eb4b79f ec8bf27255
11 changed files with 1974 additions and 1248 deletions
--- a/docs/launch/tool-consolidation-v2.2.0.md
+++ b/docs/launch/tool-consolidation-v2.2.0.md
@ -0,0 +1,135 @@
+# Tool Consolidation v2.2.0
+
+> Reduce the Vestige MCP tool surface so an agent can reliably pick the right
+> tool, then make the few always-on tools deterministic. Two layers: Layer 1
+> (this release) collapses 34 advertised tools to 12; Layer 2 (follow-up) shrinks
+> the *default* surface and enforces the memory loop with hooks.
+
+## Why (frontier evidence)
+
+More advertised tools actively degrade tool selection — the 30 tools an agent
+ignores make the 5 it uses harder to choose:
+
+- **RAG-MCP** (arXiv 2505.03275): selection accuracy collapses 43% → 14% when the
+  full tool catalog is dumped into context; stays >90% under ~30 tools.
+- **Anthropic tool-deferral**: deferring tool schemas moved Opus 4 from 49% → 74%
+  on a tool-heavy benchmark.
+- **GitHub Copilot**: 40 → 13 tools gave +2–5pp accuracy and −400ms latency.
+- **OpenAI** guidance: aim for <20 functions visible at the start of a turn.
+- **RoTBench** (2401.08326): tool *names* are load-bearing — renaming drops GPT-4
+  80 → 58. So renames are deliberate and every old name keeps working.
+
+Vestige had **34** advertised tools. This is the correction.
+
+## Layer 1 — Count reduction (THIS RELEASE): 34 → 12 advertised
+
+Principle: **one consolidation per commit, one change per submission.** Each
+consolidation is its own commit, landed in a safe order with the hot retrieval
+path touched last. Every old tool name remains a hidden `warn!` + redirect alias
+for at least one minor release (so existing `.mcp.json` configs, hooks, and agent
+habits keep working) and is removed in **v2.3.0**.
+
+### Safe order (as committed)
+
+| # | Commit | Folds | Into | Count |
+|---|--------|-------|------|------:|
+| 1 | `dedup` | find_duplicates + merge_candidates + plan_merge + plan_supersede + apply_plan + merge_undo + protect + merge_policy (8) | `dedup` | 34 → 27 |
+| 2 | `session_start` | session_context (rename) | `session_start` | 27 |
+| 3a | `memory_status` | system_status + memory_health + memory_timeline + memory_changelog (4) | `memory_status` | 27 → 24 |
+| 3b | `graph` | explore_connections + predict + memory_graph + composed_graph (4) | `graph` | 24 → 21 |
+| 4 | `maintain` | consolidate + dream + gc + importance_score + backup + export + restore (7) | `maintain` | 21 → 15 |
+| 5 | `recall` | search + deep_reference + cross_reference + contradictions (4) | `recall` | 15 → 12 |
+
+`recall` is committed **last** because it is the hot path.
+
+### Final advertised surface (12)
+
+| Standalone (6) | Consolidated (6) |
+|---|---|
+| `smart_ingest` | `recall` |
+| `memory` | `dedup` |
+| `codebase` | `memory_status` |
+| `intention` | `graph` |
+| `source_sync` | `maintain` |
+| `suppress` | `session_start` |
+
+### Action / mode / view maps
+
+- **`recall`** — `mode`: `lookup` (default) · `reason` · `contradictions`
+- **`dedup`** — `action`: `scan` (default) · `plan_merge` · `plan_supersede` · `apply` · `undo` · `protect` · `policy`
+- **`memory_status`** — `view`: `health` (default) · `retention` · `timeline` · `changelog`
+- **`graph`** — `action`: `chain` · `associations` · `bridges` · `predict` · `memory_graph` · `recent` · `get` · `memory` · `neighbors` · `never_composed` · `bounty_mode` · `label`
+- **`maintain`** — `action`: `consolidate` · `dream` · `gc` · `importance_score` · `backup` · `export` · `restore`
+
+### Resolved design decisions
+
+- **`search` is folded, not kept standalone.** `recall` with no `mode` (the
+  default) *is* search — a zero-overhead pass-through to `search_unified`. Keeping
+  both `search` and `recall` advertised would be the exact RAG-MCP anti-pattern.
+  Final count is a clean **12**, leaving 2 slots of headroom toward a future
+  always-on `save` surface rather than spending them on a redundant verb.
+- **`graph` actions are flat peers, not nested.** `explore`'s `chain` /
+  `associations` / `bridges` sit alongside `predict` / `memory_graph` /
+  `composed_graph` actions in a single `action` enum — matching the existing
+  `memory` / `codebase` flat-action convention and avoiding a translation layer.
+
+### Invariants preserved (with the test that proves each)
+
+- **bitemporal-never-delete** (`dedup`): plan → apply → undo, confirm-gating, and
+  invalidation-not-deletion delegate to `merge::execute` verbatim.
+- **`system_status` response shape** (`memory_status` view=`health`): byte-for-byte
+  — `test_default_view_is_health`.
+- **`gc` dry-run default** + **`restore` path-confinement** (`maintain`):
+  `test_maintain_actions_and_safety`.
+- **`recall` lookup = search, no reasoning cost** (hot path):
+  `test_recall_lookup_matches_search_shape`.
+- **Dashboard events** (consolidate/dream/importance_score Started + Completed,
+  SearchPerformed): preserved by re-emitting in the new dispatch arms and by
+  `emit_tool_event` normalizing the unified tool name to its effective sub-action.
+
+### Result-size annotations (moved with their tools)
+
+`memory_timeline` (200k) → `memory_status`; `search` (300k) → `recall`; new
+`dedup` 150k and `graph` 250k. Kept in sync across the annotation loop, the
+`expected_max_result_size` helper, and both annotation guard tests.
+
+### Deprecation timeline
+
+Aliases `warn!` in v2.2.x and are hard-removed in **v2.3.0**. Full alias list (31
+names) lives in the dispatch redirects in `crates/vestige-mcp/src/server.rs`.
+
+## Layer 2 — Default-surface + hooks (FOLLOW-UP, NOT in v2.2.0)
+
+Count reduction is necessary but not sufficient: what matters most is how few
+tools are visible *at the start of a turn*, plus making the memory loop fire
+deterministically instead of hoping the model remembers.
+
+- **Tiny always-on surface (~3)**: `recall` @ session start, `save` (=`smart_ingest`)
+  @ session end, `recall` on-demand for facts. Everything else (`dedup`, `graph`,
+  `maintain`, `memory_status`, …) deferred off the default surface, loaded on
+  demand.
+- **Deterministic hooks**: a `SessionStart` hook fires `recall`; a `Stop` hook
+  fires `save` (async, fire-and-forget — synchronous heavy work in `Stop` causes
+  loops + per-turn lag). "If the model fails to save, it's gone" — move save out
+  of the model hot loop.
+- This is what turns 12-advertised into ~3-default. Status: **design guidance
+  only; no code in v2.2.0.**
+
+## Verification
+
+Per-commit gates (all green for every commit):
+
+```sh
+cargo test --workspace --no-fail-fast
+cargo clippy --workspace -- -D warnings
+```
+
+Release gates before tagging v2.2.0:
+
+```sh
+pnpm --filter @vestige/dashboard check
+pnpm --filter @vestige/dashboard build
+```
+
+Plus a `tools/list` smoke check asserting exactly **12** advertised names
+(`test_tools_list_returns_all_tools`).