Adds docs/launch/tool-consolidation-v2.2.0.md — the single sequenced plan that
reconciles the two prior planning notes:
- Layer 1 (this PR): 34 → 12 advertised tools, safe commit order, alias policy,
preserved invariants, and the test that proves each.
- Layer 2 (follow-up): tiny always-on default surface + SessionStart/Stop hooks.
Also refreshes stale in-code comments to match the consolidated surface:
- server.rs handle_tools_list header (was "v2.1.21: 25 tools") and the
size-annotation rationale (now lists recall/memory_status/dedup/graph).
- tools/mod.rs module doc (the facade vs. granular-handler relationship).
No behavior change. Gates: cargo test --workspace, cargo clippy -D warnings,
pnpm dashboard check + build — all green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tool Consolidation v2.2.0
Reduce the Vestige MCP tool surface so an agent can reliably pick the right tool, then make the few always-on tools deterministic. Two layers: Layer 1 (this release) collapses 34 advertised tools to 12; Layer 2 (follow-up) shrinks the default surface and enforces the memory loop with hooks.
Why (frontier evidence)
More advertised tools actively degrade tool selection — the 30 tools an agent ignores make the 5 it uses harder to choose:
- RAG-MCP (arXiv 2505.03275): selection accuracy collapses 43% → 14% when the full tool catalog is dumped into context; stays >90% under ~30 tools.
- Anthropic tool-deferral: deferring tool schemas moved Opus 4 from 49% → 74% on a tool-heavy benchmark.
- GitHub Copilot: 40 → 13 tools gave +2–5pp accuracy and −400ms latency.
- OpenAI guidance: aim for <20 functions visible at the start of a turn.
- RoTBench (2401.08326): tool names are load-bearing — renaming drops GPT-4 80 → 58. So renames are deliberate and every old name keeps working.
Vestige had 34 advertised tools. This is the correction.
Layer 1 — Count reduction (THIS RELEASE): 34 → 12 advertised
Principle: one consolidation per commit, one change per submission. Each
consolidation is its own commit, landed in a safe order with the hot retrieval
path touched last. Every old tool name remains a hidden warn! + redirect alias
for at least one minor release (so existing .mcp.json configs, hooks, and agent
habits keep working) and is removed in v2.3.0.
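A minimal sketch of the alias policy, assuming a redirect table keyed by legacy name. The function names `resolve_alias` and `route` are hypothetical, only four of the aliases are shown, and the real dispatch in server.rs may be organized differently. The deprecation message is returned rather than logged so the sketch stays dependency-free; the actual server emits it through `warn!`.

```rust
/// Hypothetical redirect table: legacy tool name -> consolidated tool name.
/// Only a handful of the aliases are listed here for illustration.
fn resolve_alias(name: &str) -> Option<&'static str> {
    match name {
        "find_duplicates" => Some("dedup"),
        "session_context" => Some("session_start"),
        "system_status" => Some("memory_status"),
        "search" => Some("recall"),
        _ => None,
    }
}

/// Returns the tool to dispatch plus an optional deprecation message.
/// A current tool name passes through unchanged with no message.
fn route(name: &str) -> (String, Option<String>) {
    match resolve_alias(name) {
        Some(tool) => (
            tool.to_string(),
            Some(format!(
                "tool `{name}` is deprecated; use `{tool}` (alias removed in v2.3.0)"
            )),
        ),
        None => (name.to_string(), None),
    }
}
```

Under this shape, an old .mcp.json entry that still calls search is routed to recall and produces one warning, while calls to recall itself are untouched.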
Safe order (as committed)
| # | Commit | Folds | Into | Count |
|---|---|---|---|---|
| 1 | dedup | find_duplicates + merge_candidates + plan_merge + plan_supersede + apply_plan + merge_undo + protect + merge_policy (8) | dedup | 34 → 27 |
| 2 | session_start | session_context (rename) | session_start | 27 |
| 3a | memory_status | system_status + memory_health + memory_timeline + memory_changelog (4) | memory_status | 27 → 24 |
| 3b | graph | explore_connections + predict + memory_graph + composed_graph (4) | graph | 24 → 21 |
| 4 | maintain | consolidate + dream + gc + importance_score + backup + export + restore (7) | maintain | 21 → 15 |
| 5 | recall | search + deep_reference + cross_reference + contradictions (4) | recall | 15 → 12 |
recall is committed last because it is the hot path.
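The running count can be checked mechanically: folding n advertised tools into one name removes n - 1 names, and a pure rename removes none. The helper below is a hypothetical sketch for that arithmetic only, not code from the repository.

```rust
/// Each entry in `folds` is how many advertised tools a commit folds into one name.
/// Folding n tools removes n - 1 names; a rename (n = 1) removes none.
fn advertised_counts(start: usize, folds: &[usize]) -> Vec<usize> {
    let mut counts = vec![start];
    let mut current = start;
    for &n in folds {
        current -= n.saturating_sub(1);
        counts.push(current);
    }
    counts
}
```

Calling `advertised_counts(34, &[8, 1, 4, 4, 7, 4])` walks the six commits in the committed order and ends at 12.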
Final advertised surface (12)
| Standalone (6) | Consolidated (6) |
|---|---|
| smart_ingest | recall |
| memory | dedup |
| codebase | memory_status |
| intention | graph |
| source_sync | maintain |
| suppress | session_start |
Action / mode / view maps
- recall (mode): lookup (default) · reason · contradictions
- dedup (action): scan (default) · plan_merge · plan_supersede · apply · undo · protect · policy
- memory_status (view): health (default) · retention · timeline · changelog
- graph (action): chain · associations · bridges · predict · memory_graph · recent · get · memory · neighbors · never_composed · bounty_mode · label
- maintain (action): consolidate · dream · gc · importance_score · backup · export · restore
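One way to model a map like these is a flat enum with a default variant. The sketch below covers only recall's mode; the type name `RecallMode` and the helper `parse_mode` are assumptions made for illustration, and the actual handler may parse its arguments differently.

```rust
use std::str::FromStr;

/// Hypothetical model of recall's `mode` argument; Lookup is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum RecallMode {
    #[default]
    Lookup,
    Reason,
    Contradictions,
}

impl FromStr for RecallMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "lookup" => Ok(Self::Lookup),
            "reason" => Ok(Self::Reason),
            "contradictions" => Ok(Self::Contradictions),
            other => Err(format!("unknown recall mode: {other}")),
        }
    }
}

/// A missing `mode` falls back to the default; an unknown value is an error.
fn parse_mode(arg: Option<&str>) -> Result<RecallMode, String> {
    match arg {
        None => Ok(RecallMode::default()),
        Some(s) => s.parse(),
    }
}
```

Rejecting unknown values instead of silently falling back keeps a typo in mode from being mistaken for a plain lookup.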
Resolved design decisions
- search is folded, not kept standalone. recall with no mode (the default) is search: a zero-overhead pass-through to search_unified. Keeping both search and recall advertised would be the exact RAG-MCP anti-pattern. Final count is a clean 12, leaving 2 slots of headroom toward a future always-on save surface rather than spending them on a redundant verb.
- graph actions are flat peers, not nested. explore's chain/associations/bridges sit alongside predict/memory_graph/composed_graph actions in a single action enum, matching the existing memory/codebase flat-action convention and avoiding a translation layer.
Invariants preserved (with the test that proves each)
- bitemporal-never-delete (dedup): plan → apply → undo, confirm-gating, and invalidation-not-deletion delegate to merge::execute verbatim.
- system_status response shape (memory_status view=health): byte-for-byte, proven by test_default_view_is_health.
- gc dry-run default + restore path-confinement (maintain): test_maintain_actions_and_safety.
- recall lookup = search, no reasoning cost (hot path): test_recall_lookup_matches_search_shape.
- Dashboard events (consolidate/dream/importance_score Started + Completed, SearchPerformed): preserved by re-emitting in the new dispatch arms and by emit_tool_event normalizing the unified tool name to its effective sub-action.
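The normalization in the last invariant could look roughly like the following. This is a sketch under stated assumptions: the function name `effective_event_name` and its signature are hypothetical, and the real emit_tool_event may carry more context than a tool name and an optional sub-action.

```rust
/// Hypothetical normalizer: maps a unified tool call back to the name the
/// dashboard events were keyed on before consolidation.
fn effective_event_name<'a>(tool: &'a str, action: Option<&'a str>) -> &'a str {
    match (tool, action) {
        // maintain's sub-actions (consolidate, dream, ...) were standalone tools.
        ("maintain", Some(a)) => a,
        // recall with no mode, or mode=lookup, is plain search.
        ("recall", None) | ("recall", Some("lookup")) => "search",
        // Anything else keeps its own name.
        _ => tool,
    }
}
```

With a mapping of this kind, a maintain call with action dream is still reported under the name dream, so consumers of the Started and Completed events do not need to learn the unified name.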
Result-size annotations (moved with their tools)
memory_timeline (200k) → memory_status; search (300k) → recall; new
dedup 150k and graph 250k. Kept in sync across the annotation loop, the
expected_max_result_size helper, and both annotation guard tests.
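A sketch of what the helper's lookup might contain after the move. The signature is an assumption (the source names the helper but does not show it), and "200k" is read here as the plain number 200,000 without assuming a unit.

```rust
/// Assumed shape of the size-annotation lookup; only the four values named
/// in this document are listed.
fn expected_max_result_size(tool: &str) -> Option<usize> {
    match tool {
        "memory_status" => Some(200_000), // moved from memory_timeline
        "recall" => Some(300_000),        // moved from search
        "dedup" => Some(150_000),         // new in v2.2.0
        "graph" => Some(250_000),         // new in v2.2.0
        _ => None,
    }
}
```

A guard test can then assert that every annotated tool in the tools/list output has an entry here, and that names no longer advertised, such as search, have none.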
Deprecation timeline
Aliases warn! in v2.2.x and are hard-removed in v2.3.0. Full alias list (31
names) lives in the dispatch redirects in crates/vestige-mcp/src/server.rs.
Layer 2 — Default-surface + hooks (FOLLOW-UP, NOT in v2.2.0)
Count reduction is necessary but not sufficient: what matters most is how few tools are visible at the start of a turn, plus making the memory loop fire deterministically instead of hoping the model remembers.
- Tiny always-on surface (~3): recall @ session start, save (= smart_ingest) @ session end, recall on-demand for facts. Everything else (dedup, graph, maintain, memory_status, …) deferred off the default surface, loaded on demand.
- Deterministic hooks: a SessionStart hook fires recall; a Stop hook fires save (async, fire-and-forget; synchronous heavy work in Stop causes loops + per-turn lag). "If the model fails to save, it's gone": move save out of the model hot loop.
- This is what turns 12-advertised into ~3-default. Status: design guidance only; no code in v2.2.0.
Verification
Per-commit gates (all green for every commit):
cargo test --workspace --no-fail-fast
cargo clippy --workspace -- -D warnings
Release gates before tagging v2.2.0:
pnpm --filter @vestige/dashboard check
pnpm --filter @vestige/dashboard build
Plus a tools/list smoke check asserting exactly 12 advertised names
(test_tools_list_returns_all_tools).
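The smoke check could be sketched as below. The twelve names come from the final advertised surface in this document; `check_surface` and the sample of legacy names are hypothetical and do not reproduce test_tools_list_returns_all_tools.

```rust
use std::collections::HashSet;

/// The 12 advertised names from the final surface.
const ADVERTISED: [&str; 12] = [
    "smart_ingest", "memory", "codebase", "intention", "source_sync", "suppress",
    "recall", "dedup", "memory_status", "graph", "maintain", "session_start",
];

/// A few legacy names that must stay hidden (not the full alias list).
const LEGACY_SAMPLE: [&str; 4] = ["search", "session_context", "system_status", "gc"];

/// Checks that a tools/list result has exactly 12 unique names and no legacy alias.
fn check_surface(names: &[&str]) -> Result<(), String> {
    if names.len() != 12 {
        return Err(format!("expected 12 advertised tools, got {}", names.len()));
    }
    let unique: HashSet<&str> = names.iter().copied().collect();
    if unique.len() != names.len() {
        return Err("duplicate tool name in tools/list".to_string());
    }
    for legacy in LEGACY_SAMPLE.iter() {
        if unique.contains(legacy) {
            return Err(format!("legacy alias `{legacy}` is advertised"));
        }
    }
    Ok(())
}
```

Checking for leaked aliases as well as the count catches the case where a hidden redirect is accidentally re-advertised while another tool is dropped, which a bare length check would miss.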