mirror of https://github.com/samvallad33/vestige.git synced 2026-07-02 22:01:01 +02:00

Sam Valladares ec8bf27255 docs(mcp): add reconciled two-layer tool-consolidation plan; refresh stale comments

Adds docs/launch/tool-consolidation-v2.2.0.md — the single sequenced plan that
reconciles the two prior planning notes:
  - Layer 1 (this PR): 34 → 12 advertised tools, safe commit order, alias policy,
    preserved invariants, and the test that proves each.
  - Layer 2 (follow-up): tiny always-on default surface + SessionStart/Stop hooks.

Also refreshes stale in-code comments to match the consolidated surface:
  - server.rs handle_tools_list header (was "v2.1.21: 25 tools") and the
    size-annotation rationale (now lists recall/memory_status/dedup/graph).
  - tools/mod.rs module doc (the facade vs. granular-handler relationship).

No behavior change. Gates: cargo test --workspace, cargo clippy -D warnings,
pnpm dashboard check + build — all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-28 18:16:16 -05:00

6.5 KiB

Raw Blame History

Tool Consolidation v2.2.0

Reduce the Vestige MCP tool surface so an agent can reliably pick the right tool, then make the few always-on tools deterministic. Two layers: Layer 1 (this release) collapses 34 advertised tools to 12; Layer 2 (follow-up) shrinks the default surface and enforces the memory loop with hooks.

Why (frontier evidence)

More advertised tools actively degrade tool selection — the 30 tools an agent ignores make the 5 it uses harder to choose:

RAG-MCP (arXiv 2505.03275): selection accuracy collapses 43% → 14% when the full tool catalog is dumped into context; stays >90% under ~30 tools.
Anthropic tool-deferral: deferring tool schemas moved Opus 4 from 49% → 74% on a tool-heavy benchmark.
GitHub Copilot: 40 → 13 tools gave +2–5pp accuracy and −400ms latency.
OpenAI guidance: aim for <20 functions visible at the start of a turn.
RoTBench (2401.08326): tool names are load-bearing — renaming drops GPT-4 80 → 58. So renames are deliberate and every old name keeps working.

Vestige had 34 advertised tools. This is the correction.

Layer 1 — Count reduction (THIS RELEASE): 34 → 12 advertised

Principle: one consolidation per commit, one change per submission. Each consolidation is its own commit, landed in a safe order with the hot retrieval path touched last. Every old tool name remains a hidden warn! + redirect alias for at least one minor release (so existing .mcp.json configs, hooks, and agent habits keep working) and is removed in v2.3.0.

Safe order (as committed)

#	Commit	Folds	Into	Count
1	`dedup`	find_duplicates + merge_candidates + plan_merge + plan_supersede + apply_plan + merge_undo + protect + merge_policy (8)	`dedup`	34 → 27
2	`session_start`	session_context (rename)	`session_start`	27
3a	`memory_status`	system_status + memory_health + memory_timeline + memory_changelog (4)	`memory_status`	27 → 24
3b	`graph`	explore_connections + predict + memory_graph + composed_graph (4)	`graph`	24 → 21
4	`maintain`	consolidate + dream + gc + importance_score + backup + export + restore (7)	`maintain`	21 → 15
5	`recall`	search + deep_reference + cross_reference + contradictions (4)	`recall`	15 → 12

recall is committed last because it is the hot path.

Final advertised surface (12)

Standalone (6)	Consolidated (6)
`smart_ingest`	`recall`
`memory`	`dedup`
`codebase`	`memory_status`
`intention`	`graph`
`source_sync`	`maintain`
`suppress`	`session_start`

Action / mode / view maps

recall — mode: lookup (default) · reason · contradictions
dedup — action: scan (default) · plan_merge · plan_supersede · apply · undo · protect · policy
memory_status — view: health (default) · retention · timeline · changelog
graph — action: chain · associations · bridges · predict · memory_graph · recent · get · memory · neighbors · never_composed · bounty_mode · label
maintain — action: consolidate · dream · gc · importance_score · backup · export · restore

Resolved design decisions

search is folded, not kept standalone. recall with no mode (the default) is search — a zero-overhead pass-through to search_unified. Keeping both search and recall advertised would be the exact RAG-MCP anti-pattern. Final count is a clean 12, leaving 2 slots of headroom toward a future always-on save surface rather than spending them on a redundant verb.
graph actions are flat peers, not nested. explore's chain / associations / bridges sit alongside predict / memory_graph / composed_graph actions in a single action enum — matching the existing memory / codebase flat-action convention and avoiding a translation layer.

Invariants preserved (with the test that proves each)

bitemporal-never-delete (dedup): plan → apply → undo, confirm-gating, and invalidation-not-deletion delegate to merge::execute verbatim.
system_status response shape (memory_status view=health): byte-for-byte — test_default_view_is_health.
gc dry-run default + restore path-confinement (maintain): test_maintain_actions_and_safety.
recall lookup = search, no reasoning cost (hot path): test_recall_lookup_matches_search_shape.
Dashboard events (consolidate/dream/importance_score Started + Completed, SearchPerformed): preserved by re-emitting in the new dispatch arms and by emit_tool_event normalizing the unified tool name to its effective sub-action.

Result-size annotations (moved with their tools)

memory_timeline (200k) → memory_status; search (300k) → recall; new dedup 150k and graph 250k. Kept in sync across the annotation loop, the expected_max_result_size helper, and both annotation guard tests.

Deprecation timeline

Aliases warn! in v2.2.x and are hard-removed in v2.3.0. Full alias list (31 names) lives in the dispatch redirects in crates/vestige-mcp/src/server.rs.

Layer 2 — Default-surface + hooks (FOLLOW-UP, NOT in v2.2.0)

Count reduction is necessary but not sufficient: what matters most is how few tools are visible at the start of a turn, plus making the memory loop fire deterministically instead of hoping the model remembers.

Tiny always-on surface (~3): recall @ session start, save (=smart_ingest) @ session end, recall on-demand for facts. Everything else (dedup, graph, maintain, memory_status, …) deferred off the default surface, loaded on demand.
Deterministic hooks: a SessionStart hook fires recall; a Stop hook fires save (async, fire-and-forget — synchronous heavy work in Stop causes loops + per-turn lag). "If the model fails to save, it's gone" — move save out of the model hot loop.
This is what turns 12-advertised into ~3-default. Status: design guidance only; no code in v2.2.0.

Verification

Per-commit gates (all green for every commit):

cargo test --workspace --no-fail-fast
cargo clippy --workspace -- -D warnings

Release gates before tagging v2.2.0:

pnpm --filter @vestige/dashboard check
pnpm --filter @vestige/dashboard build

Plus a tools/list smoke check asserting exactly 12 advertised names (test_tools_list_returns_all_tools).

6.5 KiB Raw Blame History Unescape Escape