From 5148f5dfefb32ab462e278b917db448c72eb964b Mon Sep 17 00:00:00 2001
From: CREDO23
Date: Thu, 30 Apr 2026 12:05:49 +0200
Subject: [PATCH] Add a prompt tuning playbook for multi-agent quality.

---
 .../PROMPT_TUNING_PLAYBOOK.md | 159 ++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 surfsense_backend/app/agents/multi_agent_chat/PROMPT_TUNING_PLAYBOOK.md

diff --git a/surfsense_backend/app/agents/multi_agent_chat/PROMPT_TUNING_PLAYBOOK.md b/surfsense_backend/app/agents/multi_agent_chat/PROMPT_TUNING_PLAYBOOK.md
new file mode 100644
index 000000000..db8e0b495
--- /dev/null
+++ b/surfsense_backend/app/agents/multi_agent_chat/PROMPT_TUNING_PLAYBOOK.md
@@ -0,0 +1,159 @@
+# Multi-Agent Prompt Tuning Playbook
+
+This playbook defines how to tune `multi_agent_chat` prompts to achieve better delegation quality, less confusion, and more stable tool behavior than `new_chat`.
+
+It is intentionally architecture-aware: this system is a **supervisor + expert tools** pattern, not a single flat tool-calling agent.
+
+## Why this matters in our architecture
+
+- The supervisor only sees **routing tools** (e.g. `research`, `gmail`, `calendar`), not low-level connector APIs.
+- Experts are invoked through `routing/from_domain_agents.py` and receive a single natural-language task via `compose_child_task(...)` (see the sketch after this list).
+- Because expert context is compact and delegated, prompt quality is the primary control lever for routing accuracy and downstream tool behavior.
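+
+To make that delegation path concrete, here is a minimal sketch of a routing tool in the spirit of `routing/supervisor_routing.py`. Apart from the `compose_child_task` name, everything below is an illustrative assumption (its signature, the `_gmail_expert` stand-in, the docstring wording), not the actual implementation:
+
+```python
+# Minimal sketch, not the real surfsense_backend code: the
+# compose_child_task signature and the expert stand-in are assumptions.
+from langchain_core.tools import tool
+
+
+def compose_child_task(task: str, shared_context: str = "") -> str:
+    # Assumed shape: fold the supervisor's task plus only the context the
+    # expert actually needs into one natural-language instruction.
+    if shared_context:
+        return f"{task}\n\nRelevant context:\n{shared_context}"
+    return task
+
+
+def _gmail_expert(child_task: str) -> str:
+    # Stand-in for the delegated expert agent. It sees only `child_task`,
+    # which is why the wording of that task carries so much weight.
+    return f"[gmail expert would execute]: {child_task}"
+
+
+@tool
+def gmail(task: str) -> str:
+    """Email tasks only: search, read, draft, or send Gmail messages.
+
+    Do not use for scheduling (use `calendar`) or open-ended research
+    (use `research`). The task must name the recipient, the intent, and
+    any constraints (tone, deadline, attachments).
+    """
+    return _gmail_expert(compose_child_task(task))
+```
+
+Note how the `@tool` docstring doubles as the description the supervisor routes on: tuning that text is tuning routing.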
+
+## Authoritative guidance we should follow
+
+- Anthropic prompt engineering best practices (role clarity, XML structure, explicit tool-use policy, few-shot examples): [Anthropic docs](https://docs.anthropic.com/en/docs/use-xml-tags)
+- OpenAI function-calling reliability guidance (clear tool descriptions, when/when-not tool usage, small callable surface): [OpenAI function calling guide](https://developers.openai.com/docs/guides/function-calling)
+- OpenAI prompt engineering (instruction hierarchy and explicit output contracts): [OpenAI prompt engineering guide](https://developers.openai.com/api/docs/guides/prompt-engineering)
+- LangChain supervisor/subagents guidance (clear subagent names/descriptions, context engineering, routing intent): [LangChain supervisor docs](https://docs.langchain.com/oss/python/langchain/supervisor), [LangChain subagents docs](https://docs.langchain.com/oss/python/langchain/multi-agent/subagents)
+
+## Current weakness audit (as of now)
+
+- `supervisor/supervisor_prompt.md` is short and does not define a decision policy for ambiguous or multi-domain tasks.
+- Most expert `domain_prompt.md` files are one-line role statements with no:
+  - scope boundaries and refusal policy,
+  - parameter-resolution behavior,
+  - completion criteria (what must be returned),
+  - failure-handling rules,
+  - concrete examples.
+- Tool descriptions in routing are generic ("Pass a clear natural-language task"), which weakens handoff quality.
+
+## Prompt design standards (required)
+
+Apply these standards to the supervisor prompt and to every expert prompt.
+
+1. **Role + objective first**
+   - One sentence for identity.
+   - One sentence for the success criterion.
+
+2. **Explicit routing/usage rules**
+   - Tell the model when to use this agent/tool.
+   - Tell it when not to use it.
+   - Include an ambiguity fallback ("ask one clarifying question" or "apply a conservative default").
+
+3. **Structured output contract**
+   - Require concise but complete execution reports.
+   - Require IDs/links/timestamps when tool outputs produce them.
+   - For no-op paths, explain why no action was taken.
+
+4. **Safety + reliability contract**
+   - Never fabricate tool results.
+   - Never claim an action happened if no tool call succeeded.
+   - Surface irreversible or risky actions clearly.
+
+5. **Few-shot examples**
+   - Include 2-4 minimal examples per domain:
+     - direct success,
+     - ambiguous input,
+     - out-of-scope reroute.
+
+6. **Concise formatting rules**
+   - Avoid verbosity.
+   - A stable response structure improves orchestration and observability.
+
+## Supervisor prompt blueprint
+
+The supervisor prompt should contain these sections in order:
+
+1. `Role`
+2. `Available experts` (name + scope + non-scope)
+3. `Delegation policy`
+   - single-domain -> one expert
+   - multi-domain -> sequence experts, or run them in parallel where independent
+   - no expert needed -> answer directly
+4. `Task-writing policy` for delegated calls
+   - include the user goal, constraints, and success criteria
+   - include only the context the expert needs
+5. `Result synthesis policy`
+   - merge expert outputs into one user-facing response
+   - preserve concrete identifiers from expert outputs
+6. `Failure policy`
+   - retry on recoverable failures
+   - ask a clarifying question when a required field is missing
+
+## Expert prompt blueprint (per domain)
+
+Each `domain_prompt.md` should include:
+
+1. `Role and scope`
+2. `In-scope actions` (mapped to the exact provided tools)
+3. `Out-of-scope behavior` (what to return for a reroute)
+4. `Execution rules`
+   - choose the minimum tool sequence that satisfies the request
+   - do not guess IDs or parameters
+   - ask a concise clarifying question only when necessary
+5. `Output contract`
+   - action summary
+   - concrete artifacts/IDs/links generated
+   - unresolved items and the next step
+6. `Examples` (2-4 short, realistic examples)
+
+## Domain-specific tuning checklist
+
+- `research`: enforce source-grounded summaries and explicit uncertainty.
+- `memory`: strict save criteria (durable preferences/facts only) and a secret-handling policy.
+- `deliverables`: require references to output artifacts and an echo of the stated constraints.
+- `gmail` / `calendar`: require a recipient/date-time disambiguation policy and explicit timezone handling.
+- docs connectors (`notion`, `confluence`, `drive`, `dropbox`, `onedrive`): require exact page/file target resolution before any mutating action.
+- chat connectors (`discord`, `teams`, `slack`): require clear channel/thread context before any send action.
+- MCP experts: require strict adherence to tool descriptions and no assumptions about unavailable endpoints.
+
+## Tool description tuning rules (routing layer)
+
+Routing tool descriptions should include:
+
+- best-fit task types,
+- disallowed task types,
+- required task payload hints (e.g. "include recipient + intent + constraints"),
+- expected result shape.
+
+This is especially important because the supervisor's tool choice is heavily influenced by `name + description`.
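+
+As a before/after illustration of these rules (the strings are invented for this playbook; the actual registration mechanism in `routing/supervisor_routing.py` may differ):
+
+```python
+# Illustrative only: two candidate description strings for one routing
+# tool. The strong version encodes all four elements listed above.
+WEAK_DESCRIPTION = "Pass a clear natural-language task to the calendar expert."
+
+STRONG_DESCRIPTION = """Create, move, cancel, or look up calendar events.
+
+Use for: availability checks and event management.
+Do not use for: sending email (use `gmail`) or research (use `research`).
+Task payload must include: attendees, date/time with timezone, duration,
+and intent.
+Returns: a short report with event IDs/links, or the missing field that
+blocked the action."""
+```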
+
+## Evaluation plan (before wiring to production)
+
+Create a prompt eval set with at least 20 tasks (a minimal fixture sketch appears in the appendix below):
+
+- 8 single-domain tasks,
+- 6 ambiguous tasks (should clarify or route conservatively),
+- 6 multi-domain tasks (should sequence experts correctly).
+
+Track:
+
+- routing accuracy,
+- unnecessary delegation rate,
+- tool-call success rate,
+- clarification precision (ask only when needed),
+- final answer completeness.
+
+Run the same test set against:
+
+- current prompts,
+- tuned prompts v1,
+- tuned prompts v2.
+
+Promote only when v2 improves routing accuracy and reduces unnecessary delegation with no regression in tool-call success rate.
+
+## Immediate implementation plan
+
+1. Rewrite `supervisor/supervisor_prompt.md` using the supervisor blueprint.
+2. Rewrite all expert `domain_prompt.md` files with the expert blueprint.
+3. Upgrade the routing tool descriptions in `routing/supervisor_routing.py` to add "when to use / when not to use" guidance.
+4. Add a lightweight prompt eval script or fixture set for reproducible tuning.
+
+## Definition of done
+
+- Every supervisor/expert prompt has an explicit scope, failure policy, and output contract.
+- Every route description encodes clear decision boundaries.
+- The prompt eval shows measurable gains in routing accuracy and a lower unnecessary-delegation rate.
+- The team can iterate on prompt versions without changing core orchestration code.
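+
+## Appendix: eval fixture sketch
+
+A minimal sketch of the fixture set referenced in the evaluation plan. It assumes a `route_fn(prompt)` hook that exposes the supervisor's routing decision and whether it asked a clarifying question; the names and shapes here are assumptions for illustration, not existing project APIs:
+
+```python
+# Illustrative eval fixture: plain Python, no project dependencies.
+from dataclasses import dataclass
+
+
+@dataclass
+class EvalCase:
+    prompt: str
+    expected_route: list[str]  # expert names in order; [] = answer directly
+    should_clarify: bool = False
+
+
+CASES = [
+    EvalCase("Summarize our latest Notion spec on auth.", ["notion"]),
+    EvalCase(
+        "Email Sam the Q3 plan and put a review slot on my calendar.",
+        ["gmail", "calendar"],
+    ),
+    EvalCase("Send that thing we discussed.", [], should_clarify=True),
+]
+
+
+def score(route_fn) -> dict[str, float]:
+    """Report routing accuracy for `route_fn(prompt) -> (routes, asked)`."""
+    correct = 0
+    for case in CASES:
+        routes, asked = route_fn(case.prompt)
+        if case.should_clarify:
+            correct += int(asked)  # credit only if it asked first
+        else:
+            correct += int(routes == case.expected_route)
+    return {"routing_accuracy": correct / len(CASES)}
+```
+
+Growing this toward the 20-task set above, plus counters for the other four metrics, keeps tuning reproducible across prompt versions.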