Tighten supervisor delegation policy and align deliverables wording across prompts.

CREDO23 2026-04-30 16:43:44 +02:00
parent c077939522
commit 74337b462a
6 changed files with 123 additions and 22 deletions

View file

@ -0,0 +1,57 @@
# Multi-Agent Capability Parity Checklist
This checklist tracks whether `multi_agent_chat` has the required capability coverage
to be manually tested against `new_chat` in LangSmith.
Legend:
- `[x]` implemented
- `[~]` implemented with intentional difference
- `[ ]` pending
## 1) Prompting
- [x] Supervisor prompt has explicit delegation policy.
- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`).
- [x] Supervisor available specialist list is dynamically rendered from currently registered tools.
- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules.
- [x] Memory wording adapts to thread visibility (user vs team).
- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled.
## 2) Tooling and Routing
- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible).
- [x] Connector specialist routes are gated by available connector inventory.
- [x] MCP tools are partitioned and merged into matching specialists.
- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
- [~] `generic_mcp` route is intentionally disabled by product decision.
- [x] Delegated child tasks include explicit structured context envelope tags.
- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
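The "parsed and validated as JSON with safe fallback envelope" item could look roughly like the sketch below. The field names follow the structured contract listed under Prompting above; everything else (function and constant names, default values) is assumed, since the real SurfSense parser is not shown in this diff.

```python
import json
from typing import Any

VALID_STATUSES = {"success", "partial", "blocked", "error"}

# Safe envelope returned when a specialist reply cannot be trusted at all.
FALLBACK_ENVELOPE: dict[str, Any] = {
    "status": "error",
    "action_summary": "Specialist returned an unparseable reply.",
    "evidence": None,
    "next_step": "Re-delegate with a smaller, clearer task.",
    "missing_fields": [],
    "assumptions": [],
}

# Neutral defaults used to fill individually missing contract fields.
_DEFAULTS: dict[str, Any] = {
    "action_summary": "",
    "evidence": None,
    "next_step": None,
    "missing_fields": [],
    "assumptions": [],
}


def parse_expert_output(raw: str) -> dict[str, Any]:
    """Parse a specialist reply; return a safe error envelope on any defect."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK_ENVELOPE)
    if not isinstance(data, dict) or data.get("status") not in VALID_STATUSES:
        return dict(FALLBACK_ENVELOPE)
    # Fill any missing contract fields with neutral defaults instead of failing.
    return {**_DEFAULTS, **data}
```

The point of the fallback envelope is that the supervisor never sees a raw exception or malformed blob: every path yields the same field shape it already knows how to reason over.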
## 3) Middleware / Runtime
- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
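The `asyncio.to_thread` item above can be sketched as follows. The builder function and its signature are placeholders (the real compile path is not in this diff); only the off-loop pattern itself is the point.

```python
import asyncio


def compile_agent_graph(spec: dict) -> str:
    """Stand-in for the heavy, blocking graph build; the real builder differs."""
    # Pretend this walks the spec and wires specialists into a compiled graph.
    return f"graph({len(spec)} nodes)"


async def build_graph_off_loop(spec: dict) -> str:
    # asyncio.to_thread runs the blocking compile in a worker thread,
    # keeping the event loop free to serve concurrent chat streams.
    return await asyncio.to_thread(compile_agent_graph, spec)


result = asyncio.run(build_graph_off_loop({"research": {}, "memory": {}}))
```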
## 4) Entry-Point Wiring
- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
## 5) Observability and Validation Readiness
- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
- [ ] Formal routing eval harness and benchmark dataset.
- [ ] Automated regression checks in CI for routing quality.
## 6) Manual Benchmark Readiness Decision
Status: **Ready for manual benchmarking in authenticated flows**.
Before declaring it "better than `new_chat`", the following are still required:
- Build and run formal eval/benchmark harness.
- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope.

View file

@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent.
You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis.
<goal>
Generate high-quality deliverables with explicit constraints and reliable artifact reporting.
Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated.
</goal>
<available_tools>

View file

@ -158,7 +158,8 @@ def build_supervisor_routing_tools(
DomainRoutingSpec(
tool_name="deliverables",
description=(
"Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images."
"Use for deliverables and shareable artifacts: generated reports, podcasts, "
"video presentations, resumes, and images—not for routine lookups or single small edits elsewhere."
),
domain_agent=deliverables_agent,
),

View file

@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver
_SPECIALIST_CAPABILITIES: dict[str, str] = {
"research": "external research: web lookup, source gathering, and SurfSense documentation help.",
"memory": "save durable long-lived memory items.",
"deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.",
"deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.",
"gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.",
"calendar": "scheduling actions: check availability, inspect events, create events, and update events.",
"google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.",

View file

@ -15,33 +15,40 @@ Use only the specialists listed below.
2) Answer directly when no expert tool is needed.
3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent).
4) Do not call a specialist "just in case". Every delegation must have a clear purpose.
5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go.
6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact.
7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask.
8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies.
</delegation_policy>
<task_writing_policy>
When delegating to a specialist, pass a compact but complete task that includes:
- user goal,
- concrete constraints (time range, recipients, format, etc.),
- success criteria,
- required output details (IDs/links/timestamps when applicable).
- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user's message verbatim),
- concrete limits (dates, names, “last N items,” which details matter),
- how you will judge success,
- any identifiers or links the user already gave.
When asking for lists or searches, always say **how many** items at most and **which details** you need back.
Never pass implementation chatter. Pass only actionable instructions.
Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image).
</task_writing_policy>
<expert_output_contract_policy>
Every specialist call returns one JSON object. Parse and reason over these fields:
- `status`: `success` | `partial` | `blocked` | `error`
- `action_summary`: concise execution summary
- `evidence`: task-specific proof/results
- `next_step`: required follow-up when not fully successful
- `missing_fields`: required user inputs (when blocked by missing info)
- `assumptions`: inferred values used by the expert
Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes:
- **outcome**: succeeded, partly done, blocked, or failed
- **short summary** of what they did
- **proof**: what they actually saw or changed (when relevant)
- **what to do next** if they are not done
- **what you must ask the user** if something was missing
- **what they assumed** if they had to fill a gap
Field-handling rules:
1) `status=success`: trust the result only when supported by `evidence`.
2) `status=partial`: use completed `evidence`, then continue with `next_step`.
3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`).
4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`.
5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification.
How to use it:
1) **Succeeded**: only treat it as done if the **proof** backs it up.
2) **Partly done**: use what they proved, then follow their **what to do next**.
3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed).
4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery.
5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user.
</expert_output_contract_policy>
<clarification_policy>
@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer:
- key results/artifacts,
- unresolved items and the next best step.
- include assumptions only when they affected outcomes.
- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump).
- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim).
Never claim an action succeeded unless an expert returned success evidence.
Never claim an action succeeded unless their reply includes proof that matches what you claim.
</synthesis_policy>