From 74337b462acb0178bbd8e53182acc4675f28e443 Mon Sep 17 00:00:00 2001 From: CREDO23 Date: Thu, 30 Apr 2026 16:43:44 +0200 Subject: [PATCH] Tighten supervisor delegation policy and align deliverables wording across prompts. --- surfsense_backend/AFTER_WORK_FIXES.md | 36 ++++++++++++ .../CAPABILITY_PARITY_CHECKLIST.md | 57 +++++++++++++++++++ .../builtins/deliverables/domain_prompt.md | 2 +- .../routing/supervisor_routing.py | 3 +- .../multi_agent_chat/supervisor/graph.py | 2 +- .../supervisor/supervisor_prompt.md | 45 ++++++++------- 6 files changed, 123 insertions(+), 22 deletions(-) create mode 100644 surfsense_backend/AFTER_WORK_FIXES.md create mode 100644 surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md diff --git a/surfsense_backend/AFTER_WORK_FIXES.md b/surfsense_backend/AFTER_WORK_FIXES.md new file mode 100644 index 000000000..d74b7c7da --- /dev/null +++ b/surfsense_backend/AFTER_WORK_FIXES.md @@ -0,0 +1,36 @@ +# After Work Fixes + +## Middleware Risk Flags (new_chat) + +These are known "policy/routing via middleware" risks to review later. + +1. `FileIntentMiddleware` +- Risk: `file_write` classification can force `write_file`/`edit_file` and override deliverable or connector tool selection. +- Example failure: user asks for video/report artifact, agent writes into `/documents/*` instead. + +2. `KnowledgePriorityMiddleware` +- Risk: KB planner and injected priority hints can over-anchor turns to KB reads when connector action is the better path. + +3. `KnowledgeTreeMiddleware` +- Risk: injected workspace tree can bias behavior toward file navigation/writes by default. + +4. `SurfSenseFilesystemMiddleware` + `KnowledgeBasePersistenceMiddleware` +- Risk: mistaken `write_file` actions become persisted NOTE documents in KB, making wrong-path behavior durable. + +5. `PermissionMiddleware` +- Risk: deny/ask rules can hide or block the correct tool, appearing as "model chose wrong tool" when it never had access. + +6. 
Subagent middleware parity (`chat_deepagent.py`) +- Risk: parent vs subagent stack differences can produce inconsistent behavior across similar tasks. + +7. `SpillingContextEditingMiddleware` + compaction +- Risk: context trimming can remove critical tool evidence and cause wrong retries/tool choices. + +8. `ToolCallNameRepairMiddleware` +- Risk: malformed calls may be auto-repaired to unintended tools in edge cases. + +9. `DedupHITLToolCallsMiddleware` / `DoomLoopMiddleware` +- Risk: legitimate repeated calls can be suppressed or stopped early. + +10. `MemoryInjectionMiddleware` +- Risk: injected memory may bias tool choice away from fresh connector/KB evidence. diff --git a/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md b/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md new file mode 100644 index 000000000..b759e3221 --- /dev/null +++ b/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md @@ -0,0 +1,57 @@ +# Multi-Agent Capability Parity Checklist + +This checklist tracks whether `multi_agent_chat` has the required capability coverage +to be manually tested against `new_chat` in LangSmith. + +Legend: +- `[x]` implemented +- `[~]` implemented with intentional difference +- `[ ]` pending + +## 1) Prompting + +- [x] Supervisor prompt has explicit delegation policy. +- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`). +- [x] Supervisor available specialist list is dynamically rendered from currently registered tools. +- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules. +- [x] Memory wording adapts to thread visibility (user vs team). +- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled. + +## 2) Tooling and Routing + +- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible). 
+- [x] Connector specialist routes are gated by available connector inventory.
+- [x] MCP tools are partitioned and merged into matching specialists.
+- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
+- [~] `generic_mcp` route is intentionally disabled by product decision.
+- [x] Delegated child tasks include explicit structured context envelope tags.
+- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
+
+## 3) Middleware / Runtime
+
+- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
+- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
+- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
+- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
+- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
+
+## 4) Entry-Point Wiring
+
+- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
+- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
+- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
+- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
+
+## 5) Observability and Validation Readiness
+
+- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
+- [ ] Formal routing eval harness and benchmark dataset.
+- [ ] Automated regression checks in CI for routing quality.
+
+## 6) Manual Benchmark Readiness Decision
+
+Status: **Ready for manual benchmarking in authenticated flows**.
+ +Before declaring "better than `new_chat`", still required: +- Build and run formal eval/benchmark harness. +- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope. diff --git a/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md b/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md index e334a921b..c44f131bb 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md +++ b/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md @@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent. You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis. -Generate high-quality deliverables with explicit constraints and reliable artifact reporting. +Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated. diff --git a/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py b/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py index 91fac9cd5..ef496ab17 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py +++ b/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py @@ -158,7 +158,8 @@ def build_supervisor_routing_tools( DomainRoutingSpec( tool_name="deliverables", description=( - "Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images." + "Use for deliverables and shareable artifacts: generated reports, podcasts, " + "video presentations, resumes, and images—not for routine lookups or single small edits elsewhere." 
), domain_agent=deliverables_agent, ), diff --git a/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py b/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py index d03e9560a..abb3bee8d 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py +++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py @@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver _SPECIALIST_CAPABILITIES: dict[str, str] = { "research": "external research: web lookup, source gathering, and SurfSense documentation help.", "memory": "save durable long-lived memory items.", - "deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.", + "deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.", "gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.", "calendar": "scheduling actions: check availability, inspect events, create events, and update events.", "google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.", diff --git a/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md b/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md index 684c03333..790e98753 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md +++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md @@ -15,33 +15,40 @@ Use only the specialists listed below. 2) Answer directly when no expert tool is needed. 3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent). 4) Do not call a specialist "just in case". Every delegation must have a clear purpose. 
+5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go. +6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact. +7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask. +8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies. When delegating to a specialist, pass a compact but complete task that includes: -- user goal, -- concrete constraints (time range, recipients, format, etc.), -- success criteria, -- required output details (IDs/links/timestamps when applicable). +- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user’s message verbatim), +- concrete limits (dates, names, “last N items,” which details matter), +- how you will judge success, +- any identifiers or links the user already gave. + +When asking for lists or searches, always say **how many** items at most and **which details** you need back. Never pass implementation chatter. 
Pass only actionable instructions. +Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image). -Every specialist call returns one JSON object. Parse and reason over these fields: -- `status`: `success` | `partial` | `blocked` | `error` -- `action_summary`: concise execution summary -- `evidence`: task-specific proof/results -- `next_step`: required follow-up when not fully successful -- `missing_fields`: required user inputs (when blocked by missing info) -- `assumptions`: inferred values used by the expert +Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes: +- **outcome**: succeeded, partly done, blocked, or failed +- **short summary** of what they did +- **proof**: what they actually saw or changed (when relevant) +- **what to do next** if they are not done +- **what you must ask the user** if something was missing +- **what they assumed** if they had to fill a gap -Field-handling rules: -1) `status=success`: trust the result only when supported by `evidence`. -2) `status=partial`: use completed `evidence`, then continue with `next_step`. -3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`). -4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`. -5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification. +How to use it: +1) **Succeeded**: only treat it as done if the **proof** backs it up. +2) **Partly done**: use what they proved, then follow their **what to do next**. 
+3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed). +4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery. +5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user. @@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer: - key results/artifacts, - unresolved items and the next best step. - include assumptions only when they affected outcomes. -- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump). +- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim). -Never claim an action succeeded unless an expert returned success evidence. +Never claim an action succeeded unless their reply includes proof that matches what you claim.
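Reviewer note: the parity checklist above claims "Domain-agent outputs are parsed and validated as JSON with safe fallback envelope." A minimal sketch of that pattern (function and field names here are hypothetical, not taken from this patch) could look like:

```python
import json

# Statuses the supervisor prompt distinguishes: success | partial | blocked | error.
ALLOWED_STATUSES = {"success", "partial", "blocked", "error"}


def parse_specialist_reply(raw: str) -> dict:
    """Return a well-formed reply envelope; never raise on malformed input."""
    fallback = {
        "status": "error",
        "action_summary": "specialist returned an unparseable reply",
        "evidence": None,
        "next_step": "re-delegate with a smaller, clearer task",
        "missing_fields": [],
        "assumptions": [],
    }
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Missing or garbled reply: treat it as failed, per rule 5.
        return fallback
    if not isinstance(data, dict) or data.get("status") not in ALLOWED_STATUSES:
        # Contradictory or off-contract reply: also treated as failed.
        return fallback
    # Fill omitted optional fields so downstream handling can rely on the shape.
    for key, default in fallback.items():
        data.setdefault(key, default)
    return data
```

This mirrors the prompt's field-handling rules: anything that cannot be parsed into the contract shape collapses to a safe `error` envelope instead of letting the supervisor fabricate an outcome.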