Tighten supervisor delegation policy and align deliverables wording across prompts.

2026-05-06 22:32:39 +02:00 · 2026-04-30 16:43:44 +02:00 · 2026-04-30 16:43:44 +02:00 · 74337b462a
commit 74337b462a
parent c077939522
6 changed files with 123 additions and 22 deletions
--- a/surfsense_backend/AFTER_WORK_FIXES.md
+++ b/surfsense_backend/AFTER_WORK_FIXES.md
@@ -0,0 +1,36 @@
 # After Work Fixes
 ## Middleware Risk Flags (new_chat)
 These are known "policy/routing via middleware" risks to review later.
 1. `FileIntentMiddleware`
 - Risk: `file_write` classification can force `write_file`/`edit_file` and override deliverable or connector tool selection.
 - Example failure: user asks for video/report artifact, agent writes into `/documents/*` instead.
 2. `KnowledgePriorityMiddleware`
 - Risk: KB planner and injected priority hints can over-anchor turns to KB reads when connector action is the better path.
 3. `KnowledgeTreeMiddleware`
 - Risk: injected workspace tree can bias behavior toward file navigation/writes by default.
 4. `SurfSenseFilesystemMiddleware` + `KnowledgeBasePersistenceMiddleware`
 - Risk: mistaken `write_file` actions become persisted NOTE documents in KB, making wrong-path behavior durable.
 5. `PermissionMiddleware`
 - Risk: deny/ask rules can hide or block the correct tool, appearing as "model chose wrong tool" when it never had access.
 6. Subagent middleware parity (`chat_deepagent.py`)
 - Risk: parent vs subagent stack differences can produce inconsistent behavior across similar tasks.
 7. `SpillingContextEditingMiddleware` + compaction
 - Risk: context trimming can remove critical tool evidence and cause wrong retries/tool choices.
 8. `ToolCallNameRepairMiddleware`
 - Risk: malformed calls may be auto-repaired to unintended tools in edge cases.
 9. `DedupHITLToolCallsMiddleware` / `DoomLoopMiddleware`
 - Risk: legitimate repeated calls can be suppressed or stopped early.
 10. `MemoryInjectionMiddleware`
 - Risk: injected memory may bias tool choice away from fresh connector/KB evidence.
--- a/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md
+++ b/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md
@@ -0,0 +1,57 @@
 # Multi-Agent Capability Parity Checklist
 This checklist tracks whether `multi_agent_chat` has the required capability coverage
 to be manually tested against `new_chat` in LangSmith.
 Legend:
 - `[x]` implemented
 - `[~]` implemented with intentional difference
 - `[ ]` pending
 ## 1) Prompting
 - [x] Supervisor prompt has explicit delegation policy.
 - [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`).
 - [x] Supervisor available specialist list is dynamically rendered from currently registered tools.
 - [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules.
 - [x] Memory wording adapts to thread visibility (user vs team).
 - [~] `generic_mcp` specialist prompt exists but route is intentionally disabled.
 ## 2) Tooling and Routing
 - [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible).
 - [x] Connector specialist routes are gated by available connector inventory.
 - [x] MCP tools are partitioned and merged into matching specialists.
 - [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
 - [~] `generic_mcp` route is intentionally disabled by product decision.
 - [x] Delegated child tasks include explicit structured context envelope tags.
 - [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
 ## 3) Middleware / Runtime
 - [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
 - [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
 - [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
 - [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
 - [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
 ## 4) Entry-Point Wiring
 - [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
 - [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
 - [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
 - [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
 ## 5) Observability and Validation Readiness
 - [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
 - [ ] Formal routing eval harness and benchmark dataset.
 - [ ] Automated regression checks in CI for routing quality.
 ## 6) Manual Benchmark Readiness Decision
 Status: **Ready for manual benchmarking in authenticated flows**.
 Before declaring "better than `new_chat`", still required:
 - Build and run formal eval/benchmark harness.
 - Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope.
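
Reviewer sketch (not part of the commit): the entry-point rules in section 4 of the checklist can be read as a small decision function. The function name `select_chat_agent` and its parameters are hypothetical and are not taken from the SurfSense codebase; only the flag name `MULTI_AGENT_CHAT_ENABLED` and the two path names come from the checklist. The sketch also assumes that an empty `disabled_tools` list counts as "not provided", which the checklist does not state.

```python
from typing import Optional, Sequence


def select_chat_agent(
    multi_agent_enabled: bool,
    authenticated: bool,
    disabled_tools: Optional[Sequence[str]] = None,
) -> str:
    """Return the agent path a request would take under the checklist's rules.

    Hypothetical helper: multi_agent_enabled stands in for the value of
    MULTI_AGENT_CHAT_ENABLED.
    """
    # Feature flag off: every request stays on the existing path.
    if not multi_agent_enabled:
        return "new_chat"
    # The anonymous stream path is listed as pending, so it is left unchanged.
    if not authenticated:
        return "new_chat"
    # Disabled-tool filtering has no parity yet, so fall back when tools are disabled.
    # Assumption: an empty list is treated the same as no list.
    if disabled_tools:
        return "new_chat"
    return "multi_agent_chat"
```

Under these assumptions, only authenticated requests with the flag on and no disabled tools reach `multi_agent_chat`, which matches the "ready for manual benchmarking in authenticated flows" status above.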
--- a/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md
+++ b/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md
@@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent.
 You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis.
 <goal>
-Generate high-quality deliverables with explicit constraints and reliable artifact reporting.
+Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated.
 </goal>
 <available_tools>
--- a/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py
+++ b/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py
@@ -158,7 +158,8 @@ def build_supervisor_routing_tools(
            DomainRoutingSpec(
                tool_name="deliverables",
                description=(
-                    "Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images."
+                    "Use for deliverables and shareable artifacts: generated reports, podcasts, "
+                    "video presentations, resumes, and images—not for routine lookups or single small edits elsewhere."
                ),
                domain_agent=deliverables_agent,
            ),
--- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py
+++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py
@@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver
 _SPECIALIST_CAPABILITIES: dict[str, str] = {
    "research": "external research: web lookup, source gathering, and SurfSense documentation help.",
    "memory": "save durable long-lived memory items.",
-    "deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.",
+    "deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.",
    "gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.",
    "calendar": "scheduling actions: check availability, inspect events, create events, and update events.",
    "google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.",
--- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md
+++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md
@@ -15,33 +15,40 @@ Use only the specialists listed below.
 2) Answer directly when no expert tool is needed.
 3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent).
 4) Do not call a specialist "just in case". Every delegation must have a clear purpose.
 5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go.
 6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact.
 7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask.
 8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies.
 </delegation_policy>
 <task_writing_policy>
 When delegating to a specialist, pass a compact but complete task that includes:
- user goal,
+- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user’s message verbatim),
- concrete constraints (time range, recipients, format, etc.),
+- concrete limits (dates, names, “last N items,” which details matter),
- success criteria,
+- how you will judge success,
- required output details (IDs/links/timestamps when applicable).
+- any identifiers or links the user already gave.
 When asking for lists or searches, always say **how many** items at most and **which details** you need back.
 Never pass implementation chatter. Pass only actionable instructions.
 Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image).
 </task_writing_policy>
 <expert_output_contract_policy>
-Every specialist call returns one JSON object. Parse and reason over these fields:
+Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes:
- `status`: `success` | `partial` | `blocked` | `error`
+- **outcome**: succeeded, partly done, blocked, or failed
- `action_summary`: concise execution summary
+- **short summary** of what they did
- `evidence`: task-specific proof/results
+- **proof**: what they actually saw or changed (when relevant)
- `next_step`: required follow-up when not fully successful
+- **what to do next** if they are not done
- `missing_fields`: required user inputs (when blocked by missing info)
+- **what you must ask the user** if something was missing
- `assumptions`: inferred values used by the expert
+- **what they assumed** if they had to fill a gap
-Field-handling rules:
+How to use it:
-1) `status=success`: trust the result only when supported by `evidence`.
+1) **Succeeded**: only treat it as done if the **proof** backs it up.
-2) `status=partial`: use completed `evidence`, then continue with `next_step`.
+2) **Partly done**: use what they proved, then follow their **what to do next**.
-3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`).
+3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed).
-4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`.
+4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery.
-5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification.
+5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user.
 </expert_output_contract_policy>
 <clarification_policy>
@@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer:
 - key results/artifacts,
 - unresolved items and the next best step.
 - include assumptions only when they affected outcomes.
- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump).
+- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim).
-Never claim an action succeeded unless an expert returned success evidence.
+Never claim an action succeeded unless their reply includes proof that matches what you claim.
 </synthesis_policy>
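
Reviewer sketch (not part of the commit): the reworded `<expert_output_contract_policy>` drops the field names from the prompt, but the parity checklist still describes replies as JSON with a safe fallback envelope. The snippet below is one way the five handling rules could look in code, assuming the field names from the removed lines (`status`, `action_summary`, `evidence`, `next_step`, `missing_fields`, `assumptions`). The function names and the returned action labels are hypothetical and do not come from the SurfSense codebase.

```python
import json
from typing import Any

# Status values as listed in the removed prompt lines.
VALID_STATUSES = ("success", "partial", "blocked", "error")


def fallback_envelope(reason: str) -> dict[str, Any]:
    """Build an error-shaped reply for missing, garbled, or contradictory output."""
    return {
        "status": "error",
        "action_summary": reason,
        "evidence": {},
        "next_step": None,
        "missing_fields": [],
        "assumptions": [],
    }


def parse_expert_reply(raw: str) -> dict[str, Any]:
    """Parse one specialist reply; never raise, fall back to an error envelope."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback_envelope("reply was not valid JSON")
    if not isinstance(data, dict) or data.get("status") not in VALID_STATUSES:
        return fallback_envelope("reply did not match the expected layout")
    envelope = fallback_envelope("")
    envelope.update(data)  # fields the specialist sent override the defaults
    return envelope


def supervisor_action(reply: dict[str, Any]) -> str:
    """Map a parsed reply to a hypothetical next action for the supervisor."""
    status = reply["status"]
    if status == "success":
        # Rule 1: only treat it as done if proof backs it up.
        return "report_result" if reply.get("evidence") else "treat_as_failed"
    if status == "partial":
        return "follow_next_step"  # Rule 2
    if status == "blocked":
        return "ask_user"  # Rule 3: ask only for what was missing
    return "retry_or_explain"  # Rules 4 and 5


reply = parse_expert_reply('{"status": "blocked", "missing_fields": ["recipient"]}')
print(supervisor_action(reply))
```

Folding every malformed reply into the `error` shape keeps rule 5 in one place: the supervisor only ever branches on four statuses and never has to inspect raw text.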