From 74337b462acb0178bbd8e53182acc4675f28e443 Mon Sep 17 00:00:00 2001 From: CREDO23 Date: Thu, 30 Apr 2026 16:43:44 +0200 Subject: [PATCH] Tighten supervisor delegation policy and align deliverables wording across prompts. --- surfsense_backend/AFTER_WORK_FIXES.md | 36 ++++++++++++ .../CAPABILITY_PARITY_CHECKLIST.md | 57 +++++++++++++++++++ .../builtins/deliverables/domain_prompt.md | 2 +- .../routing/supervisor_routing.py | 3 +- .../multi_agent_chat/supervisor/graph.py | 2 +- .../supervisor/supervisor_prompt.md | 45 ++++++++------- 6 files changed, 123 insertions(+), 22 deletions(-) create mode 100644 surfsense_backend/AFTER_WORK_FIXES.md create mode 100644 surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md diff --git a/surfsense_backend/AFTER_WORK_FIXES.md b/surfsense_backend/AFTER_WORK_FIXES.md new file mode 100644 index 000000000..d74b7c7da --- /dev/null +++ b/surfsense_backend/AFTER_WORK_FIXES.md @@ -0,0 +1,36 @@ +# After Work Fixes + +## Middleware Risk Flags (new_chat) + +These are known "policy/routing via middleware" risks to review later. + +1. `FileIntentMiddleware` +- Risk: `file_write` classification can force `write_file`/`edit_file` and override deliverable or connector tool selection. +- Example failure: user asks for video/report artifact, agent writes into `/documents/*` instead. + +2. `KnowledgePriorityMiddleware` +- Risk: KB planner and injected priority hints can over-anchor turns to KB reads when connector action is the better path. + +3. `KnowledgeTreeMiddleware` +- Risk: injected workspace tree can bias behavior toward file navigation/writes by default. + +4. `SurfSenseFilesystemMiddleware` + `KnowledgeBasePersistenceMiddleware` +- Risk: mistaken `write_file` actions become persisted NOTE documents in KB, making wrong-path behavior durable. + +5. `PermissionMiddleware` +- Risk: deny/ask rules can hide or block the correct tool, appearing as "model chose wrong tool" when it never had access. + +6. 
Subagent middleware parity (`chat_deepagent.py`) +- Risk: parent vs subagent stack differences can produce inconsistent behavior across similar tasks. + +7. `SpillingContextEditingMiddleware` + compaction +- Risk: context trimming can remove critical tool evidence and cause wrong retries/tool choices. + +8. `ToolCallNameRepairMiddleware` +- Risk: malformed calls may be auto-repaired to unintended tools in edge cases. + +9. `DedupHITLToolCallsMiddleware` / `DoomLoopMiddleware` +- Risk: legitimate repeated calls can be suppressed or stopped early. + +10. `MemoryInjectionMiddleware` +- Risk: injected memory may bias tool choice away from fresh connector/KB evidence. diff --git a/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md b/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md new file mode 100644 index 000000000..b759e3221 --- /dev/null +++ b/surfsense_backend/app/agents/multi_agent_chat/CAPABILITY_PARITY_CHECKLIST.md @@ -0,0 +1,57 @@ +# Multi-Agent Capability Parity Checklist + +This checklist tracks whether `multi_agent_chat` has the required capability coverage +to be manually tested against `new_chat` in LangSmith. + +Legend: +- `[x]` implemented +- `[~]` implemented with intentional difference +- `[ ]` pending + +## 1) Prompting + +- [x] Supervisor prompt has explicit delegation policy. +- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`). +- [x] Supervisor available specialist list is dynamically rendered from currently registered tools. +- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules. +- [x] Memory wording adapts to thread visibility (user vs team). +- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled. + +## 2) Tooling and Routing + +- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible). 
+- [x] Connector specialist routes are gated by available connector inventory.
+- [x] MCP tools are partitioned and merged into matching specialists.
+- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
+- [~] `generic_mcp` route is intentionally disabled by product decision.
+- [x] Delegated child tasks include explicit structured context envelope tags.
+- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
+
+## 3) Middleware / Runtime
+
+- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
+- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
+- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
+- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
+- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
+
+## 4) Entry-Point Wiring
+
+- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
+- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
+- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
+- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
+
+## 5) Observability and Validation Readiness
+
+- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
+- [ ] Formal routing eval harness and benchmark dataset.
+- [ ] Automated regression checks in CI for routing quality.
+
+## 6) Manual Benchmark Readiness Decision
+
+Status: **Ready for manual benchmarking in authenticated flows**.
+ +Before declaring "better than `new_chat`", still required: +- Build and run formal eval/benchmark harness. +- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope. diff --git a/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md b/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md index e334a921b..c44f131bb 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md +++ b/surfsense_backend/app/agents/multi_agent_chat/expert_agent/builtins/deliverables/domain_prompt.md @@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent. You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis. -Generate high-quality deliverables with explicit constraints and reliable artifact reporting. +Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated. diff --git a/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py b/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py index 91fac9cd5..ef496ab17 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py +++ b/surfsense_backend/app/agents/multi_agent_chat/routing/supervisor_routing.py @@ -158,7 +158,8 @@ def build_supervisor_routing_tools( DomainRoutingSpec( tool_name="deliverables", description=( - "Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images." + "Use for deliverables and shareable artifacts: generated reports, podcasts, " + "video presentations, resumes, and images—not for routine lookups or single small edits elsewhere." 
), domain_agent=deliverables_agent, ), diff --git a/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py b/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py index d03e9560a..abb3bee8d 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py +++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/graph.py @@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver _SPECIALIST_CAPABILITIES: dict[str, str] = { "research": "external research: web lookup, source gathering, and SurfSense documentation help.", "memory": "save durable long-lived memory items.", - "deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.", + "deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.", "gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.", "calendar": "scheduling actions: check availability, inspect events, create events, and update events.", "google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.", diff --git a/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md b/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md index 684c03333..790e98753 100644 --- a/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md +++ b/surfsense_backend/app/agents/multi_agent_chat/supervisor/supervisor_prompt.md @@ -15,33 +15,40 @@ Use only the specialists listed below. 2) Answer directly when no expert tool is needed. 3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent). 4) Do not call a specialist "just in case". Every delegation must have a clear purpose. 
+5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go. +6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact. +7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask. +8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies. When delegating to a specialist, pass a compact but complete task that includes: -- user goal, -- concrete constraints (time range, recipients, format, etc.), -- success criteria, -- required output details (IDs/links/timestamps when applicable). +- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user’s message verbatim), +- concrete limits (dates, names, “last N items,” which details matter), +- how you will judge success, +- any identifiers or links the user already gave. + +When asking for lists or searches, always say **how many** items at most and **which details** you need back. Never pass implementation chatter. 
Pass only actionable instructions. +Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image). -Every specialist call returns one JSON object. Parse and reason over these fields: -- `status`: `success` | `partial` | `blocked` | `error` -- `action_summary`: concise execution summary -- `evidence`: task-specific proof/results -- `next_step`: required follow-up when not fully successful -- `missing_fields`: required user inputs (when blocked by missing info) -- `assumptions`: inferred values used by the expert +Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes: +- **outcome**: succeeded, partly done, blocked, or failed +- **short summary** of what they did +- **proof**: what they actually saw or changed (when relevant) +- **what to do next** if they are not done +- **what you must ask the user** if something was missing +- **what they assumed** if they had to fill a gap -Field-handling rules: -1) `status=success`: trust the result only when supported by `evidence`. -2) `status=partial`: use completed `evidence`, then continue with `next_step`. -3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`). -4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`. -5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification. +How to use it: +1) **Succeeded**: only treat it as done if the **proof** backs it up. +2) **Partly done**: use what they proved, then follow their **what to do next**. 
+3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed). +4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery. +5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user. @@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer: - key results/artifacts, - unresolved items and the next best step. - include assumptions only when they affected outcomes. -- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump). +- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim). -Never claim an action succeeded unless an expert returned success evidence. +Never claim an action succeeded unless their reply includes proof that matches what you claim.
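Reviewer note: the parity checklist above claims "Domain-agent outputs are parsed and validated as JSON with safe fallback envelope." A minimal sketch of that pattern (function and field names here are hypothetical, not taken from this patch) could look like:

```python
import json

# Statuses the supervisor prompt distinguishes: success | partial | blocked | error.
ALLOWED_STATUSES = {"success", "partial", "blocked", "error"}


def parse_specialist_reply(raw: str) -> dict:
    """Return a well-formed reply envelope; never raise on malformed input."""
    fallback = {
        "status": "error",
        "action_summary": "specialist returned an unparseable reply",
        "evidence": None,
        "next_step": "re-delegate with a smaller, clearer task",
        "missing_fields": [],
        "assumptions": [],
    }
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Missing or garbled reply: treat it as failed, per rule 5.
        return fallback
    if not isinstance(data, dict) or data.get("status") not in ALLOWED_STATUSES:
        # Contradictory or off-contract reply: also treated as failed.
        return fallback
    # Fill omitted optional fields so downstream handling can rely on the shape.
    for key, default in fallback.items():
        data.setdefault(key, default)
    return data
```

This mirrors the prompt's field-handling rules: anything that cannot be parsed into the contract shape collapses to a safe `error` envelope instead of letting the supervisor fabricate an outcome.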