Tighten supervisor delegation policy and align deliverables wording across prompts.

CREDO23 2026-04-30 16:43:44 +02:00
parent c077939522
commit 74337b462a
6 changed files with 123 additions and 22 deletions

View file

@ -0,0 +1,57 @@
# Multi-Agent Capability Parity Checklist
This checklist tracks whether `multi_agent_chat` has the required capability coverage
to be manually tested against `new_chat` in LangSmith.
Legend:
- `[x]` implemented
- `[~]` implemented with intentional difference
- `[ ]` pending
## 1) Prompting
- [x] Supervisor prompt has explicit delegation policy.
- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`).
- [x] Supervisor available specialist list is dynamically rendered from currently registered tools.
- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules.
- [x] Memory wording adapts to thread visibility (user vs team).
- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled.
## 2) Tooling and Routing
- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible).
- [x] Connector specialist routes are gated by available connector inventory.
- [x] MCP tools are partitioned and merged into matching specialists.
- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
- [~] `generic_mcp` route is intentionally disabled by product decision.
- [x] Delegated child tasks include explicit structured context envelope tags.
- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
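The "parsed and validated as JSON with safe fallback envelope" item could look roughly like the sketch below. The field names follow the structured contract listed under Prompting above; everything else (function and constant names, default values) is assumed, since the real SurfSense parser is not shown in this diff.

```python
import json
from typing import Any

VALID_STATUSES = {"success", "partial", "blocked", "error"}

# Safe envelope returned when a specialist reply cannot be trusted at all.
FALLBACK_ENVELOPE: dict[str, Any] = {
    "status": "error",
    "action_summary": "Specialist returned an unparseable reply.",
    "evidence": None,
    "next_step": "Re-delegate with a smaller, clearer task.",
    "missing_fields": [],
    "assumptions": [],
}

# Neutral defaults used to fill individually missing contract fields.
_DEFAULTS: dict[str, Any] = {
    "action_summary": "",
    "evidence": None,
    "next_step": None,
    "missing_fields": [],
    "assumptions": [],
}


def parse_expert_output(raw: str) -> dict[str, Any]:
    """Parse a specialist reply; return a safe error envelope on any defect."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK_ENVELOPE)
    if not isinstance(data, dict) or data.get("status") not in VALID_STATUSES:
        return dict(FALLBACK_ENVELOPE)
    # Fill any missing contract fields with neutral defaults instead of failing.
    return {**_DEFAULTS, **data}
```

The point of the fallback envelope is that the supervisor never sees a raw exception or malformed blob: every path yields the same field shape it already knows how to reason over.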
## 3) Middleware / Runtime
- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
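The `asyncio.to_thread` item above can be sketched as follows. The builder function and its signature are placeholders (the real compile path is not in this diff); only the off-loop pattern itself is the point.

```python
import asyncio


def compile_agent_graph(spec: dict) -> str:
    """Stand-in for the heavy, blocking graph build; the real builder differs."""
    # Pretend this walks the spec and wires specialists into a compiled graph.
    return f"graph({len(spec)} nodes)"


async def build_graph_off_loop(spec: dict) -> str:
    # asyncio.to_thread runs the blocking compile in a worker thread,
    # keeping the event loop free to serve concurrent chat streams.
    return await asyncio.to_thread(compile_agent_graph, spec)


result = asyncio.run(build_graph_off_loop({"research": {}, "memory": {}}))
```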
## 4) Entry-Point Wiring
- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
## 5) Observability and Validation Readiness
- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
- [ ] Formal routing eval harness and benchmark dataset.
- [ ] Automated regression checks in CI for routing quality.
## 6) Manual Benchmark Readiness Decision
Status: **Ready for manual benchmarking in authenticated flows**.
Before declaring it "better than `new_chat`", the following are still required:
- Build and run formal eval/benchmark harness.
- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope.

View file

@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent.
You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis.
<goal>
Generate high-quality deliverables with explicit constraints and reliable artifact reporting.
Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated.
</goal>
<available_tools>

View file

@ -158,7 +158,8 @@ def build_supervisor_routing_tools(
DomainRoutingSpec(
tool_name="deliverables",
description=(
"Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images."
"Use for deliverables and shareable artifacts: generated reports, podcasts, "
"video presentations, resumes, and images—not for routine lookups or single small edits elsewhere."
),
domain_agent=deliverables_agent,
),

View file

@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver
_SPECIALIST_CAPABILITIES: dict[str, str] = {
"research": "external research: web lookup, source gathering, and SurfSense documentation help.",
"memory": "save durable long-lived memory items.",
"deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.",
"deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.",
"gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.",
"calendar": "scheduling actions: check availability, inspect events, create events, and update events.",
"google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.",

View file

@ -15,33 +15,40 @@ Use only the specialists listed below.
2) Answer directly when no expert tool is needed.
3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent).
4) Do not call a specialist "just in case". Every delegation must have a clear purpose.
5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go.
6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact.
7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask.
8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies.
</delegation_policy>
<task_writing_policy>
When delegating to a specialist, pass a compact but complete task that includes:
- user goal,
- concrete constraints (time range, recipients, format, etc.),
- success criteria,
- required output details (IDs/links/timestamps when applicable).
- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user's message verbatim),
- concrete limits (dates, names, “last N items,” which details matter),
- how you will judge success,
- any identifiers or links the user already gave.
When asking for lists or searches, always say **how many** items at most and **which details** you need back.
Never pass implementation chatter. Pass only actionable instructions.
Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image).
</task_writing_policy>
<expert_output_contract_policy>
Every specialist call returns one JSON object. Parse and reason over these fields:
- `status`: `success` | `partial` | `blocked` | `error`
- `action_summary`: concise execution summary
- `evidence`: task-specific proof/results
- `next_step`: required follow-up when not fully successful
- `missing_fields`: required user inputs (when blocked by missing info)
- `assumptions`: inferred values used by the expert
Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes:
- **outcome**: succeeded, partly done, blocked, or failed
- **short summary** of what they did
- **proof**: what they actually saw or changed (when relevant)
- **what to do next** if they are not done
- **what you must ask the user** if something was missing
- **what they assumed** if they had to fill a gap
Field-handling rules:
1) `status=success`: trust the result only when supported by `evidence`.
2) `status=partial`: use completed `evidence`, then continue with `next_step`.
3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`).
4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`.
5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification.
How to use it:
1) **Succeeded**: only treat it as done if the **proof** backs it up.
2) **Partly done**: use what they proved, then follow their **what to do next**.
3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed).
4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery.
5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user.
</expert_output_contract_policy>
<clarification_policy>
@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer:
- key results/artifacts,
- unresolved items and the next best step.
- include assumptions only when they affected outcomes.
- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump).
- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim).
Never claim an action succeeded unless an expert returned success evidence.
Never claim an action succeeded unless their reply includes proof that matches what you claim.
</synthesis_policy>