mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-06 22:32:39 +02:00
Tighten supervisor delegation policy and align deliverables wording across prompts.
parent c077939522
commit 74337b462a
6 changed files with 123 additions and 22 deletions
36
surfsense_backend/AFTER_WORK_FIXES.md
Normal file
@@ -0,0 +1,36 @@
# After Work Fixes

## Middleware Risk Flags (new_chat)

These are known "policy/routing via middleware" risks to review later.

1. `FileIntentMiddleware`
   - Risk: `file_write` classification can force `write_file`/`edit_file` and override deliverable or connector tool selection.
   - Example failure: user asks for video/report artifact, agent writes into `/documents/*` instead.

2. `KnowledgePriorityMiddleware`
   - Risk: KB planner and injected priority hints can over-anchor turns to KB reads when connector action is the better path.

3. `KnowledgeTreeMiddleware`
   - Risk: injected workspace tree can bias behavior toward file navigation/writes by default.

4. `SurfSenseFilesystemMiddleware` + `KnowledgeBasePersistenceMiddleware`
   - Risk: mistaken `write_file` actions become persisted NOTE documents in KB, making wrong-path behavior durable.

5. `PermissionMiddleware`
   - Risk: deny/ask rules can hide or block the correct tool, appearing as "model chose wrong tool" when it never had access.

6. Subagent middleware parity (`chat_deepagent.py`)
   - Risk: parent vs subagent stack differences can produce inconsistent behavior across similar tasks.

7. `SpillingContextEditingMiddleware` + compaction
   - Risk: context trimming can remove critical tool evidence and cause wrong retries/tool choices.

8. `ToolCallNameRepairMiddleware`
   - Risk: malformed calls may be auto-repaired to unintended tools in edge cases.

9. `DedupHITLToolCallsMiddleware` / `DoomLoopMiddleware`
   - Risk: legitimate repeated calls can be suppressed or stopped early.

10. `MemoryInjectionMiddleware`
    - Risk: injected memory may bias tool choice away from fresh connector/KB evidence.

@@ -0,0 +1,57 @@
# Multi-Agent Capability Parity Checklist

This checklist tracks whether `multi_agent_chat` has the required capability coverage
to be manually tested against `new_chat` in LangSmith.

Legend:
- `[x]` implemented
- `[~]` implemented with intentional difference
- `[ ]` pending

## 1) Prompting

- [x] Supervisor prompt has explicit delegation policy.
- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`).
- [x] Supervisor available specialist list is dynamically rendered from currently registered tools.
- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules.
- [x] Memory wording adapts to thread visibility (user vs team).
- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled.

## 2) Tooling and Routing

- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible).
- [x] Connector specialist routes are gated by available connector inventory.
- [x] MCP tools are partitioned and merged into matching specialists.
- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
- [~] `generic_mcp` route is intentionally disabled by product decision.
- [x] Delegated child tasks include explicit structured context envelope tags.
- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.
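
The "parsed and validated as JSON with safe fallback envelope" item could look roughly like the sketch below. It is not the repository's code: the function names, the fallback wording, and the 500-character truncation are assumptions made for illustration. The field names and the four status values are taken from the supervisor prompt in this commit.

```python
import json
from typing import Any

# Status values as listed in the supervisor prompt's output contract.
ALLOWED_STATUSES = {"success", "partial", "blocked", "error"}


def fallback_envelope(reason: str, raw: Any) -> dict[str, Any]:
    """Build a well-formed 'error' envelope (hypothetical helper)."""
    return {
        "status": "error",
        "action_summary": reason,
        # Keep a truncated copy of the raw output for debugging (assumed limit).
        "evidence": {"raw_output": str(raw)[:500]},
        "next_step": "Re-delegate with a smaller, clearer task or ask the user.",
        "missing_fields": [],
        "assumptions": [],
    }


def parse_expert_output(raw: Any) -> dict[str, Any]:
    """Parse a specialist reply; never raise, always return an envelope."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback_envelope("Expert output was not valid JSON.", raw)
    if not isinstance(data, dict):
        return fallback_envelope("Expert output was not a JSON object.", raw)
    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        return fallback_envelope("Expert output had a missing or unknown status.", raw)
    # Fill optional fields so callers can read them without key checks.
    data.setdefault("action_summary", "")
    data.setdefault("evidence", {})
    data.setdefault("next_step", None)
    data.setdefault("missing_fields", [])
    data.setdefault("assumptions", [])
    return data
```

Returning an `error` envelope instead of raising keeps the supervisor on a single code path: a malformed reply is handled the same way as a reply that honestly reports failure.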

## 3) Middleware / Runtime

- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.
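
For readers unfamiliar with the last item, here is a minimal sketch of offloading a blocking build step with `asyncio.to_thread` (standard library, Python 3.9+). `build_agent_graph` and `compile_graph` are hypothetical stand-ins; the real compile path in the repository is not shown in this commit.

```python
import asyncio
import time


def build_agent_graph(spec: dict) -> dict:
    """Stand-in for a blocking, heavy build step (hypothetical)."""
    time.sleep(0.05)  # simulates work that would otherwise stall the event loop
    return {"compiled": True, "nodes": sorted(spec)}


async def compile_graph(spec: dict) -> dict:
    # Run the blocking builder in a worker thread; the event loop stays free
    # to serve other coroutines (for example, streaming responses) meanwhile.
    return await asyncio.to_thread(build_agent_graph, spec)


if __name__ == "__main__":
    print(asyncio.run(compile_graph({"research": 1, "memory": 2})))
```

Note that `asyncio.to_thread` helps with blocking calls in general, but for pure-Python CPU-bound work the GIL still limits parallelism; the benefit here is event-loop responsiveness.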

## 4) Entry-Point Wiring

- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).
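
The wiring rules in this section can be summarized as a small decision function. This is a sketch under assumptions, not the actual entry-point code: `select_chat_backend`, the accepted flag spellings, and treating an empty `disabled_tools` list as "not provided" are all illustrative choices. Only the flag name `MULTI_AGENT_CHAT_ENABLED` and the fallback conditions come from the checklist.

```python
from __future__ import annotations

import os
from collections.abc import Mapping, Sequence


def multi_agent_enabled(env: Mapping[str, str] | None = None) -> bool:
    """Read the feature flag; the accepted truthy spellings are an assumption."""
    env = os.environ if env is None else env
    return env.get("MULTI_AGENT_CHAT_ENABLED", "").strip().lower() in {"1", "true", "yes"}


def select_chat_backend(
    *,
    authenticated: bool,
    disabled_tools: Sequence[str] | None,
    env: Mapping[str, str] | None = None,
) -> str:
    """Return which chat implementation a request should use (hypothetical)."""
    if not authenticated:
        return "new_chat"  # anonymous path is not wired to multi-agent yet
    if not multi_agent_enabled(env):
        return "new_chat"  # feature flag is off
    if disabled_tools:
        return "new_chat"  # no disabled-tool filtering parity yet
    return "multi_agent_chat"
```

For example, `select_chat_backend(authenticated=True, disabled_tools=["web_search"], env={"MULTI_AGENT_CHAT_ENABLED": "true"})` falls back to `"new_chat"` even though the flag is on (`web_search` is a made-up tool name).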

## 5) Observability and Validation Readiness

- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
- [ ] Formal routing eval harness and benchmark dataset.
- [ ] Automated regression checks in CI for routing quality.

## 6) Manual Benchmark Readiness Decision

Status: **Ready for manual benchmarking in authenticated flows**.

Before declaring "better than `new_chat`", still required:
- Build and run formal eval/benchmark harness.
- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope.

@@ -2,7 +2,7 @@ You are the SurfSense deliverables operations sub-agent.
 You receive delegated instructions from a supervisor agent and return structured results for supervisor synthesis.
 
 <goal>
-Generate high-quality deliverables with explicit constraints and reliable artifact reporting.
+Produce **deliverables**: shareable **artifacts** the user keeps (reports, slide-style video presentations, podcasts, resumes, images). Use explicit constraints and reliable proof of what was generated.
 </goal>
 
 <available_tools>

@@ -158,7 +158,8 @@ def build_supervisor_routing_tools(
 DomainRoutingSpec(
     tool_name="deliverables",
     description=(
-        "Use for creating final artifacts: reports, podcasts, video presentations, resumes, and images."
+        "Use for deliverables and shareable artifacts: generated reports, podcasts, "
+        "video presentations, resumes, and images—not for routine lookups or single small edits elsewhere."
     ),
     domain_agent=deliverables_agent,
 ),

@@ -18,7 +18,7 @@ _BUILTIN_SPECIALISTS: frozenset[str] = frozenset({"research", "memory", "deliver
 _SPECIALIST_CAPABILITIES: dict[str, str] = {
     "research": "external research: web lookup, source gathering, and SurfSense documentation help.",
     "memory": "save durable long-lived memory items.",
-    "deliverables": "final artifact generation: report, podcast, video presentation, resume, or image.",
+    "deliverables": "deliverables and shareable artifacts: reports, podcasts, video presentations, resumes, and images.",
     "gmail": "email inbox actions: search/read emails, draft updates, send messages, and trash emails.",
     "calendar": "scheduling actions: check availability, inspect events, create events, and update events.",
     "google_drive": "Drive file/document actions: locate files, inspect content, and manage files/folders.",

@@ -15,33 +15,40 @@ Use only the specialists listed below.
 2) Answer directly when no expert tool is needed.
 3) For multi-domain work, decompose into sequential expert calls (or parallel only when independent).
 4) Do not call a specialist "just in case". Every delegation must have a clear purpose.
+5) Specialists are best for **one clear step at a time**—for example “find this,” “show that record,” “make this one change.” Do **not** hand them an entire “analyze everything and write me a trends report” brief in one go.
+6) When the user wants **big-picture synthesis**—patterns across lots of items, comparisons across time, or an executive-style overview—**you** split the work: several **small** asks to whoever actually holds that information (each with a clear cap: how many items, how far back, which fields), then **you** combine the answers into one clear reply. If they need a **deliverable**—a real **artifact** others can read, hear, or watch (report, slide-style video, podcast, resume, image)—delegate to the **deliverables** specialist. Do not ask other specialists to replace that: their job is smaller steps (lookups and targeted changes), not producing the final artifact.
+7) Each specialist answers in a **single short structured reply** (no extra chatter after it). Ask them only for what that reply can reasonably hold. If the user needs a long narrative or full report, **you** combine steps—or use the **deliverables** specialist—not one overloaded ask.
+8) Prefer **a few clear, small asks** over one huge vague ask that invites guessing, cut-off answers, or broken replies.
 </delegation_policy>
 
 <task_writing_policy>
 When delegating to a specialist, pass a compact but complete task that includes:
-- user goal,
-- concrete constraints (time range, recipients, format, etc.),
-- success criteria,
-- required output details (IDs/links/timestamps when applicable).
+- the **outcome** they should produce, in **your own words** as clear instructions (do **not** paste or forward the user’s message verbatim),
+- concrete limits (dates, names, “last N items,” which details matter),
+- how you will judge success,
+- any identifiers or links the user already gave.
 
+When asking for lists or searches, always say **how many** items at most and **which details** you need back.
+
 Never pass implementation chatter. Pass only actionable instructions.
+Each delegation should sound like **one clear action** (or two that belong together), not a full project brief—unless you are intentionally speaking to **research** or to **deliverables** for a **deliverable artifact** (report, slide-style video, podcast, resume, image).
 </task_writing_policy>
 
 <expert_output_contract_policy>
-Every specialist call returns one JSON object. Parse and reason over these fields:
-- `status`: `success` | `partial` | `blocked` | `error`
-- `action_summary`: concise execution summary
-- `evidence`: task-specific proof/results
-- `next_step`: required follow-up when not fully successful
-- `missing_fields`: required user inputs (when blocked by missing info)
-- `assumptions`: inferred values used by the expert
+Every specialist returns **one structured reply** in a fixed layout. Treat it like a small form, not prose. It includes:
+- **outcome**: succeeded, partly done, blocked, or failed
+- **short summary** of what they did
+- **proof**: what they actually saw or changed (when relevant)
+- **what to do next** if they are not done
+- **what you must ask the user** if something was missing
+- **what they assumed** if they had to fill a gap
 
-Field-handling rules:
-1) `status=success`: trust the result only when supported by `evidence`.
-2) `status=partial`: use completed `evidence`, then continue with `next_step`.
-3) `status=blocked`: do not retry blindly; ask the user only for items in `missing_fields` (or clear disambiguation choices from `evidence`).
-4) `status=error`: do not claim completion; either retry with a better task if obvious, or explain failure and propose the expert's `next_step`.
-5) If an expert output appears invalid or contradictory, treat it as `error`, avoid fabricating details, and recover with a safer re-delegation or user clarification.
+How to use it:
+1) **Succeeded**: only treat it as done if the **proof** backs it up.
+2) **Partly done**: use what they proved, then follow their **what to do next**.
+3) **Blocked**: do not blindly retry; ask the user only what they said was missing (or pick from options they listed).
+4) **Failed**: do not pretend it worked; either retry with a clearer small ask or explain honestly and follow their suggested recovery.
+5) If the reply is missing, garbled, or contradicts itself, treat it as failed, do not invent facts, and recover with a safer smaller ask or a question to the user.
 </expert_output_contract_policy>
 
 <clarification_policy>
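
The five handling rules in the hunk above are instructions to the model, not code, but they map onto a simple decision table. The sketch below restates them in Python for readers who want to check the logic. The function name, the action labels, and the choice to treat "success without evidence" as a recoverable failure are assumptions, and the field names are those of the pre-change wording of the contract.

```python
from __future__ import annotations

from typing import Any


def next_supervisor_action(reply: Any) -> tuple[str, Any]:
    """Map a specialist reply to (action, detail). Hypothetical helper."""
    if not isinstance(reply, dict):
        # Rule 5: missing or garbled reply is treated as failed.
        return ("recover", None)
    status = reply.get("status")
    evidence = reply.get("evidence")
    if status == "success":
        # Rule 1: only accept success that is backed by proof.
        return ("accept", evidence) if evidence else ("recover", None)
    if status == "partial":
        # Rule 2: keep what was proven, then follow the suggested next step.
        return ("continue", reply.get("next_step"))
    if status == "blocked":
        # Rule 3: do not retry blindly; ask only for what is missing.
        return ("ask_user", reply.get("missing_fields") or [])
    if status == "error":
        # Rule 4: never claim completion; retry smaller or explain.
        return ("retry_or_explain", reply.get("next_step"))
    # Rule 5: unknown status counts as an invalid reply.
    return ("recover", None)
```

A reply such as `{"status": "success", "evidence": {}}` therefore yields `("recover", None)`: a success claim with empty proof is not accepted.
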
@@ -55,7 +62,7 @@ After expert calls, produce one coherent final answer:
 - key results/artifacts,
 - unresolved items and the next best step.
 - include assumptions only when they affected outcomes.
-- when multiple experts are used, merge outputs into one user-facing narrative (no raw JSON dump).
+- when multiple experts are used, merge outputs into one user-facing narrative (do not paste their raw structured reply verbatim).
 
-Never claim an action succeeded unless an expert returned success evidence.
+Never claim an action succeeded unless their reply includes proof that matches what you claim.
 </synthesis_policy>