Mirror of https://github.com/MODSetter/SurfSense.git (synced 2026-05-06 14:22:47 +02:00)

Remove multi-agent planning docs and fix stream_new_chat logger.

parent a6540b21c7
commit d157ceaabc
5 changed files with 2 additions and 375 deletions

@@ -1,36 +0,0 @@
# After Work Fixes

## Middleware Risk Flags (new_chat)

These are known "policy/routing via middleware" risks to review later.

1. `FileIntentMiddleware`
   - Risk: `file_write` classification can force `write_file`/`edit_file` and override deliverable or connector tool selection.
   - Example failure: user asks for video/report artifact, agent writes into `/documents/*` instead.

2. `KnowledgePriorityMiddleware`
   - Risk: KB planner and injected priority hints can over-anchor turns to KB reads when connector action is the better path.

3. `KnowledgeTreeMiddleware`
   - Risk: injected workspace tree can bias behavior toward file navigation/writes by default.

4. `SurfSenseFilesystemMiddleware` + `KnowledgeBasePersistenceMiddleware`
   - Risk: mistaken `write_file` actions become persisted NOTE documents in KB, making wrong-path behavior durable.

5. `PermissionMiddleware`
   - Risk: deny/ask rules can hide or block the correct tool, appearing as "model chose wrong tool" when it never had access.

6. Subagent middleware parity (`chat_deepagent.py`)
   - Risk: parent vs subagent stack differences can produce inconsistent behavior across similar tasks.

7. `SpillingContextEditingMiddleware` + compaction
   - Risk: context trimming can remove critical tool evidence and cause wrong retries/tool choices.

8. `ToolCallNameRepairMiddleware`
   - Risk: malformed calls may be auto-repaired to unintended tools in edge cases.

9. `DedupHITLToolCallsMiddleware` / `DoomLoopMiddleware`
   - Risk: legitimate repeated calls can be suppressed or stopped early.

10. `MemoryInjectionMiddleware`
    - Risk: injected memory may bias tool choice away from fresh connector/KB evidence.

@@ -1,57 +0,0 @@
# Multi-Agent Capability Parity Checklist

This checklist tracks whether `multi_agent_chat` has the required capability coverage
to be manually tested against `new_chat` in LangSmith.

Legend:
- `[x]` implemented
- `[~]` implemented with intentional difference
- `[ ]` pending

## 1) Prompting

- [x] Supervisor prompt has explicit delegation policy.
- [x] Supervisor prompt consumes structured expert outputs (`status`, `evidence`, `next_step`, `missing_fields`, `assumptions`).
- [x] The supervisor's list of available specialists is rendered dynamically from the currently registered tools.
- [x] All expert prompts are normalized to a shared JSON output contract shape with invariant rules.
- [x] Memory wording adapts to thread visibility (user vs team).
- [~] `generic_mcp` specialist prompt exists but route is intentionally disabled.

## 2) Tooling and Routing

- [x] Built-in specialist routes are wired (`research`, `memory`, `deliverables` when eligible).
- [x] Connector specialist routes are gated by available connector inventory.
- [x] MCP tools are partitioned and merged into matching specialists.
- [x] MCP-only named specialists are routed when present (`linear`, `slack`, `jira`, `clickup`, `airtable`).
- [~] `generic_mcp` route is intentionally disabled by product decision.
- [x] Delegated child tasks include explicit structured context envelope tags.
- [x] Domain-agent outputs are parsed and validated as JSON with safe fallback envelope.

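The "parsed and validated as JSON with safe fallback envelope" item can be pictured with a minimal sketch like the one below. Only the five field names come from the checklist; the function name `parse_expert_output`, the fallback `status` value `"error"`, and the default values are assumptions for illustration, not the actual implementation.

```python
import copy
import json
from typing import Any

# Field names come from the checklist; the default values are assumed.
ENVELOPE_DEFAULTS: dict[str, Any] = {
    "status": "error",
    "evidence": [],
    "next_step": None,
    "missing_fields": [],
    "assumptions": [],
}


def parse_expert_output(raw: Any) -> dict[str, Any]:
    """Return a complete envelope and never raise on malformed expert output."""
    envelope = copy.deepcopy(ENVELOPE_DEFAULTS)
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = None
    if not isinstance(data, dict) or "status" not in data:
        # Fallback envelope: keep the raw text so the supervisor can still inspect it.
        envelope["evidence"] = [str(raw)]
        return envelope
    # Copy only the known contract fields; unknown keys are dropped.
    for key in ENVELOPE_DEFAULTS:
        if key in data:
            envelope[key] = data[key]
    return envelope
```

With this shape the supervisor can always read `status` and `missing_fields`, whether or not the expert returned valid JSON.
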
## 3) Middleware / Runtime

- [x] Supervisor middleware stack mirrors SurfSense shell used by `new_chat` for core protections.
- [~] `SubAgentMiddleware` intentionally omitted (multi-agent architecture uses explicit specialists).
- [~] `PermissionMiddleware` intentionally omitted by decision (route gating used instead).
- [x] Action-log / compaction / retry / fallback / filesystem / KB middleware are wired for supervisor path.
- [x] Agent graph compile path uses `asyncio.to_thread` for heavy build operations.

## 4) Entry-Point Wiring

- [x] Authenticated streaming path can route to `create_multi_agent_chat` via feature flag (`MULTI_AGENT_CHAT_ENABLED`).
- [x] Resume streaming path can route to `create_multi_agent_chat` via feature flag.
- [~] Authenticated stream falls back to `new_chat` when `disabled_tools` is provided (multi-agent does not yet implement disabled-tool filtering parity).
- [ ] Anonymous stream path wired to multi-agent (left unchanged for now due to anonymous tool allow-list differences).

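The routing rules in this section can be summarized as a small decision function. This is a sketch under assumptions: `StreamRequest` and `select_agent_factory` are hypothetical names, and the flag is passed in as a plain boolean because the checklist does not say how `MULTI_AGENT_CHAT_ENABLED` is read. Only the two factory names appear in the codebase.

```python
from dataclasses import dataclass, field


@dataclass
class StreamRequest:
    """Hypothetical request shape holding only the fields that affect routing."""

    authenticated: bool
    disabled_tools: list[str] = field(default_factory=list)


def select_agent_factory(request: StreamRequest, multi_agent_enabled: bool) -> str:
    """Return the name of the agent factory a stream should use."""
    if not multi_agent_enabled:
        return "create_surfsense_deep_agent"
    if not request.authenticated:
        # Anonymous path is not wired to multi-agent (tool allow-list differences).
        return "create_surfsense_deep_agent"
    if request.disabled_tools:
        # No disabled-tool filtering parity yet, so fall back to new_chat.
        return "create_surfsense_deep_agent"
    return "create_multi_agent_chat"
```

Keeping the decision in one pure function makes the two `[~]` / `[ ]` gaps above easy to cover with unit tests before they are closed.
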
## 5) Observability and Validation Readiness

- [x] Ready for manual LangSmith trace inspection once `MULTI_AGENT_CHAT_ENABLED=true`.
- [ ] Formal routing eval harness and benchmark dataset.
- [ ] Automated regression checks in CI for routing quality.

## 6) Manual Benchmark Readiness Decision

Status: **Ready for manual benchmarking in authenticated flows**.

Still required before declaring it "better than `new_chat`":
- Build and run formal eval/benchmark harness.
- Close anonymous-path and disabled-tools parity gaps if they are in benchmark scope.

@@ -1,122 +0,0 @@
# `multi_agent_chat`: layout & alignment with `new_chat`

## Mission

**Preserve** everything that makes SurfSense chat agents production-grade in `new_chat` (KB, middleware, tools, prompts, safety, observability). **Rework** how those pieces are composed: a clearer **multi-agent** layout (supervisor + domain slices + routing), less accidental coupling, and one explicit assembly path, so that the agent stays **excellent** (correct tools, grounded KB, safe permissions, debuggable traces), not just “different folders.”

Implementation strategy: **reuse `new_chat` modules** (middleware classes, tool factories, KB helpers, prompts composer pieces) wherever possible; **`multi_agent_chat` owns structure and wiring**, not reimplemented business logic.

---

## What we must not lose from `new_chat` (capability inventory)

Use this as a checklist when porting middleware/KB into `multi_agent_chat`. Items map to `surfsense_backend/app/agents/new_chat/`.

| Area | Capabilities to preserve | Typical locations |
|------|-------------------------|-------------------|
| **KB & documents** | Hybrid search → priority docs → lazy XML load; workspace tree; anon-document path; KB persistence / commit staging | `middleware/knowledge_search.py`, `knowledge_tree.py`, `kb_persistence.py`, `anonymous_document.py`, `tools/knowledge_base.py`, `search_surfsense_docs.py` |
| **Filesystem** | Virtual FS, backends, path resolver, file intent | `middleware/filesystem.py`, `filesystem_backends.py`, `path_resolver.py`, `file_intent.py` |
| **Memory & context** | Memory injection, team/private protocols, context schema | `middleware/memory_injection.py`, `prompts/base/memory_protocol_*.md`, `context.py` |
| **Safety & quality** | Permissions, doom-loop detection, dedup HITL tool calls, tool-call repair, action logging | `middleware/permission.py`, `doom_loop.py`, `dedup_tool_calls.py`, `tool_call_repair.py`, `action_log.py` |
| **Model / context limits** | Compaction, context editing / spill, summarization, model & tool call limits, retries / fallback | `middleware/compaction.py`, `context_editing.py`, `chat_deepagent.py` stack |
| **Concurrency & ops** | Busy mutex (single-flight turns), OTel spans | `middleware/busy_mutex.py`, `otel_span.py` |
| **Skills & subagents** | Skills backends, subagent specs and wrapping patterns | `middleware/skills_backends.py`, `subagents/` |
| **Tools** | Async registry, connector gating, MCP loading, feature-flagged tools | `tools/registry.py`, `feature_flags.py`, `tools/mcp_tool.py` |
| **Prompts** | Composer, provider fragments, tool routing (KB vs live connectors), citations | `prompts/composer.py`, `prompts/base/tool_routing_*.md`, `system_prompt.py` |
| **Runtime** | Checkpointer, LLM config, `create_agent` + middleware ordering discipline | `checkpointer.py`, `llm_config.py`, `chat_deepagent.py` |

Not every row applies to the **first** multi-agent graph (e.g. you may start with a subset of middleware). The rule is: **if `new_chat` does it for correctness or safety, we either reuse it or consciously document why this graph differs.**

---

## Rework principles (better arrangement, same substance)

1. **Expert agents**: `expert_agent/builtins/` holds broad registry **categories** (e.g. research, deliverables), not a single vendor. `expert_agent/connectors/` holds **external integrations** (one package per product route: Discord, Notion, Gmail, …), each using the same pattern: `slice_tools.py` (registry subset or factories) + `domain_prompt.md` + `agent.py`. Cross-cutting helpers live in `core/` or are imported from `new_chat`.
2. **Explicit graphs**: supervisor vs domain agents vs routing tools are **named** and testable; avoid one opaque megagraph where behavior is hard to reason about.
3. **Single composer**: integration eventually mirrors `create_surfsense_deep_agent` in spirit: **one factory** that attaches middleware, KB, and tools in documented order (see `chat_deepagent.py` comments on ordering).
4. **No duplicate KB pipelines**: align with `KnowledgePriorityMiddleware` / tree semantics; don’t invent a second hybrid-search path for the same turn.
5. **Parity tests**: when wiring completes, compare behavior against `new_chat` for the same user message + search space where scopes overlap (KB snippet quality, tool allow/deny).

---

## Supervisor vs domain agents: tools and context

**Supervisor (orchestrator)**

- Keeps a **small tool surface**: one **routing** tool per builtin category (`research`, `memory`, …) and per connector route (`notion`, `gmail`, …), **not** the full flat `registry.py` tool list.
- **KB** should primarily benefit the model via **`new_chat`-style middleware** (e.g. hybrid priority docs → state / system adjunct), not by stacking redundant search tools, unless product explicitly requires them.
- **Single hybrid search per user turn** at this layer when possible: full retrieval is expensive; avoid running it again inside every sub-agent for the same message.
- Does **not** own **on-demand connector discovery** (e.g. `get_connected_accounts`): orchestration is route-by-intent, not ID resolution.

**Domain agents (every connector slice, same shape)**

- Carry tools built from **`new_chat`** (`registry` subsets via `build_registry_tools_for_category` per `TOOL_NAMES_BY_CATEGORY`, plus MCP merge where applicable).
- **Curated context belongs in the task message**: when the supervisor calls **any** routing tool, the handler composes the child’s task string so that it includes **only** what that domain needs (KB snippets, constraints, distilled facts). This context is folded into the wording of the task itself; the full parent transcript is not passed along. The sub-agent `invoke` stays a tight payload (`messages` + task content); domain middleware can still add connector-local hints. Still **no second full hybrid search** for the same turn unless the subdomain explicitly needs a new query.
- **Middleware here** still fits **domain-only** grounding (connector availability, search-space hints, metadata) shared across tools in that subgraph. Reuse or thin-wrap `new_chat.middleware` where it applies to a subgraph.
- **Reactive discovery** (resolve a service id mid-task) stays a **tool** on that domain (or shared factory), e.g. `get_connected_accounts` when the model needs it. It is not something the supervisor must call.

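The "curated context belongs in the task message" rule can be sketched as a small string builder. This is illustrative only: the real `compose_child_task` signature is not shown in this document, so the function below is deliberately named `compose_child_task_sketch`, and its parameters and the `<task>`, `<kb_context>`, and `<constraints>` tag names are assumptions.

```python
from collections.abc import Sequence


def compose_child_task_sketch(
    goal: str,
    kb_snippets: Sequence[str] = (),
    constraints: Sequence[str] = (),
    max_snippets: int = 3,
) -> str:
    """Build one task string containing only what the domain agent needs."""
    parts = [f"<task>{goal.strip()}</task>"]
    if kb_snippets:
        # Cap the snippets so the child never receives the full retrieval result.
        kept = "\n".join(f"- {s.strip()}" for s in kb_snippets[:max_snippets])
        parts.append(f"<kb_context>\n{kept}\n</kb_context>")
    if constraints:
        rules = "\n".join(f"- {c.strip()}" for c in constraints)
        parts.append(f"<constraints>\n{rules}\n</constraints>")
    return "\n".join(parts)
```

The parent transcript never enters the function, so the child payload stays small by construction rather than by convention.
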
**Tool grouping by category**

- Group “horizontal” registry tools by **job** (research, deliverables, creative, …) into **separate compiled subgraphs**; the supervisor gets **one routing tool per category** (subagents-as-tools), matching LangChain multi-agent guidance. As discussed previously, the supervisor should not carry all 10 tools that are not connector-gated.

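A minimal sketch of "one routing tool per category" follows. It uses plain dataclasses rather than any agent framework, and every name in it is hypothetical: `CATEGORY_TOOLS_EXAMPLE` only stands in for `TOOL_NAMES_BY_CATEGORY` (the real keys and tool names may differ), and `RoutingTool` and `build_routing_tools` are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for TOOL_NAMES_BY_CATEGORY; tool names are placeholders.
CATEGORY_TOOLS_EXAMPLE: dict[str, list[str]] = {
    "research": ["placeholder_search", "placeholder_scrape"],
    "deliverables": ["placeholder_report"],
    "creative": [],
}


@dataclass(frozen=True)
class RoutingTool:
    """What the supervisor sees: a category name and a description, nothing else."""

    name: str
    description: str


def build_routing_tools(categories: dict[str, list[str]]) -> list[RoutingTool]:
    """Expose one supervisor-facing tool per non-empty category."""
    return [
        RoutingTool(name=category, description=f"Delegate a task to the {category} expert.")
        for category, tool_names in sorted(categories.items())
        if tool_names  # a category with no available tools gets no route
    ]
```

The low-level tool names never reach the supervisor; they stay inside each category's compiled subgraph.
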
### KB + virtual filesystem: where it belongs

In `new_chat`, KB + **virtual FS** (`KnowledgePriorityMiddleware`, tree, **`SurfSenseFilesystemMiddleware`** / **`KBPostgresBackend`**) serves the **orchestrator** that may **read and traverse** the workspace.

**Connector domain agents** are **not** mini-parents: the **supervisor** should already decide *what* to do and pass a **clear task** (plus any curated KB snippet folded into **`compose_child_task`**). The specialist runs **connector APIs**, not a second document crawl. Duplicating full KB+VFS on every domain subgraph **shifts the parent’s exploration work onto the wrong agent** and adds noise.

So **no child-side filesystem stack by default** for narrow connector subgraphs unless product demands it. Reserve **KB + VFS on a subgraph** for roles that **actually** need heavy document work (research, coding/explore-style agents, deliverables that grep the KB), matching how `new_chat` uses specialists.

---

## Inspiration map (`new_chat` → `multi_agent_chat`)

| Concern in `new_chat` | Primary references | Role in `multi_agent_chat` |
|----------------------|-------------------|---------------------------|
| **Main factory** | `chat_deepagent.py` (`create_surfsense_deep_agent`) | `integration/create_multi_agent_chat.py`: eventual single composer after KB + middleware land |
| **Tool lists** | `tools/registry.py`, `build_tools_async` | **`expert_agent/builtins/`**: category bundles (research, deliverables). **`expert_agent/connectors/`**: per-integration graphs (may use hand-written factories or registry subsets). |
| **Middleware stack** | `chat_deepagent.py` → `_build_compiled_agent_blocking`, `middleware/*.py` | **Planned:** `middleware/`: compose `create_agent(..., middleware=[...])` on supervisor and/or domain graphs; reuse or thin-wrap `new_chat.middleware` (ordering matters: see `new_chat` comments, e.g. BusyMutex → OTel → KB priority → filesystem → …) |
| **KB / hybrid search** | `middleware/knowledge_search.py` (`KnowledgePriorityMiddleware`), `middleware/knowledge_tree.py`, `tools/knowledge_base.py` | **Planned:** hybrid priority **once per user turn** at orchestrator; **curated KB/context folded into the routing task message** to children (no second full search for the same message unless explicitly scoped otherwise). |
| **Prompts** | `prompts/composer.py`, `prompts/base/*`, provider fragments | Vertical **`domain_prompt.md`** per slice + **`supervisor/supervisor_prompt.md`**; optional later: thin composer that injects KB/tool-routing fragments like `tool_routing_*.md` |
| **Context / checkpointer** | `context.py`, `checkpointer.py` | Pass **`Checkpointer`** into `create_multi_agent_chat` / `build_supervisor_agent`; align thread IDs with route layer when wired |
| **Subagent middleware** | `subagents/config.py` (`_wrap_with_subagent_essentials`) | Domain agents may eventually take **`middleware=`** on `create_agent` mirroring “inherit parent essentials + local rules” |

---

## Current package tree

```
multi_agent_chat/
  __init__.py

  core/                       # one concern per subfolder (SRP)
    prompts/                  # read_prompt_md: markdown next to packages
    agents/                   # build_domain_agent: compile subgraph + prompt
    delegation/               # compose_child_task: supervisor → child message
    invocation/               # extract_last_assistant_text: invoke result parsing
    bindings/                 # connector_binding: DB/search-space kwargs (not expert_agent.connectors vendors)
    registry/                 # TOOL_NAMES_BY_CATEGORY, build_registry_tools_for_category, build_registry_dependencies

  expert_agent/
    builtins/                 # broad categories: research, deliverables
    connectors/               # one subgraph per vendor route (see TOOL_NAMES_BY_CATEGORY keys)

  routing/
    domain_routing_spec.py
    from_domain_agents.py
    supervisor_routing.py

  supervisor/
    supervisor_prompt.md
    graph.py

  integration/
    create_multi_agent_chat.py
```

---

## References

- LangChain: [Multi-agent](https://docs.langchain.com/oss/python/langchain/multi-agent), [Subagents](https://docs.langchain.com/oss/python/langchain/multi-agent/subagents).
- Internal: `surfsense_backend/app/agents/new_chat/chat_deepagent.py`, `middleware/`, `tools/registry.py`.

@@ -1,159 +0,0 @@
# Multi-Agent Prompt Tuning Playbook

This playbook defines how to tune `multi_agent_chat` prompts so that it outperforms `new_chat` on delegation quality, reduces confusion, and keeps tool behavior stable.

It is intentionally architecture-aware: this system is a **supervisor + expert tools** pattern, not a single flat tool agent.

## Why this matters in our architecture

- The supervisor only sees **routing tools** (e.g. `research`, `gmail`, `calendar`), not low-level connector APIs.
- Experts are invoked through `routing/from_domain_agents.py` and receive a single natural-language task via `compose_child_task(...)`.
- Because expert context is compact and delegated, prompt quality is the primary control lever for routing accuracy and downstream tool behavior.

## Authoritative guidance we should follow

- Anthropic prompt engineering best practices (role clarity, XML structure, explicit tool-use policy, few-shot examples): [Anthropic docs](https://docs.anthropic.com/en/docs/use-xml-tags)
- OpenAI function-calling reliability guidance (clear tool descriptions, when/when-not tool usage, small callable surface): [OpenAI function calling guide](https://developers.openai.com/docs/guides/function-calling)
- OpenAI prompt engineering (instruction hierarchy and explicit output contracts): [OpenAI prompt engineering guide](https://developers.openai.com/api/docs/guides/prompt-engineering)
- LangChain supervisor/subagents guidance (clear subagent names/descriptions, context engineering, routing intent): [LangChain supervisor docs](https://docs.langchain.com/oss/python/langchain/supervisor), [LangChain subagents docs](https://docs.langchain.com/oss/python/langchain/multi-agent/subagents)

## Current weakness audit (as of now)

- `supervisor/supervisor_prompt.md` is short and does not define a decision policy for ambiguous or multi-domain tasks.
- Most expert `domain_prompt.md` files are one-line role statements with no:
  - scope boundaries and refusal policy,
  - parameter-resolution behavior,
  - completion criteria (what must be returned),
  - failure handling rules,
  - concrete examples.
- Tool descriptions in routing are generic ("Pass a clear natural-language task"), which weakens handoff quality.

## Prompt design standards (required)

Apply these standards to the supervisor prompt and every expert prompt.

1. **Role + objective first**
   - One sentence for identity.
   - One sentence for success criterion.

2. **Explicit routing/usage rules**
   - Tell the model when to use this agent/tool.
   - Tell it when not to use it.
   - Include an ambiguity fallback ("ask one clarifying question" or "apply conservative default X").

3. **Structured task contract**
   - Require concise but complete execution reports.
   - Require IDs/links/timestamps when tool outputs produce them.
   - For no-op paths, explain why no action was taken.

4. **Safety + reliability contract**
   - Never fabricate tool results.
   - Never claim action if no successful tool call happened.
   - Surface irreversible/risky actions clearly.

5. **Few-shot examples**
   - Include 2-4 minimal examples per domain:
     - direct success,
     - ambiguous input,
     - out-of-scope reroute.

6. **Concise formatting rules**
   - Avoid verbosity.
   - Keep the response structure stable; it improves orchestration and observability.

## Supervisor prompt blueprint

The supervisor prompt should contain these sections in order:

1. `Role`
2. `Available experts` (name + scope + non-scope)
3. `Delegation policy`
   - single-domain -> one expert
   - multi-domain -> sequence or parallel where independent
   - no expert needed -> answer directly
4. `Task-writing policy` for delegated calls
   - include user goal, constraints, success criteria
   - include only needed context
5. `Result synthesis policy`
   - merge expert outputs into one user-facing response
   - preserve concrete identifiers from expert outputs
6. `Failure policy`
   - retry on recoverable mismatch
   - ask clarifying question when required field is missing

## Expert prompt blueprint (per domain)

Each `domain_prompt.md` should include:

1. `Role and scope`
2. `In-scope actions` (mapped to the exact provided tools)
3. `Out-of-scope behavior` (what to return for reroute)
4. `Execution rules`
   - choose the minimum tool sequence that satisfies request
   - do not guess IDs or parameters
   - ask concise clarification only when necessary
5. `Output contract`
   - action summary
   - concrete artifacts/IDs/links generated
   - unresolved items and next step
6. `Examples` (2-4 realistic, short)

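One way to keep expert prompts aligned with this blueprint is a small lint that reports missing sections. The sketch below assumes each section appears as a Markdown heading whose text matches the blueprint name exactly (case-insensitive); that convention, and the names `REQUIRED_SECTIONS` and `missing_sections`, are assumptions rather than anything the playbook specifies.

```python
# Section names are taken from the blueprint; the heading convention is assumed.
REQUIRED_SECTIONS = (
    "Role and scope",
    "In-scope actions",
    "Out-of-scope behavior",
    "Execution rules",
    "Output contract",
    "Examples",
)


def missing_sections(prompt_markdown: str) -> list[str]:
    """Return blueprint sections that have no matching Markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in prompt_markdown.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

Run over every `domain_prompt.md`, an empty result for each file would be a cheap, mechanical part of the definition of done.
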
## Domain-specific tuning checklist

- `research`: enforce source-grounded summaries and explicit uncertainty.
- `memory`: strict save criteria (durable preference/fact only) and secret-handling policy.
- `deliverables`: require output artifact references and constraints echo.
- `gmail` / `calendar`: require recipient/date-time disambiguation policy and timezone handling.
- docs connectors (`notion`, `confluence`, `drive`, `dropbox`, `onedrive`): require exact page/file target resolution before mutating actions.
- chat connectors (`discord`, `teams`, `slack`): require channel/thread context clarity before send actions.
- MCP experts: require strict tool-description adherence and no assumption about unavailable endpoints.

## Tool description tuning rules (routing layer)

Routing tool descriptions should include:

- best-fit task types,
- disallowed task types,
- required task payload hints (e.g. "include recipient + intent + constraints"),
- expected result shape.

This is especially important because supervisor tool choice is heavily influenced by `name + description`.

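The four required parts of a routing tool description can be enforced by building descriptions from a fixed template instead of writing them free-form. This is a sketch only: `render_route_description`, its parameters, and the label wording are hypothetical.

```python
def render_route_description(
    best_fit: list[str],
    disallowed: list[str],
    payload_hint: str,
    result_shape: str,
) -> str:
    """Render a routing tool description with all four required parts."""
    if not best_fit or not disallowed:
        # Refuse to emit a description without explicit decision boundaries.
        raise ValueError("best_fit and disallowed must both be non-empty")
    lines = [
        "Use for: " + "; ".join(best_fit) + ".",
        "Do not use for: " + "; ".join(disallowed) + ".",
        "Task must include: " + payload_hint + ".",
        "Returns: " + result_shape + ".",
    ]
    return "\n".join(lines)
```

For example, a `gmail` route could pass `payload_hint="recipient + intent + constraints"`, echoing the hint quoted in the list above.
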
## Evaluation plan (before wiring to production)

Create a prompt eval set with at least 20 tasks:

- 8 single-domain tasks,
- 6 ambiguous tasks (should clarify or route conservatively),
- 6 multi-domain tasks (should sequence experts correctly).

Track:

- routing accuracy,
- unnecessary delegation rate,
- tool-call success rate,
- clarification precision (ask only when needed),
- final answer completeness.

Use the same test set against:

- current prompts,
- tuned prompts v1,
- tuned prompts v2.

Promote only when v2 improves routing accuracy and reduces unnecessary delegation with no regression in tool-call success.

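Three of the tracked metrics and the promotion rule are simple enough to sketch directly. Everything here is an assumption about how results might be recorded: `EvalResult`, `summarize`, and `should_promote` are hypothetical names, and an expected route of `None` is used to mean "answer directly, no delegation". Clarification precision and answer completeness need human or model grading and are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    """One eval task outcome; a route of None means 'answered directly'."""

    expected_route: Optional[str]
    actual_route: Optional[str]
    tool_call_ok: bool


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Compute routing accuracy, unnecessary delegation rate, tool-call success rate."""
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    correct = sum(r.expected_route == r.actual_route for r in results)
    unnecessary = sum(
        r.expected_route is None and r.actual_route is not None for r in results
    )
    tool_ok = sum(r.tool_call_ok for r in results)
    return {
        "routing_accuracy": correct / total,
        "unnecessary_delegation_rate": unnecessary / total,
        "tool_call_success_rate": tool_ok / total,
    }


def should_promote(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """Promotion rule: better routing, less delegation, no tool-call regression."""
    return (
        candidate["routing_accuracy"] > baseline["routing_accuracy"]
        and candidate["unnecessary_delegation_rate"] < baseline["unnecessary_delegation_rate"]
        and candidate["tool_call_success_rate"] >= baseline["tool_call_success_rate"]
    )
```

Running `summarize` once per prompt version on the same task set gives three directly comparable dictionaries.
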
## Immediate implementation plan

1. Rewrite `supervisor/supervisor_prompt.md` using the supervisor blueprint.
2. Rewrite all expert `domain_prompt.md` files with the expert blueprint.
3. Upgrade routing tool descriptions in `routing/supervisor_routing.py` to add "when to use / when not to use".
4. Add a lightweight prompt eval script or fixture set for reproducible tuning.

## Definition of done

- Every supervisor/expert prompt has explicit scope, failure policy, and output contract.
- Every route description encodes clear decision boundaries.
- Prompt eval shows measurable gains in routing accuracy and a lower unnecessary-delegation rate.
- Team can iterate prompt versions without changing core orchestration code.

@@ -28,6 +28,7 @@ from sqlalchemy import func
 from sqlalchemy.future import select
 from sqlalchemy.orm import selectinload
 
+from app.agents.multi_agent_chat.integration import create_multi_agent_chat
 from app.agents.new_chat.chat_deepagent import create_surfsense_deep_agent
 from app.agents.new_chat.checkpointer import get_checkpointer
 from app.agents.new_chat.filesystem_selection import FilesystemMode, FilesystemSelection
@@ -38,7 +39,6 @@ from app.agents.new_chat.llm_config import (
     load_agent_config,
     load_global_llm_config_by_id,
 )
-from app.agents.multi_agent_chat.integration import create_multi_agent_chat
 from app.agents.new_chat.memory_extraction import (
     extract_and_save_memory,
     extract_and_save_team_memory,
@@ -69,6 +69,7 @@ from app.utils.user_message_multimodal import build_human_message_content
 
 _background_tasks: set[asyncio.Task] = set()
 _perf_log = get_perf_logger()
+logger = logging.getLogger(__name__)
 
 
 def format_mentioned_surfsense_docs_as_context(