diff --git a/docs/adr/0001-rag-citation-and-context-architecture.md b/docs/adr/0001-rag-citation-and-context-architecture.md deleted file mode 100644 index c377880eb..000000000 --- a/docs/adr/0001-rag-citation-and-context-architecture.md +++ /dev/null @@ -1,634 +0,0 @@ -# ADR 0001 — RAG, Citation, and Context Architecture - -- **Status:** Proposed -- **Date:** 2026-06-24 -- **Owners:** SurfSense core -- **Supersedes:** the pre-agent KB priority/planner injection path - ---- - -## 1. Context & problem - -SurfSense answers questions over a user's indexed knowledge base (documents, -chats, connectors, web results). The current pipeline causes the model to -**hallucinate citations and answers**. Root causes identified during review: - -- **Content/ID split.** The model is asked to author or copy complex identifiers - (`chunk_id`, raw URLs, free-text titles) that sit far from the content they - label. LLMs reliably corrupt nearby digits — so citations point at the wrong - source or at nothing. -- **Pre-agent work.** A planner LLM call + embedding + hybrid search runs in - `before_agent` on every turn (`KnowledgePriorityMiddleware`), plus an eager - `fetch_mentioned_documents` whose chunks are then **discarded**. This adds - latency and context noise before the agent even reasons. -- **Mentions are mismanaged.** An `@document` mention forces a wasted full-chunk - fetch, points at the doc **twice** (inline backtick path + `` - entry), and still requires a read round-trip — then dumps the **whole** doc - regardless of the question. -- **Retrieval quality.** Search retrieves on chunks but collapses to documents, - chunks have **no overlap**, and the reranker exists (`RerankerService`) but is - **not wired** into the agent path. -- **Context bloat.** The workspace tree (up to 4000 tokens) and priority lists are - injected into the durable `messages` list every turn, causing context - distraction/confusion. - -This ADR defines the target architecture. It is the **single source of truth**; -implementation issues should reference section numbers here. - ---- - -## 2. Principles - -1. **The model cites tiny numbers `[n]`, never identifiers.** The server owns the - mapping from `[n]` to a real source. There is nothing for the model to invent. -2. **Retrieval is pull-based, behind tools.** Nothing retrieves before the agent - runs. The agent calls a tool when it needs information. -3. **A mention is scope, not a retrieval trigger.** Mentioning a thing tells the - model the thing exists and gives it a filter it *may* apply — it does not fetch. -4. **Ambient context is not conversation.** Transient per-turn context (tree, - mention scope, memory) is rendered via the system prompt, not appended to the - durable `messages` trajectory. -5. **All complexity lives server-side** (resolver, retriever), so the model's job - stays trivial: read passages, echo the number next to the one you used. - ---- - -## 3. Citation architecture (the spine) - -Everything hangs off this. Build it first. - -### 3.1 What is citable - -Anything that is *information retrieved from a source*. Each source type has a -natural **citable unit**: - -| Source | Citable unit | Entry locator | Enters context via | -|---|---|---|---| -| `kb_chunk` | chunk | `document_id` + `chunk_id` | `search_knowledge_base` | -| `kb_document` | document | `document_id` | `read` (whole doc) | -| `connector_item` | item | `connector_id` + `external_id` | connector tool | -| `web_result` | url | `url` | web search / crawl | -| `chat_turn` | turn | `thread_id` + `message_id` | `@chat` / referenced chat | -| `anon_chunk` | chunk | `session/doc` + `chunk_id` | uploaded anonymous doc | - -**Not citable** (control/pointer — never gets a number): workspace tree, mention -scope notes, `report_context`, the priority/registry listing itself. - -### 3.2 The citation entry (the truth) - -A registered entry is the durable identity of a citable unit: - -```python -class CitationEntry(TypedDict): - n: int # the tiny label shown to the model - source_type: str # "kb_chunk" | "kb_document" | "connector_item" - # | "web_result" | "chat_turn" | "anon_chunk" - locator: dict[str, Any] # source-specific identity (see table 3.1) - display: dict[str, Any] # title, source label, url, date — for the UI pill -``` - -### 3.3 The registry (the bookkeeping) - -Lives in agent **state** so it survives across turns and across orchestrator + -subagents. - -```python -class CitationRegistry(TypedDict): - by_n: dict[int, CitationEntry] # n -> entry (resolve direction) - by_key: dict[str, int] # source_key -> n (dedup / find-or-create) - next_n: int # monotonic counter -``` - -- **`source_key`** is a stable string derived from `(source_type, locator)`, e.g. - `"kb_chunk:42:880"`, `"web_result:https://…"`, `"chat_turn:7:1190"`. -- **Numbering is per-conversation and monotonic.** A given `[n]` never changes - meaning within a conversation. -- **Dedup:** registering an already-seen unit returns its existing `n`. - -### 3.4 The two operations - -```python -def register(registry, source_type, locator, display) -> int: - """Find-or-create. Returns the [n] for this unit.""" - key = make_key(source_type, locator) - if key in registry["by_key"]: - return registry["by_key"][key] - n = registry["next_n"] - registry["next_n"] += 1 - registry["by_n"][n] = {"n": n, "source_type": source_type, - "locator": locator, "display": display} - registry["by_key"][key] = n - return n - -def resolve(registry, n) -> CitationEntry | None: - """Map a model-emitted [n] back to its source. Unknown n -> None (drop).""" - return registry["by_n"].get(n) -``` - -### 3.5 Lifecycle - -``` -source yields item - → register(entry) # source_type + locator + display → assign/reuse [n] - → render passage with [n] # the number sits INLINE next to the content - → model writes "...March 10 [n]" - → resolver: [n] → entry # server-side, on the streamed answer - → frontend renders citation pill -``` - -The model only ever **echoes** a number that was printed next to the content it -used. Unknown/garbled numbers resolve to nothing and are dropped (abstention by -construction). - -### 3.6 Presentation format (``) - -`[n]` must be the **only** citable integer adjacent to each passage. No -`chunk 4 of 19`, no raw ids near the text. Grouping by document is allowed; the -`[n]` is per passage. - -``` - -Excerpts retrieved from the user's knowledge base for this query. -Cite a passage with its [n]. - -Document: "Q3 Launch Notes" (Slack · #launch · 2026-03-02) - [1] We agreed to push launch to March 10. - [2] Marketing will be notified next week. -Document: "Timeline" (Notion · 2026-02-28) - [3] Dates floated were Mar 10 and Mar 17. - -``` - -### 3.7 Reconciliation with the existing token format - -The frontend and evals already parse **`[citation:ID]`** -(`surfsense_web/lib/citations/citation-parser.ts`, -`surfsense_evals/src/surfsense_evals/core/parse/citations.py`). - -**Decision:** keep the wire token `[citation:ID]` where `ID = n`. The model is -instructed to emit `[n]`; a thin normalization step rewrites `[n]` → -`[citation:n]` on the streamed output before it reaches the existing parser, OR -the model is instructed to emit `[citation:n]` directly. Either way `ID` is now a -**small ordinal from the registry**, not a `chunk_id`/url/title. The resolver maps -`n` → `CitationEntry` → the frontend citation object the UI already expects. - -> **Decided (§8.8):** the model emits `[n]` (smallest surface for the model to -> get right); the server normalizes `[n]` → `[citation:n]` before the existing -> parser. - ---- - -## 4. Retrieval architecture (pull-based) - -### 4.0 Execution channels (verified against the codebase) - -The orchestrator (main agent) does **not** own the virtual filesystem. It has a -small fixed toolset; everything else is delegated via `task(, …)`. -Verified in `main_agent/tools/index.py` and `subagents/builtins/knowledge_base`. - -| Capability | Owner | Reached via | -|---|---|---| -| `search_knowledge_base(query, scope?)` — semantic/hybrid **RAG retrieval**, read-only | **orchestrator** | direct call | -| `web_search`, `scrape_webpage` | **orchestrator** | direct call | -| `update_memory`, `create_automation`, `write_todos`, `task` | **orchestrator** | direct call | -| virtual filesystem: `read_file`, `write_file`, `edit_file`, `ls`, `glob`, `grep`, `list_tree`, `rm`, `rmdir`, `move_file` | **knowledge_base subagent** | `task(knowledge_base, …)` | -| connector ops (gmail/slack/jira/…) | **connector subagents** | `task(, …)` | - -Consequences for citations: - -- The **dominant RAG path is orchestrator-direct** (`search_knowledge_base`), so - it registers `[n]` exactly where the answer is composed — **no relay**. -- The **shared registry** (§8.9) is load-bearing only for the **delegated** lanes - (whole-doc reads via `knowledge_base`, connector reads): the subagent registers - into the shared registry and relays `[n]` upward. -- `search_knowledge_base` is **semantic RAG**, distinct from filesystem search - (`grep`/`glob`), which belongs to the subagent. `routing.md` conflates these and - omits `search_knowledge_base` from its direct-tools list — that prompt is stale - and must be corrected (see §7). - -### 4.1 The two retrieval operations - -| Operation | Tool | Owner | For | -|---|---|---|---| -| **search** | `search_knowledge_base(query, scope?)` → chunks, each registered → `[n]` | orchestrator (direct) | "related / scoped question" — RAG | -| **read** | `read_file(path)` (whole object) | knowledge_base subagent (`task`) | "summarize / translate / rewrite / navigate this" | - -The agent chooses based on the query. No server-side intent classifier; the query -semantics decide (summarize ⇒ delegate a `read`; related ⇒ direct `search`). - -### 4.2 `scope` — the mention→retrieval bridge - -`scope` is an **optional typed filter** restricting the search haystack: - -```python -scope = { - "document_ids": [42], - "folder_ids": [], - "connector_ids": [], -} -``` - -- Becomes `WHERE` constraints on the chunk search (`document_id IN (...)`, etc.). -- **Agent-controlled, not automatic.** "in this doc" → agent passes scope; "related" - → agent omits it. -- Spans only **KB-indexed** references (doc/folder/connector). Chats are **not** - KB-indexed (no `CHAT` document type; they live in `NewChatThread` / - `NewChatMessage`, not `Document`/`Chunk`), so `@chat` never appears in `scope` — - it uses the separate read channel in §5. -- **How it reaches the retriever depends on the channel:** - - direct `search_knowledge_base` → `scope` is a **structured tool arg** the - orchestrator passes (new arg to add — current tool has no `scope`). - - delegated `read` / browse → the orchestrator expresses scope in the **task - prompt** (path + ids); the subagent translates it into its filesystem calls. - -**Decision:** even when `scope` pins a single doc, `search_knowledge_base` still -runs full hybrid ranking *within* that doc (a large doc still needs its relevant -passages surfaced) — it does not return raw chunk order. - -### 4.3 Retrieval quality fixes (folded into this work) - -- Return at **chunk granularity** with stable `chunk_id` (no collapse-to-document - that loses the citable unit). -- **Wire the reranker** (`RerankerService`) into the `search_knowledge_base` path. -- **Chunk overlap** in the indexing pipeline (config in `app/config/__init__.py`, - `RecursiveChunker` currently has no overlap). -- Add the `scope` arg to `search_knowledge_base`. - -### 4.4 End-to-end pipeline - -```mermaid -flowchart TD - U["User turn + @mentions"] --> AMB["Mentions → ambient scope note (no fetch)"] - AMB --> ORCH{"ORCHESTRATOR reasons"} - - ORCH -- "scoped/related question" --> SKB["search_knowledge_base(query, scope?)
DIRECT · hybrid + rerank"] - ORCH -- "public web" --> WEB["web_search / scrape_webpage
DIRECT"] - ORCH -- "summarize/read/navigate/mutate" --> TKB["task(knowledge_base, …)
DELEGATE"] - ORCH -- "connector op" --> TCN["task(gmail/slack/…)
DELEGATE"] - - SKB --> REGD["register kb_chunk → [n]"] - WEB --> REGD2["register web_result → [n]"] - - subgraph SUB["SUBAGENTS (filesystem / connector tools)"] - FS["read_file/ls/glob/grep/…"] - CN["connector ops"] - FS --> REGS["register → [n] (SHARED registry)"] - CN --> REGS - REGS --> SYN["synthesize + relay [n] up"] - end - - TKB --> FS - TCN --> CN - - REGD --> COMPOSE["Orchestrator composes answer with [n]"] - REGD2 --> COMPOSE - SYN --> COMPOSE - COMPOSE --> NORM["[n] → [citation:n]"] --> RESOLVE["resolve via shared registry
(unknown → dropped)"] --> UI["Citation pills"] -``` - -### 4.5 Tradeoffs: pull vs push (and perceived latency) - -We chose **pull** (the agent reads/searches via tools when needed) over **push** -(eagerly injecting referenced content into context). Rationale and costs: - -**Why pull is the default** - -- Token efficiency — fetch only what the query needs, not whole docs. -- Scales to many/large mentions, folders, connectors — push cannot. -- Intent-adaptive granularity — passages for scoped Qs, whole doc for summaries. -- Context hygiene — content arrives as *evidence* (`[n]`), not ambient noise. -- Uniform across all mention types. - -**Costs (and why they're acceptable)** - -- **Perceived latency (TTFT).** Pull adds a tool round-trip before answer tokens. - This is the only place push clearly wins. The mitigation is **progress - streaming** (time-to-first-*signal*, not first-*token*): stream "Reading - *Q3 Launch Notes*…" / "Searching your knowledge base…" so the wait feels - productive — the pattern used by Perplexity, Claude, and Cursor. - > **Out of scope for this ADR's rollout.** Progress streaming is a separate - > workstream — it touches the streaming subsystem, not the retrieval/citation - > path. Tracked as an **after-plan follow-up**. Today intermediate/subagent - > steps are largely suppressed (`surfsense:internal`), which is what makes pull - > *feel* slow; the follow-up promotes a curated subset of tool/subagent events - > to user-visible progress. -- **"Cite-without-read" risk** — neutralized structurally: ambient pointers carry - **no `[n]`**; `[n]` exists only after a tool returns evidence; invented `[n]` - resolves to nothing and is dropped. The worst residual case degrades from a - confident wrong citation to an uncited claim (further guarded by content-free - pointers + a "read before you answer" policy line). -- **Delegation synthesis loss** — whole-doc reads go through the KB subagent, - which summarizes back; mitigate by instructing it to return quotes + `[n]`. - -**Conditional hybrid.** A bounded eager fast-path (inject content only when a -single *small* doc is mentioned) may be added **later, only if** latency telemetry -justifies it — not built speculatively. - ---- - -## 5. Mention architecture (scope, not trigger) - -When the user mentions anything: - -1. It is recorded as **ambient scope** in the system prompt (via `dynamic_prompt` - + `runtime.context`), e.g.: - > Referenced this turn: doc 42 (`/documents/Launch/Q3.xml`), folder 7 - > (`/documents/Specs/`). For a scoped question call - > `search_knowledge_base(query, scope={document_ids:[42]})`; to load the whole - > thing delegate `task(knowledge_base, "read /documents/Launch/Q3.xml …")`. -2. **No fetch, no RAG, no `` pre-injection.** -3. The agent decides: direct `search_knowledge_base(query, scope)` (scoped - question) or delegated `task(knowledge_base, …)` read (whole-object intent). - -References split into **two kinds** by whether the source is searchable: - -- **Searchable references** (`@document`, `@folder`, `@connector`, anon upload) — the - source is KB-indexed, so they become `scope` and are pulled via - `search_knowledge_base` / delegated read. Pointer + pull. -- **Read references** (`@chat`) — the source is **not** KB-indexed, so there is - nothing to "search". The thread is a finite, user-selected artifact; its turns are - loaded directly (access-checked) and citable as `chat_turn`. Pointer + read. - -Per mention type (note the channel — direct vs delegated): - -| Mention | Ambient note | Retrieval behavior | Citation kind on use | -|---|---|---|---| -| `@document` | doc id + path | direct `search_knowledge_base(scope={document_ids:[id]})`, or delegated `task(knowledge_base, read …)` | `kb_chunk` / `kb_document` | -| `@folder` | folder id + path | direct `search_knowledge_base(scope={folder_ids:[id]})`, or delegated browse | `kb_chunk` | -| `@connector account` | connector_id + account | `task(, "… connector_id=id")` | `connector_item` | -| `@chat` | thread id + title | **on-demand read** (not `scope`): pointer only; model calls `read_chat(thread_id)` when it needs the conversation, reusing the access-checked `referenced_chat_context` resolver | `chat_turn` | -| anonymous upload | session doc ref | direct `search_knowledge_base(scope=anon)` / delegated read | `anon_chunk` | - ---- - -## 6. Context plane separation - -| Plane | Carries | Mechanism | Lifetime | -|---|---|---|---| -| **Ambient** | workspace tree, mention scope, memory, instructions | system prompt via `dynamic_prompt` + `runtime.context` | per-turn, not persisted in messages | -| **Evidence** | retrieved passages with `[n]` | tool results / `` | enters trajectory when a tool runs | -| **Trajectory** | user/assistant turns, tool calls | `messages` | durable, checkpointed | - -The workspace tree and priority/registry listings move **out** of `messages` into -the ambient plane. - ---- - -## 7. Cleanup (what gets removed/changed) - -Remove from the hot path: - -- `KnowledgePriorityMiddleware` search branch (planner LLM, embedding, hybrid - search in `before_agent`). ✅ **Done** — the whole `knowledge_search.py` - module is deleted. -- `fetch_mentioned_documents` eager chunk pull. -- `` pre-injection and `KbContextProjectionMiddleware` - priority projection. ✅ **Done** — `` is no longer - produced anywhere; `KbContextProjectionMiddleware` is trimmed to a pure - `` projector. The `enable_kb_priority_preinjection` flag and - every `` prompt reference are removed. -- `kb_priority` state plumbing (deleted per §8.10; add a dedicated - `citation_registry` field instead). ✅ **Done** — `kb_priority` / - `KbPriorityEntry` are removed from state + reducers. `kb_matched_chunk_ids` - is already gone (build-order Step 5). - -Keep / add: - -- `search_knowledge_base(query, scope?)` (orchestrator-direct) as the **only** RAG - entry point, returning registered chunks with `[n]`. Add the `scope` arg. -- `read_file` (knowledge_base subagent, via `task`) for whole-object ops; cited - reads register a `kb_document` / `kb_chunk` entry into the shared registry. -- The **citation registry** in state (shared across orchestrator + subagents). -- Reranker wired into `search_knowledge_base`; chunk overlap in indexing. -- Ambient mention note via `dynamic_prompt`. -- **Fix `routing.md`:** add `search_knowledge_base` to the orchestrator's - direct-tools list, and clarify that "search inside the workspace goes through - `task(knowledge_base)`" refers to **filesystem** search (`grep`/`glob`), not the - semantic `search_knowledge_base` tool. - ---- - -## 8. Locked decisions - -1. Model cites `[n]`; server owns `[n] → source` via a registry. ✅ -2. Numbering is **per-conversation, monotonic, dedup'd** (find-or-create). ✅ -3. Retrieval is pull-based: orchestrator-direct `search_knowledge_base` (RAG) + - delegated `read_file` (knowledge_base subagent); no pre-agent retrieval. ✅ -4. Mention = ambient scope; `scope` is an agent-controlled `search_knowledge_base` - filter. ✅ -5. Scoped search still runs full hybrid ranking within scope. ✅ -6. Ambient context (tree, mention scope) lives in the system prompt, not `messages`. ✅ -7. Wire token stays `[citation:ID]` with `ID = n`. ✅ -8. **Model emits `[n]`; the server normalizes `[n]` → `[citation:n]`** on the - streamed output before the existing parser. The model's surface stays minimal. ✅ -9. **Subagent retrievals register into the same conversation `citation_registry`**, - so `[n]` is globally consistent across orchestrator + subagents. This replaces - the Channel A/B relay entirely. ✅ -10. **Delete the legacy `kb_priority` / `kb_matched_chunk_ids` plumbing**; add a - dedicated `citation_registry` field to state rather than overloading old - fields. ✅ -11. **`@chat` is a non-indexed read reference** (chats aren't in `Document`/`Chunk`): - pointer only, loaded **on demand** via a `read_chat(thread_id)` tool that reuses - the access-checked `referenced_chat_context` resolver and registers each surfaced - turn as `chat_turn`. ✅ -12. **One document render for both surfaces.** RAG excerpts - (`search_knowledge_base`) and full reads (`read_file`) render through a *single* - document renderer — same envelope, same `[n]` contract. Completeness is carried - by `view="excerpt"` vs `view="full"`, **not** an `is_complete` boolean and **not** - a numeric coverage count: `view="excerpt"` alone tells the model it saw a slice. - (A `chunks_shown`/`total_chunks` count was considered and dropped — it never had a - total to show for search excerpts, and full reads already say `view="full"`.) Raw - ids and `metadata_json` are dropped from the model's view. - **No `` seek table** — a full read returns the whole document as one - numbered document block (an index keyed by internal ids gives the agent no actionable - signal, and any `[n]`-keyed/preview index adds cognitive load that risks - degrading the primary answer). Supersedes the standalone `` - shape and the removed `is_complete`. See §12. (planned) - -## 9. Open items - -_All decisions locked (§8). Decision #12 is locked but **not yet built** — see the -§12 schema and the rollout follow-ups._ - -## 10. Rollout - -### Already built in parallel (committed, not yet wired) - -`shared/citations/` (registry, markers, normalizer), `shared/retrieved_context/` -(renderer), `shared/retrieval/` (hybrid search + rerank + service), hybrid-search -behavior tests, and the on-contract prompt `base/citation_contract.md` -(`[n]` / `[1][2]`). - -### Two findings that shape the cutover - -- **The agent is already pull-based by default.** `enable_kb_priority_preinjection` - is `False` and `KnowledgePriorityMiddleware` runs `mentions_only=True`; an - on-demand `search_knowledge_base` tool already exists. So the cutover *upgrades - the existing pull tool to the citation spine* — it does not remove eager RAG - (already gated off). -- **The production citation prompt is local to the agent**, at - `main_agent/system_prompt/prompts/citations/on.md` (two-channel - `[citation:chunk_id]`). The composer's `base/citations_on.md` only serves the - anonymous/automation path. Both must learn the `[n]` contract. - -### Phased cutover - -0. **Registry on state.** Add `citation_registry: CitationRegistry` to - `SurfSenseFilesystemState` with a replace reducer; confirm checkpointer - round-trip. -1. **Swap the KB tool.** Rewrite `search_knowledge_base` to call - `search_knowledge_base_context` (renders `` with `[n]`, - mutates the registry) and persist the registry via `Command(update=...)`. -2. **Normalize `[n]` → `[citation:]`.** Finalize-time first (rewrite the - completed assistant text from the checkpointed registry before DB persist); - buffered live-stream normalization is a follow-up. Bare-`[n]` only, so - web_search `[citation:url]` markers are untouched. -3. **Prompt contract (both surfaces).** Update `main_agent/.../citations/on.md` - (production) to teach the `[n]` channel alongside the existing web_search/`task` - channels; reconcile the composer path by folding `citation_contract.md` into - `base/citations_on.md` (then delete `citation_contract.md`). `citations_off.md` - stays. -4. **Mentions → scope.** Map `@document`/`@folder` mentions to - `SearchScope(document_ids=…)` for the tool; retire `kb_priority` mention - surfacing. -5. **Remove the old eager path.** ✅ **Done** — `KnowledgePriorityMiddleware` - and the old `search_knowledge_base` hybrid helper in `knowledge_search.py` - are deleted (the whole module is gone); `kb_context_projection` is trimmed to - a tree-only projector (kept because it still projects `` to - subagents); `kb_priority` state + the `enable_kb_priority_preinjection` flag + - all `` prompt references are removed. Still pending: - `ChucksHybridSearchRetriever` (after migrating `ConnectorService`). Migrate - `web_search` to register `WEB_RESULT` so all citations unify on `[n]` — - **done**, see §12 build-order Step 6. - ---- - -## 11. After-plan follow-ups (separate workstreams) - -Not part of the §10 rollout — different subsystems, tracked here so they aren't -lost: - -- **Progress streaming** (streaming subsystem). Promote a curated subset of - tool/subagent events to user-visible progress ("Reading…", "Searching…") to - collapse *perceived* latency from pull-based retrieval. See §4.5. This is the - mitigation for pull's only real cost, but it touches the streaming pipeline, not - the retrieval/citation path — so it ships independently. - ---- - -## 12. Unified document render (search + read) - -The model meets a knowledge-base document in two moments: as **excerpts** from a -search, and as a **full read** of one object. Today these use two unrelated -shapes (compact text for search; `` + `` + -`` XML for reads), with two different citation tokens. That doubles the -schema the model must learn and is a hallucination surface. We collapse both onto -**one renderer**. - -### Principles - -- **One envelope, two views.** The same renderer renders a document whether it - arrives partial (search) or complete (read). Only the `view` and the set of - passages shown differ. -- **`[n]` is the only citable token**, in both views, assigned by the shared - registry (find-or-create). A chunk first seen in search keeps its `[n]` when the - same doc is later read in full. -- **Completeness is the `view` word, nothing more.** A search result is inherently - excerpts; a read is inherently the whole object. No `is_complete` flag, no numeric - coverage count. `view="excerpt"` tells the model it saw a slice (so it should read - the doc before claiming the doc "only" says X); `view="full"` says it has the whole - object. A `chunks_shown`/`total_chunks` count was considered and rejected: search - excerpts have no total on hand (and we won't add a count query for it), and full - reads are already self-evident from `view`. -- **Drop noise.** Raw `document_id` / `chunk_id` and the `metadata_json` blob - leave the model's view (they stay server-side as registry keys). The model - sees `title`, `source`, and `[n]` passages. -- **No seek table.** A full read returns the whole document as one numbered - document block; the `` line-range map is dropped. It was keyed by internal - `chunk_id` (which the model never sees), so it gave the agent nothing actionable - to seek by. Re-keying it to `[n]` or adding chunk previews would only add cognitive - load the agent must reconcile against the actual content — a hallucination/quality - risk that outweighs the token savings on the rare genuinely-large read. Simpler: - hand over the document, numbered, and let the model read it. - -### Shape - -Excerpt (from `search_knowledge_base`): - -```xml - - [3] We agreed to push launch to March 10. - [4] Marketing will be notified next week. - -``` - -Full (from a read): - -```xml - - [3] We agreed to push launch to March 10. - [4] Marketing will be notified next week. - [7] … - …(all chunks, numbered) - -``` - -`` becomes simply "N documents in excerpt view"; a read is -"one document in full view". This supersedes the standalone `` -renderer decision and confirms the earlier removal of `is_complete`. - -### Build order (one step at a time) - -1. **Registry merge reducer** — `citation_registry` merges (find-or-create union, - re-mint on collision) instead of replacing, so parent/subagent (and parallel) - registrations stay globally consistent. Pure; independently testable. ✅ -2. **One document renderer** with a `view` parameter; point `search_knowledge_base` - at it (excerpt view), replacing today's `retrieved_context` renderer. ✅ -3. **Register-on-read + full view** — the KB read path registers its chunks and - renders through the same renderer (full view); the whole document is returned - numbered, with **no ``**. The `read_file` tool loads the document - via `KBPostgresBackend.aload_document`, renders it against the conversation - registry, and persists `citation_registry`; `build_document_xml` is deleted. ✅ -4. **Retire Channel C** — now that KB reads emit `[n]` (Step 3), the - knowledge_base read/specialist path cites bare `[n]` instead of - `[citation:chunk_id]`. The KB subagent prompts (cloud/desktop, full/read-only) - and `description_readonly.md` were rewritten to the `` - `[n]` format, the `evidence.chunk_ids` field became `evidence.citations`, and - `citations/on.md` folds the KB relay into Channel A (preserve `[n]` from a - specialist verbatim). Channel C is **narrowed, not deleted**: it still covers - `task` specialists that emit `[citation:id]` — today only the deliverables - `knowledge_base` tool, which builds its own `` XML and is not yet on - the registry/`[n]` spine. Migrating that tool (and then fully deleting - Channel C) is a follow-up. ✅ -5. **Delete `kb_matched_chunk_ids`** — with no seek table and no `matched` flag, the - search→read highlighting hand-off has no consumer. Removed: the state field - (`filesystem_state.py`) and its reducer default (`reducers.py`); the - `search_knowledge_base` tool's `_matched_chunk_ids` writer; the dead - `KnowledgePriorityMiddleware` writes plus the `matched_chunk_ids` return of - `_materialize_priority` (`knowledge_search.py`); and the stale - `` / `matched="true"` / `` rendering prose in the cloud - filesystem prompt (`cloud.py`), rewritten to the `` `[n]` - read format. The `resolver.py` docstring reference was dropped and the two - integration assertions that read the field now assert scope confinement via the - rendered `` titles. (The retriever-layer `matched_chunk_ids` - in `chunks_hybrid_search.py` is a separate output shape and is untouched.) ✅ -6. **Web onto the registry (Channel B → A)** — `web_search` now registers each - result as a `WEB_RESULT` (locator `{url}`) and renders a `` block - of `` blocks with `[n]` labels, returning a - `Command(update={messages, citation_registry})` like `search_knowledge_base`. - `markers.py` already maps `WEB_RESULT → url`, so `[n]` resolves end-to-end with - no frontend change. To enable this, the renderer was generalized: a - `RenderablePassage` now carries a generic `locator: dict` (KB fills - `{document_id, chunk_id}`; web fills `{url}`) instead of fixed KB fields, and a - dedicated **citation-state middleware** declares the `citation_registry` channel - for the `research` subagent (which doesn't use the filesystem state). The two - duplicate `web_search` implementations were collapsed into the shared - `app/agents/chat/shared/tools/web_search.py`; the `research` copy was deleted. - Prompts updated: `citations/on.md` drops the web channel (web is now Channel A - `[n]`; only the legacy `[citation:id]` specialist relay remains, relabelled - Channel B), the research subagent prompt cites `[n]`, the main `web_search` - description teaches ``/`[n]`, `off.md` suppresses `[n]` too, and - stale ``/`[citation:chunk_id]` references in `dynamic_context` and - the grok/openai_codex provider hints were corrected to `[n]`. `scrape_webpage` - stays uncited (raw page text, no `[n]`) — a fact from a scrape reports its URL - instead. Connectors and chat turns remain unmigrated (future workstreams). ✅ diff --git a/surfsense_backend/tests/unit/agents/multi_agent_chat/shared/citations/test_registry.py b/surfsense_backend/tests/unit/agents/multi_agent_chat/shared/citations/test_registry.py index 6363ec897..ba2d7cc59 100644 --- a/surfsense_backend/tests/unit/agents/multi_agent_chat/shared/citations/test_registry.py +++ b/surfsense_backend/tests/unit/agents/multi_agent_chat/shared/citations/test_registry.py @@ -1,4 +1,4 @@ -"""Unit tests for the citation registry spine (ADR 0001 §3).""" +"""Unit tests for the citation registry spine.""" from __future__ import annotations