apunkt/SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-06-26 21:39:43 +02:00

CREDO23 7f09c8a290 docs: add RAG citation and context architecture ADR

2026-06-24 21:35:19 +02:00

ADR 0001 — RAG, Citation, and Context Architecture

Status: Proposed
Date: 2026-06-24
Owners: SurfSense core
Supersedes: the pre-agent KB priority/planner injection path

1. Context & problem

SurfSense answers questions over a user's indexed knowledge base (documents, chats, connectors, web results). The current pipeline causes the model to hallucinate citations and answers. Root causes identified during review:

Content/ID split. The model is asked to author or copy complex identifiers (chunk_id, raw URLs, free-text titles) that sit far from the content they label. LLMs reliably corrupt nearby digits — so citations point at the wrong source or at nothing.
Pre-agent work. A planner LLM call + embedding + hybrid search runs in before_agent on every turn (KnowledgePriorityMiddleware), plus an eager fetch_mentioned_documents whose chunks are then discarded. This adds latency and context noise before the agent even reasons.
Mentions are mismanaged. An @document mention forces a wasted full-chunk fetch, points at the doc twice (inline backtick path + <priority_documents> entry), and still requires a read round-trip — then dumps the whole doc regardless of the question.
Retrieval quality. Search retrieves on chunks but collapses to documents, chunks have no overlap, and the reranker exists (RerankerService) but is not wired into the agent path.
Context bloat. The workspace tree (up to 4000 tokens) and priority lists are injected into the durable messages list every turn, causing context distraction/confusion.

This ADR defines the target architecture. It is the single source of truth; implementation issues should reference section numbers here.

2. Principles

The model cites tiny numbers [n], never identifiers. The server owns the mapping from [n] to a real source. There is nothing for the model to invent.
Retrieval is pull-based, behind tools. Nothing retrieves before the agent runs. The agent calls a tool when it needs information.
A mention is scope, not a retrieval trigger. Mentioning a thing tells the model the thing exists and gives it a filter it may apply — it does not fetch.
Ambient context is not conversation. Transient per-turn context (tree, mention scope, memory) is rendered via the system prompt, not appended to the durable messages trajectory.
All complexity lives server-side (resolver, retriever), so the model's job stays trivial: read passages, echo the number next to the one you used.

3. Citation architecture (the spine)

Everything hangs off this. Build it first.

3.1 What is citable

Anything that is information retrieved from a source. Each source type has a natural citable unit:

| Source | Citable unit | Entry locator | Enters context via |
| --- | --- | --- | --- |
| `kb_chunk` | chunk | `document_id` + `chunk_id` | `search_knowledge_base` |
| `kb_document` | document | `document_id` | `read` (whole doc) |
| `connector_item` | item | `connector_id` + `external_id` | connector tool |
| `web_result` | url | `url` | web search / crawl |
| `chat_turn` | turn | `thread_id` + `message_id` | `@chat` / referenced chat |
| `anon_chunk` | chunk | `session/doc` + `chunk_id` | uploaded anonymous doc |

Not citable (control/pointer — never gets a number): workspace tree, mention scope notes, report_context, the priority/registry listing itself.

3.2 The citation entry (the truth)

A registered entry is the durable identity of a citable unit:

from typing import Any, TypedDict


class CitationEntry(TypedDict):
    n: int                      # the tiny label shown to the model
    source_type: str            # "kb_chunk" | "kb_document" | "connector_item"
                                # | "web_result" | "chat_turn" | "anon_chunk"
    locator: dict[str, Any]     # source-specific identity (see table 3.1)
    display: dict[str, Any]     # title, source label, url, date; feeds the UI pill

3.3 The registry (the bookkeeping)

Lives in agent state so it survives across turns and across orchestrator + subagents.

class CitationRegistry(TypedDict):
    by_n: dict[int, CitationEntry]      # n -> entry  (resolve direction)
    by_key: dict[str, int]              # source_key -> n  (dedup / find-or-create)
    next_n: int                         # monotonic counter

source_key is a stable string derived from (source_type, locator), e.g. "kb_chunk:42:880", "web_result:https://…", "chat_turn:7:1190".
Numbering is per-conversation and monotonic. A given [n] never changes meaning within a conversation.
Dedup: registering an already-seen unit returns its existing n.
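
The `make_key` helper used by `register` in §3.4 is not defined in this ADR. A minimal sketch of one way to derive a stable `source_key`, assuming the locator fields are joined in the order given in table 3.1. The `LOCATOR_FIELDS` mapping and the `session_doc_id` field name are hypothetical, introduced only for illustration.

```python
from typing import Any

# Assumed locator field order per source type, following table 3.1.
# "session_doc_id" is a hypothetical name for the "session/doc" locator part.
LOCATOR_FIELDS: dict[str, tuple[str, ...]] = {
    "kb_chunk": ("document_id", "chunk_id"),
    "kb_document": ("document_id",),
    "connector_item": ("connector_id", "external_id"),
    "web_result": ("url",),
    "chat_turn": ("thread_id", "message_id"),
    "anon_chunk": ("session_doc_id", "chunk_id"),
}


def make_key(source_type: str, locator: dict[str, Any]) -> str:
    """Build a stable source_key such as "kb_chunk:42:880"."""
    fields = LOCATOR_FIELDS[source_type]  # KeyError on an unknown source type
    parts = [str(locator[name]) for name in fields]  # KeyError on a missing field
    return ":".join([source_type, *parts])
```

Reading the fields in a fixed order keeps the key independent of dict insertion order, so the same unit always dedups to the same `n`.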

3.4 The two operations

def register(registry, source_type, locator, display) -> int:
    """Find-or-create. Returns the [n] for this unit."""
    key = make_key(source_type, locator)
    if key in registry["by_key"]:
        return registry["by_key"][key]
    n = registry["next_n"]
    registry["next_n"] += 1
    registry["by_n"][n] = {"n": n, "source_type": source_type,
                           "locator": locator, "display": display}
    registry["by_key"][key] = n
    return n

def resolve(registry, n) -> CitationEntry | None:
    """Map a model-emitted [n] back to its source. Unknown n -> None (drop)."""
    return registry["by_n"].get(n)

3.5 Lifecycle

source yields item
   → register(entry)            # source_type + locator + display  → assign/reuse [n]
   → render passage with [n]    # the number sits INLINE next to the content
   → model writes "...March 10 [n]"
   → resolver: [n] → entry      # server-side, on the streamed answer
   → frontend renders citation pill

The model only ever echoes a number that was printed next to the content it used. Unknown/garbled numbers resolve to nothing and are dropped (abstention by construction).
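
The resolver step of the lifecycle can be sketched as a pass over the finished answer text. This is an illustration under assumptions, not the planned implementation: `resolve_answer` is a hypothetical name, `by_n` stands in for `registry["by_n"]`, and the pattern matches any bracketed integer.

```python
import re
from typing import Any

# An optional single space before the marker is consumed so that a dropped
# marker does not leave a stray gap before punctuation.
CITATION = re.compile(r"\s?\[(\d+)\]")


def resolve_answer(
    answer: str, by_n: dict[int, dict[str, Any]]
) -> tuple[str, list[dict[str, Any]]]:
    """Drop unknown [n] markers and return the entries that were actually cited."""
    cited: list[dict[str, Any]] = []
    seen: set[int] = set()

    def keep_or_drop(match: "re.Match[str]") -> str:
        n = int(match.group(1))
        entry = by_n.get(n)
        if entry is None:
            return ""  # unknown number: the marker is removed, the claim stays uncited
        if n not in seen:
            seen.add(n)
            cited.append(entry)
        return match.group(0)  # known number: keep the marker as written

    cleaned = CITATION.sub(keep_or_drop, answer)
    return cleaned, cited
```

With only `1` registered, `"Launch moved to March 10 [1]. Budget doubled [9]."` keeps `[1]` and loses `[9]`, which is the "uncited claim" fallback described in §4.5.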

3.6 Presentation format (`<retrieved_context>`)

[n] must be the only citable integer adjacent to each passage. No chunk 4 of 19, no raw ids near the text. Grouping by document is allowed; the [n] is per passage.

<retrieved_context>
Excerpts retrieved from the user's knowledge base for this query.
Cite a passage with its [n].

Document: "Q3 Launch Notes" (Slack · #launch · 2026-03-02)
  [1] We agreed to push launch to March 10.
  [2] Marketing will be notified next week.
Document: "Timeline" (Notion · 2026-02-28)
  [3] Dates floated were Mar 10 and Mar 17.
</retrieved_context>
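
A renderer that produces this layout could look roughly like the sketch below. The `Passage` shape and `render_retrieved_context` name are hypothetical; it assumes each passage has already been assigned its `n` by `register`, and it groups by title and metadata only for brevity (a real implementation would more likely group by document id).

```python
from typing import TypedDict


class Passage(TypedDict):
    n: int          # registry number already assigned by register()
    doc_title: str
    doc_meta: str   # e.g. "Slack · #launch · 2026-03-02"
    text: str


def render_retrieved_context(passages: list[Passage]) -> str:
    """Group passages by document and print [n] inline before each passage."""
    groups: dict[tuple[str, str], list[Passage]] = {}
    for p in passages:
        groups.setdefault((p["doc_title"], p["doc_meta"]), []).append(p)

    lines = [
        "<retrieved_context>",
        "Excerpts retrieved from the user's knowledge base for this query.",
        "Cite a passage with its [n].",
        "",
    ]
    for (title, meta), group in groups.items():
        lines.append(f'Document: "{title}" ({meta})')
        for p in group:
            # Collapse whitespace so each passage stays on a single line
            text = " ".join(p["text"].split())
            lines.append(f"  [{p['n']}] {text}")
    lines.append("</retrieved_context>")
    return "\n".join(lines)
```

No chunk index or raw id is written anywhere in the output, so `[n]` stays the only citable integer next to each passage.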

3.7 Reconciliation with the existing token format

The frontend and evals already parse [citation:ID] (surfsense_web/lib/citations/citation-parser.ts, surfsense_evals/src/surfsense_evals/core/parse/citations.py).

Decision: keep the wire token [citation:ID], with ID = n. Two options were considered: the model emits [n] and a thin normalization step rewrites [n] → [citation:n] on the streamed output before it reaches the existing parser, or the model emits [citation:n] directly. Either way, ID is now a small ordinal from the registry, not a chunk_id, URL, or title. The resolver maps n → CitationEntry → the frontend citation object the UI already expects.

Decided (§8.8): the model emits [n] (smallest surface for the model to get right); the server normalizes [n] → [citation:n] before the existing parser.
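
Because the output is streamed, a marker can be split across chunk boundaries (for example `"["`, `"1"`, `"2]"`). A minimal sketch of a normalizer that holds back a possible partial marker until it is complete, assuming plain string chunks. `CitationNormalizer` is a hypothetical name and not part of the existing codebase.

```python
import re

COMPLETE = re.compile(r"\[(\d+)\]")
# A trailing "[" optionally followed by digits may be the start of a marker
PARTIAL_TAIL = re.compile(r"\[\d*$")


class CitationNormalizer:
    """Rewrites [n] to [citation:n] across arbitrary stream chunk boundaries."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> str:
        text = self._pending + chunk
        tail = PARTIAL_TAIL.search(text)
        if tail:
            # Hold back the possible partial marker until more text arrives
            text, self._pending = text[: tail.start()], text[tail.start():]
        else:
            self._pending = ""
        return COMPLETE.sub(r"[citation:\1]", text)

    def flush(self) -> str:
        """Emit whatever is still held back at end of stream, unchanged."""
        rest, self._pending = self._pending, ""
        return rest
```

Text that already contains `[citation:n]` passes through untouched, since the pattern only matches brackets that contain digits alone.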

4. Retrieval architecture (pull-based)

4.0 Execution channels (verified against the codebase)

The orchestrator (main agent) does not own the virtual filesystem. It has a small fixed toolset; everything else is delegated via task(<specialist>, …). Verified in main_agent/tools/index.py and subagents/builtins/knowledge_base.

| Capability | Owner | Reached via |
| --- | --- | --- |
| `search_knowledge_base(query, scope?)` (semantic/hybrid RAG retrieval, read-only) | orchestrator | direct call |
| `web_search`, `scrape_webpage` | orchestrator | direct call |
| `update_memory`, `create_automation`, `write_todos`, `task` | orchestrator | direct call |
| virtual filesystem: `read_file`, `write_file`, `edit_file`, `ls`, `glob`, `grep`, `list_tree`, `rm`, `rmdir`, `move_file` | knowledge_base subagent | `task(knowledge_base, …)` |
| connector ops (gmail/slack/jira/…) | connector subagents | `task(<connector>, …)` |

Consequences for citations:

The dominant RAG path is orchestrator-direct (search_knowledge_base), so it registers [n] exactly where the answer is composed — no relay.
The shared registry (§8.9) is load-bearing only for the delegated lanes (whole-doc reads via knowledge_base, connector reads): the subagent registers into the shared registry and relays [n] upward.
search_knowledge_base is semantic RAG, distinct from filesystem search (grep/glob), which belongs to the subagent. routing.md conflates these and omits search_knowledge_base from its direct-tools list — that prompt is stale and must be corrected (see §7).

4.1 The two retrieval operations

| Operation | Tool | Owner | For |
| --- | --- | --- | --- |
| search | `search_knowledge_base(query, scope?)` → chunks, each registered → `[n]` | orchestrator (direct) | "related / scoped question" (RAG) |
| read | `read_file(path)` (whole object) | knowledge_base subagent (`task`) | "summarize / translate / rewrite / navigate this" |

The agent chooses based on the query. No server-side intent classifier; the query semantics decide (summarize ⇒ delegate a read; related ⇒ direct search).

4.2 `scope` — the mention→retrieval bridge

scope is an optional typed filter restricting the search haystack:

scope = {
    "document_ids": [42],
    "folder_ids": [],
    "connector_ids": [],
    "thread_ids": [],
}

Becomes WHERE constraints on the chunk search (document_id IN (...), etc.).
Agent-controlled, not automatic. "in this doc" → agent passes scope; "related" → agent omits it.
Uniform across mention types: doc/folder/connector/chat are just keys here.
How it reaches the retriever depends on the channel:
- direct search_knowledge_base → scope is a structured tool arg the orchestrator passes (new arg to add — current tool has no scope).
- delegated read / browse → the orchestrator expresses scope in the task prompt (path + ids); the subagent translates it into its filesystem calls.

Decision: even when scope pins a single doc, search_knowledge_base still runs full hybrid ranking within that doc (a large doc still needs its relevant passages surfaced) — it does not return raw chunk order.
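
How `scope` might turn into WHERE constraints can be sketched as follows. Only `document_id` is named in this ADR; the other column names, the `%s` placeholder style, and the `scope_to_where` helper are assumptions. The ADR also does not say whether multiple scope keys are unioned or intersected, so the sketch exposes that as an explicit `combine` argument and defaults to `OR` purely as a placeholder choice.

```python
from typing import Optional

# Assumed scope key -> chunk-table column mapping (only document_id is from the ADR).
SCOPE_COLUMNS = {
    "document_ids": "document_id",
    "folder_ids": "folder_id",
    "connector_ids": "connector_id",
    "thread_ids": "thread_id",
}


def scope_to_where(
    scope: Optional[dict[str, list[int]]], combine: str = "OR"
) -> tuple[str, list[int]]:
    """Return a parameterized SQL fragment and its parameters; ("", []) means unscoped."""
    if combine not in ("AND", "OR"):
        raise ValueError("combine must be 'AND' or 'OR'")
    scope = scope or {}
    unknown = set(scope) - set(SCOPE_COLUMNS)
    if unknown:
        # Fail loudly so a misspelled key cannot silently widen the search
        raise ValueError(f"unknown scope keys: {sorted(unknown)}")

    clauses: list[str] = []
    params: list[int] = []
    for key, column in SCOPE_COLUMNS.items():
        ids = scope.get(key) or []
        if not ids:
            continue  # an empty list means "no constraint on this key"
        placeholders = ", ".join(["%s"] * len(ids))
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(ids)

    if not clauses:
        return "", []
    return "(" + f" {combine} ".join(clauses) + ")", params
```

The fragment only narrows the candidate set; hybrid ranking and reranking still run over whatever chunks remain inside it.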

4.3 Retrieval quality fixes (folded into this work)

Return at chunk granularity with stable chunk_id (no collapse-to-document that loses the citable unit).
Wire the reranker (RerankerService) into the search_knowledge_base path.
Add chunk overlap in the indexing pipeline (config in app/config/__init__.py; RecursiveChunker currently has no overlap).
Add the scope arg to search_knowledge_base.
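
To illustrate what chunk overlap means here, the sketch below uses a plain sliding window over words. It is not RecursiveChunker and makes no claim about that chunker's configuration; `chunk_with_overlap` and its parameters are hypothetical.

```python
def chunk_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With overlap, a sentence that straddles a chunk boundary appears whole in at least one chunk, so the passage shown next to `[n]` is less likely to be cut mid-thought.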

4.4 End-to-end pipeline

flowchart TD
    U["User turn + @mentions"] --> AMB["Mentions → ambient scope note (no fetch)"]
    AMB --> ORCH{"ORCHESTRATOR reasons"}

    ORCH -- "scoped/related question" --> SKB["search_knowledge_base(query, scope?)<br/>DIRECT · hybrid + rerank"]
    ORCH -- "public web" --> WEB["web_search / scrape_webpage<br/>DIRECT"]
    ORCH -- "summarize/read/navigate/mutate" --> TKB["task(knowledge_base, …)<br/>DELEGATE"]
    ORCH -- "connector op" --> TCN["task(gmail/slack/…)<br/>DELEGATE"]

    SKB --> REGD["register kb_chunk → [n]"]
    WEB --> REGD2["register web_result → [n]"]

    subgraph SUB["SUBAGENTS (filesystem / connector tools)"]
        FS["read_file/ls/glob/grep/…"]
        CN["connector ops"]
        FS --> REGS["register → [n] (SHARED registry)"]
        CN --> REGS
        REGS --> SYN["synthesize + relay [n] up"]
    end

    TKB --> FS
    TCN --> CN

    REGD --> COMPOSE["Orchestrator composes answer with [n]"]
    REGD2 --> COMPOSE
    SYN --> COMPOSE
    COMPOSE --> NORM["[n] → [citation:n]"] --> RESOLVE["resolve via shared registry<br/>(unknown → dropped)"] --> UI["Citation pills"]

4.5 Tradeoffs: pull vs push (and perceived latency)

We chose pull (the agent reads/searches via tools when needed) over push (eagerly injecting referenced content into context). Rationale and costs:

Why pull is the default

Token efficiency — fetch only what the query needs, not whole docs.
Scales to many/large mentions, folders, connectors — push cannot.
Intent-adaptive granularity — passages for scoped Qs, whole doc for summaries.
Context hygiene — content arrives as evidence ([n]), not ambient noise.
Uniform across all mention types.

Costs (and why they're acceptable)

Perceived latency (TTFT). Pull adds a tool round-trip before answer tokens. This is the only place push clearly wins. The mitigation is progress streaming (time-to-first-signal, not first-token): stream "Reading Q3 Launch Notes…" / "Searching your knowledge base…" so the wait feels productive — the pattern used by Perplexity, Claude, and Cursor.

Out of scope for this ADR's rollout. Progress streaming is a separate workstream — it touches the streaming subsystem, not the retrieval/citation path. Tracked as an after-plan follow-up. Today intermediate/subagent steps are largely suppressed (surfsense:internal), which is what makes pull feel slow; the follow-up promotes a curated subset of tool/subagent events to user-visible progress.
"Cite-without-read" risk — neutralized structurally: ambient pointers carry no [n]; [n] exists only after a tool returns evidence; invented [n] resolves to nothing and is dropped. The worst residual case degrades from a confident wrong citation to an uncited claim (further guarded by content-free pointers + a "read before you answer" policy line).
Delegation synthesis loss — whole-doc reads go through the KB subagent, which summarizes back; mitigate by instructing it to return quotes + [n].

Conditional hybrid. A bounded eager fast-path (inject content only when a single small doc is mentioned) may be added later, only if latency telemetry justifies it — not built speculatively.

5. Mention architecture (scope, not trigger)

When the user mentions anything:

It is recorded as ambient scope in the system prompt (via dynamic_prompt + runtime.context), e.g.:
Referenced this turn: doc 42 (/documents/Launch/Q3.xml), folder 7 (/documents/Specs/). For a scoped question call search_knowledge_base(query, scope={document_ids:[42]}); to load the whole thing delegate task(knowledge_base, "read /documents/Launch/Q3.xml …").
No fetch, no RAG, no <priority_documents> pre-injection.
The agent decides: direct search_knowledge_base(query, scope) (scoped question) or delegated task(knowledge_base, …) read (whole-object intent).

Per mention type (note the channel — direct vs delegated):

| Mention | Ambient note | Retrieval behavior | Citation kind on use |
| --- | --- | --- | --- |
| `@document` | doc id + path | direct `search_knowledge_base(scope={document_ids:[id]})`, or delegated `task(knowledge_base, read …)` | `kb_chunk` / `kb_document` |
| `@folder` | folder id + path | direct `search_knowledge_base(scope={folder_ids:[id]})`, or delegated browse | `kb_chunk` |
| `@connector account` | connector_id + account | `task(<connector>, "… connector_id=id")` | `connector_item` |
| `@chat` | thread id | direct `search_knowledge_base(scope={thread_ids:[id]})` (⚠ if chats are KB-indexed; else delegated read) | `chat_turn` |
| anonymous upload | session doc ref | direct `search_knowledge_base(scope=anon)` / delegated read | `anon_chunk` |
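
Building the ambient note is pure string rendering over mention metadata, with no fetch involved. A rough sketch for document and folder mentions only, assuming a hypothetical `Mention` shape and `render_mention_note` helper; the exact wording of the note is illustrative.

```python
from typing import TypedDict


class Mention(TypedDict):
    kind: str   # "document" | "folder" (other mention types omitted in this sketch)
    id: int
    path: str


LABELS = {"document": "doc", "folder": "folder"}
SCOPE_KEYS = {"document": "document_ids", "folder": "folder_ids"}


def render_mention_note(mentions: list[Mention]) -> str:
    """Render the per-turn ambient scope note; touches no storage and runs no search."""
    if not mentions:
        return ""
    refs = ", ".join(f"{LABELS[m['kind']]} {m['id']} ({m['path']})" for m in mentions)

    # Collect ids per scope key so the note shows a ready-to-use filter
    scope: dict[str, list[int]] = {}
    for m in mentions:
        scope.setdefault(SCOPE_KEYS[m["kind"]], []).append(m["id"])
    scope_text = ", ".join(f"{key}:{ids}" for key, ids in scope.items())

    return (
        f"Referenced this turn: {refs}. "
        f"For a scoped question call search_knowledge_base(query, scope={{{scope_text}}}); "
        'to load a whole object delegate task(knowledge_base, "read <path>").'
    )
```

The note carries ids and paths but no content and no `[n]`, which is what keeps an ambient pointer from being cited directly.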

6. Context plane separation

| Plane | Carries | Mechanism | Lifetime |
| --- | --- | --- | --- |
| Ambient | workspace tree, mention scope, memory, instructions | system prompt via `dynamic_prompt` + `runtime.context` | per-turn, not persisted in messages |
| Evidence | retrieved passages with `[n]` | tool results / `<retrieved_context>` | enters trajectory when a tool runs |
| Trajectory | user/assistant turns, tool calls | `messages` | durable, checkpointed |

The workspace tree and priority/registry listings move out of messages into the ambient plane.
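
The separation can be pictured as assembling the model input fresh each turn while leaving the durable list alone. The sketch below is framework-agnostic and does not use the `dynamic_prompt` API itself; `TurnAmbient`, `build_model_input`, and the tag names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnAmbient:
    """Transient per-turn context; rebuilt every turn and never checkpointed."""
    workspace_tree: str = ""
    mention_note: str = ""
    memory: str = ""


def build_model_input(base_prompt: str, ambient: TurnAmbient, messages: list[dict]) -> dict:
    """Fold ambient context into the system prompt; the durable messages are only copied."""
    sections = [base_prompt]
    for tag, body in (
        ("workspace_tree", ambient.workspace_tree),
        ("mention_scope", ambient.mention_note),
        ("memory", ambient.memory),
    ):
        if body:
            sections.append(f"<{tag}>\n{body}\n</{tag}>")
    return {"system": "\n\n".join(sections), "messages": list(messages)}
```

Since nothing is appended to `messages`, last turn's tree or mention note cannot linger in the trajectory and distract later turns.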

7. Cleanup (what gets removed/changed)

Remove from the hot path:

KnowledgePriorityMiddleware search branch (planner LLM, embedding, hybrid search in before_agent).
fetch_mentioned_documents eager chunk pull.
<priority_documents> pre-injection and KbContextProjectionMiddleware priority projection.
kb_priority / kb_matched_chunk_ids state plumbing (deleted per §8.10; add a dedicated citation_registry field instead).

Keep / add:

search_knowledge_base(query, scope?) (orchestrator-direct) as the only RAG entry point, returning registered chunks with [n]. Add the scope arg.
read_file (knowledge_base subagent, via task) for whole-object ops; cited reads register a kb_document / kb_chunk entry into the shared registry.
The citation registry in state (shared across orchestrator + subagents).
Reranker wired into search_knowledge_base; chunk overlap in indexing.
Ambient mention note via dynamic_prompt.
Fix routing.md: add search_knowledge_base to the orchestrator's direct-tools list, and clarify that "search inside the workspace goes through task(knowledge_base)" refers to filesystem search (grep/glob), not the semantic search_knowledge_base tool.

8. Locked decisions

1. Model cites [n]; server owns [n] → source via a registry. ✅
2. Numbering is per-conversation, monotonic, dedup'd (find-or-create). ✅
3. Retrieval is pull-based: orchestrator-direct search_knowledge_base (RAG) + delegated read_file (knowledge_base subagent); no pre-agent retrieval. ✅
4. Mention = ambient scope; scope is an agent-controlled search_knowledge_base filter. ✅
5. Scoped search still runs full hybrid ranking within scope. ✅
6. Ambient context (tree, mention scope) lives in the system prompt, not messages. ✅
7. Wire token stays [citation:ID] with ID = n. ✅
8. Model emits [n]; the server normalizes [n] → [citation:n] on the streamed output before the existing parser. The model's surface stays minimal. ✅
9. Subagent retrievals register into the same conversation citation_registry, so [n] is globally consistent across orchestrator + subagents. This replaces the Channel A/B relay entirely. ✅
10. Delete the legacy kb_priority / kb_matched_chunk_ids plumbing; add a dedicated citation_registry field to state rather than overloading old fields. ✅

9. Open items

None — all decisions locked. See §8.

10. Rollout (suggested)

1. Citation registry + resolver (state + register/resolve); no behavior change yet.
2. search_knowledge_base returns registered chunks; render <retrieved_context>; normalize [n] → [citation:n].
3. Wire reranker; add chunk overlap in indexing.
4. Convert mentions to ambient scope + scope arg; delete priority pre-injection.
5. Move workspace tree to ambient plane.
6. Extend registry to connector/web/chat sources.

11. After-plan follow-ups (separate workstreams)

Not part of the §10 rollout — different subsystems, tracked here so they aren't lost:

Progress streaming (streaming subsystem). Promote a curated subset of tool/subagent events to user-visible progress ("Reading…", "Searching…") to collapse perceived latency from pull-based retrieval. See §4.5. This is the mitigation for pull's only real cost, but it touches the streaming pipeline, not the retrieval/citation path — so it ships independently.
