mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-27 19:25:15 +02:00
Merge remote-tracking branch 'upstream/dev' into feat/opentelemetry
This commit is contained in:
commit
98e3950dc8
20 changed files with 569 additions and 122 deletions
|
|
@ -1,11 +1,42 @@
|
|||
<citations>
|
||||
Apply chunk citations only when the runtime injects `<document>` /
|
||||
`<chunk id='…'>` blocks.
|
||||
Citations reach the answer through two channels. Use whichever applies — and
|
||||
never invent ids you didn't see. Citation ids are resolved by exact-match
|
||||
lookup; a wrong id silently breaks the link, so when in doubt, omit.
|
||||
|
||||
### Channel A — chunk blocks injected this turn
|
||||
When `search_surfsense_docs` or `web_search` returns `<document>` /
|
||||
`<chunk id='…'>` blocks in this turn:
|
||||
|
||||
1. For each factual statement taken from those chunks, add
|
||||
`[citation:chunk_id]` using the exact id from `<chunk id='…'>`.
|
||||
2. Multiple chunks → `[citation:id1], [citation:id2]` (comma-separated).
|
||||
3. Never invent or normalise ids; if unsure, omit.
|
||||
4. Plain brackets only — no markdown links, no footnote numbering.
|
||||
5. If no chunk-tagged documents appear this turn, do not fabricate citations.
|
||||
`[citation:chunk_id]` using the **exact** id from a visible
|
||||
`<chunk id='…'>` tag. Copy digit-for-digit (or the URL verbatim);
|
||||
do not retype from memory.
|
||||
2. `<document_id>` is the parent doc id, **not** a citation source —
|
||||
only ids inside `<chunk id='…'>` count.
|
||||
3. Multiple chunks → `[citation:id1], [citation:id2]` (comma-separated,
|
||||
each id copied individually).
|
||||
4. Never invent, normalise, or guess at adjacent ids; if unsure, omit.
|
||||
5. Plain brackets only — no markdown links, no footnote numbering.
|
||||
|
||||
### Channel B — citations relayed by a `task` specialist
|
||||
A `task(...)` tool message may contain `[citation:<chunk_id>]` markers
|
||||
the specialist already attached to its prose. The specialist saw the
|
||||
underlying `<chunk id='…'>` blocks; you didn't. So:
|
||||
|
||||
1. **Preserve those markers verbatim** in your final answer — do not
|
||||
reformat, renumber, drop, or wrap them in markdown links. When you
|
||||
paraphrase a specialist sentence, copy the marker character-for-
|
||||
character; do not regenerate the id from memory (LLMs reliably
|
||||
corrupt nearby digits).
|
||||
2. Keep each marker attached to the sentence the specialist attached
|
||||
it to.
|
||||
3. Do **not** add new `[citation:…]` markers of your own to a
|
||||
specialist's prose; if a fact has no marker, the specialist
|
||||
couldn't tie it to a chunk and neither can you.
|
||||
4. When a specialist returns JSON, the citation markers live inside
|
||||
the prose-bearing fields (e.g. a summary or excerpt). Pull them
|
||||
along with the surrounding sentence when you quote.
|
||||
|
||||
If neither channel surfaces citation markers this turn, do not fabricate
|
||||
them.
|
||||
</citations>
|
||||
|
|
|
|||
|
|
@ -2,4 +2,4 @@ Read-only specialist for the user's workspace (documents and folders). Use to fi
|
|||
|
||||
Pass your full question as one string. The specialist runs in isolation: it cannot see this thread, so include any path hints, filters, or constraints it needs.
|
||||
|
||||
The specialist returns plain prose with absolute paths.
|
||||
The specialist returns plain prose with absolute paths and `[citation:<chunk_id>]` markers when claims came from KB-indexed chunks. Preserve those markers verbatim if you forward the answer.
|
||||
|
|
|
|||
|
|
@ -35,6 +35,43 @@ Map outcomes to your `status`:
|
|||
|
||||
You construct the structured `evidence` fields from your own knowledge of what you called and what you observed — the tools do not return them. Never report values you did not actually see.
|
||||
|
||||
## Chunk citations in your prose
|
||||
|
||||
When `read_file` returns a KB-indexed document under `/documents/`, the response includes `<chunk id='…'>` blocks. Whenever a fact in your `action_summary` or `evidence.content_excerpt` came from a specific chunk, append `[citation:<chunk_id>]` to the sentence stating that fact, using the **exact** id from the `<chunk id='…'>` tag. The caller relays these markers to the end user verbatim, and the UI resolves each id by exact match against the database, so a wrong id silently breaks the citation.
|
||||
|
||||
### Where chunk ids live in `read_file` output
|
||||
|
||||
A KB document's XML has three numeric attributes — only **one** is a citation source:
|
||||
|
||||
```
|
||||
<document>
|
||||
<document_metadata>
|
||||
<document_id>42</document_id> ← NOT a citation. Parent doc id; ignore for citations.
|
||||
...
|
||||
</document_metadata>
|
||||
<chunk_index>
|
||||
<entry chunk_id="128" lines="14-22"/> ← Index hint; the same id also appears below.
|
||||
<entry chunk_id="129" lines="23-30" matched="true"/>
|
||||
</chunk_index>
|
||||
<document_content>
|
||||
<chunk id='128'><![CDATA[…]]></chunk> ← This is the citation source.
|
||||
<chunk id='129'><![CDATA[…]]></chunk>
|
||||
</document_content>
|
||||
</document>
|
||||
```
|
||||
|
||||
### Rules
|
||||
|
||||
- Use the **exact** id from a `<chunk id='…'>` tag whose content you actually quoted or paraphrased. Copy digit-for-digit; do **not** retype from memory.
|
||||
- Before emitting `[citation:N]`, confirm the literal substring `<chunk id='N'>` (or its index twin `chunk_id="N"`) appears in the tool result you are summarising this turn. If you can't see it, omit the citation.
|
||||
- Never cite `<document_id>` — that's the parent doc, not a chunk.
|
||||
- Never invent, normalise, shorten, or guess at adjacent ids. If unsure between two candidates, omit rather than pick.
|
||||
- Prefer **fewer accurate citations** over many speculative ones.
|
||||
- Multiple chunks supporting the same point → comma-separated and copied individually: `[citation:128], [citation:129]`.
|
||||
- Plain square brackets only — no markdown links, no parentheses, no footnote numbers.
|
||||
- Tool results without `<chunk id='…'>` (write/edit/move confirmations, `ls` / `glob` / `grep` listings, error strings) carry no chunk id and need none.
|
||||
- Populate `evidence.chunk_ids` with **only** ids you actually emitted in `[citation:…]` markers — same set, same digits.
|
||||
|
||||
## Examples
|
||||
|
||||
**Example 1 — happy path write (path discovered from existing convention):**
|
||||
|
|
|
|||
|
|
@ -35,6 +35,10 @@ Map outcomes to your `status`:
|
|||
|
||||
You construct the structured `evidence` fields from your own knowledge of what you called and what you observed — the tools do not return them. `chunk_ids` apply only to `<priority_documents>` hits; for local-file operations leave them `null`. Never report values you did not actually see.
|
||||
|
||||
## Chunk citations in your prose
|
||||
|
||||
In desktop mode your filesystem tools read local files only, and local-file tool results do **not** carry `<chunk id='…'>` tags. Do not emit `[citation:…]` markers in `action_summary` or `evidence.content_excerpt`, and leave `evidence.chunk_ids` `null` — the absolute path is the only reference for local-file work.
|
||||
|
||||
## Examples
|
||||
|
||||
**Example 1 — happy path write (path discovered from existing convention):**
|
||||
|
|
|
|||
|
|
@ -27,3 +27,42 @@ Reply in plain prose:
|
|||
- Cite every claim with an absolute path under `/documents/`.
|
||||
- If the workspace does not contain the requested information, say so explicitly. Do not fabricate paths or content.
|
||||
- If the question is genuinely ambiguous after a thorough lookup, list the candidates with their paths and stop.
|
||||
|
||||
## Chunk citations
|
||||
|
||||
When the evidence for a claim came from a `read_file` response that included `<chunk id='…'>` blocks (i.e. a KB-indexed document under `/documents/`), append `[citation:<chunk_id>]` to the sentence stating that claim. The caller passes these markers through to the end user verbatim, and the UI resolves each id by exact match against the database, so a wrong id silently breaks the citation.
|
||||
|
||||
### Where chunk ids live in `read_file` output
|
||||
|
||||
A KB document's XML has three numeric attributes — only **one** is a citation source:
|
||||
|
||||
```
|
||||
<document>
|
||||
<document_metadata>
|
||||
<document_id>42</document_id> ← NOT a citation. Parent doc id; ignore for citations.
|
||||
...
|
||||
</document_metadata>
|
||||
<chunk_index>
|
||||
<entry chunk_id="128" lines="14-22"/> ← Index hint; the same id also appears below.
|
||||
<entry chunk_id="129" lines="23-30" matched="true"/>
|
||||
</chunk_index>
|
||||
<document_content>
|
||||
<chunk id='128'><![CDATA[…]]></chunk> ← This is the citation source.
|
||||
<chunk id='129'><![CDATA[…]]></chunk>
|
||||
</document_content>
|
||||
</document>
|
||||
```
|
||||
|
||||
### Rules
|
||||
|
||||
- Use the **exact** id from a `<chunk id='…'>` tag whose content you actually quoted or paraphrased. Copy digit-for-digit; do **not** retype from memory.
|
||||
- Before emitting `[citation:N]`, confirm the literal substring `<chunk id='N'>` (or its index twin `chunk_id="N"`) appears in the tool result you are summarising this turn. If you can't see it, omit the citation.
|
||||
- Never cite `<document_id>` — that's the parent doc, not a chunk.
|
||||
- Never invent, normalise, shorten, or guess at adjacent ids. If unsure between two candidates, omit rather than pick.
|
||||
- Prefer **fewer accurate citations** over many speculative ones. One correct `[citation:128]` is more useful than a string of wrong ids.
|
||||
- Multiple chunks supporting the same point → comma-separated and copied individually: `[citation:128], [citation:129]`.
|
||||
- Plain square brackets only — no markdown links, no parentheses, no footnote numbers.
|
||||
- If a claim came from a tool result that did **not** carry a chunk id (`ls`, `glob`, `grep` listings, error strings, or files without `<chunk id='…'>`), skip the citation.
|
||||
- The absolute path under `/documents/` is always required; chunk citations are additive, they do not replace the path reference.
|
||||
|
||||
Example: `The Q2 roadmap lists three milestones (/documents/planning/q2-roadmap.md) [citation:128], [citation:129].`
|
||||
|
|
|
|||
|
|
@ -28,3 +28,7 @@ Reply in plain prose:
|
|||
- Cite every claim with an absolute path.
|
||||
- If the workspace does not contain the requested information, say so explicitly. Do not fabricate paths or content.
|
||||
- If the question is genuinely ambiguous after a thorough lookup, list the candidates with their paths and stop.
|
||||
|
||||
## Chunk citations
|
||||
|
||||
In desktop mode your filesystem tools read local files only, and local-file `read_file` responses do **not** carry `<chunk id='…'>` tags. Cite each claim with the absolute local path; do not emit `[citation:…]` markers — your caller has nothing to resolve them against.
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[project]
|
||||
name = "surf-new-backend"
|
||||
version = "0.0.24"
|
||||
version = "0.0.25"
|
||||
description = "SurfSense Backend"
|
||||
requires-python = ">=3.12"
|
||||
dependencies = [
|
||||
|
|
|
|||
2
surfsense_backend/uv.lock
generated
2
surfsense_backend/uv.lock
generated
|
|
@ -8121,7 +8121,7 @@ wheels = [
|
|||
|
||||
[[package]]
|
||||
name = "surf-new-backend"
|
||||
version = "0.0.24"
|
||||
version = "0.0.25"
|
||||
source = { editable = "." }
|
||||
dependencies = [
|
||||
{ name = "alembic" },
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue