feat(collection): doc_ids accepts str|list, design cleanups

- Collection.query and Backend.query/query_stream accept doc_ids as
  str, list[str] or None. Single str is normalized to [str] inside each
  backend; bare [] is rejected with ValueError at both layers.
- wrap_with_doc_context wraps the scoped doc list in <docs>...</docs>
  and SCOPED_SYSTEM_PROMPT instructs the agent to treat that block as
  data, not instructions (defense against prompt injection via
  auto-generated doc_description).
- _require_cloud_api now distinguishes api_key="" from api_key=None;
  the former gives a targeted error pointing at the empty-string vs
  fall-back-to-local situation when legacy SDK methods are called.
- Legacy PageIndexClient.list_documents docstring spells out the
  return-shape difference vs collection.list_documents() to flag a
  silent migration footgun (paginated dict with id/name keys vs plain
  list[dict] with doc_id/doc_name keys).
- Remove dead CloudBackend.get_agent_tools stub (not on the Backend
  protocol; only ever returned an empty AgentTools()) and the
  SYSTEM_PROMPT alias (OPEN_/SCOPED_SYSTEM_PROMPT are the explicit
  names now).
- README quick start and streaming example now pass doc_ids; new
  multi-document section shows both str and list forms.
- examples/demo_query_modes.py exercises all five query-mode cases
  (single-doc, multi-doc with/without env var, scoped single, scoped
  multi) for manual verification.
This commit is contained in:
mountain 2026-05-15 17:03:17 +08:00
parent d7b36aaf3f
commit a47c36a3f5
13 changed files with 322 additions and 45 deletions

View file

@ -160,7 +160,7 @@ client = PageIndexClient(model="gpt-4o-2024-11-20")
col = client.collection()
doc_id = col.add("path/to/your.pdf")
print(col.query("What is the main contribution?", doc_ids=[doc_id]))
print(col.query("What is the main contribution?", doc_ids=doc_id))
# Cloud mode — fully managed, no LLM key needed:
# client = PageIndexClient(api_key="your-pageindex-api-key")
@ -174,7 +174,7 @@ print(col.query("What is the main contribution?", doc_ids=[doc_id]))
import asyncio
async def main():
async for ev in col.query("Explain multi-head attention", stream=True):
async for ev in col.query("Explain multi-head attention", doc_ids=doc_id, stream=True):
if ev.type == "answer_delta":
print(ev.data, end="", flush=True)
elif ev.type == "tool_call":
@ -187,10 +187,11 @@ asyncio.run(main())
### Multi-document collections (experimental)
Passing `doc_ids` scopes the query to a specific subset of documents — this is the recommended path:
Passing `doc_ids` scopes the query to a specific subset of documents — this is the recommended path. `doc_ids` accepts a single id (`str`) or a list:
```python
col.query("Compare these two papers", doc_ids=[doc1, doc2])
col.query("What does this paper say?", doc_ids=doc1) # single
col.query("Compare these two papers", doc_ids=[doc1, doc2]) # multi
```
Omitting `doc_ids` queries the **entire collection** and lets the agent pick which docs to read. This is an **experimental** feature with a naive first implementation — we're actively working on better cross-document retrieval. A `UserWarning` is emitted; set `PAGEINDEX_EXPERIMENTAL_MULTIDOC=1` to silence it.