mirror of
https://github.com/VectifyAI/PageIndex.git
synced 2026-05-19 18:35:16 +02:00
feat(collection): scoped query mode and experimental multi-doc warning
- get_agent_tools branches on doc_ids:
  - scoped (doc_ids=[...]): drops list_documents and hard-enforces a
    whitelist on the remaining tools; system prompt switches to
    SCOPED_SYSTEM_PROMPT (no list_documents instruction); doc list +
    summaries are prepended to the user message via wrap_with_doc_context.
  - open (doc_ids=None): unchanged 4-tool agent loop.
- list_documents now exposes doc_description (sqlite + cloud).
- Collection.query emits UserWarning when doc_ids is None and the
  collection holds >1 documents; PAGEINDEX_EXPERIMENTAL_MULTIDOC=1
  silences it. Single-doc collections skip the warning; empty
  collections raise ValueError.
- Agents SDK tracing upload disabled by default (avoids SSL timeouts);
  PAGEINDEX_AGENTS_TRACING=1 re-enables it.
- README: new SDK Usage section covering local/cloud quick start,
  streaming, multi-doc as experimental, and runnable examples.
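The whitelist enforcement in scoped mode can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: `scope_tool`, `get_page_content`, and the error string are hypothetical names chosen for illustration. Only `doc_ids` and the idea of rejecting documents outside the scope come from the message above.

```python
from functools import wraps
from typing import Callable, Iterable


def scope_tool(tool: Callable[..., str], allowed_doc_ids: Iterable[str]) -> Callable[..., str]:
    """Wrap a tool so it refuses any doc_id outside the whitelist (hypothetical helper)."""
    allowed = frozenset(allowed_doc_ids)

    @wraps(tool)
    def guarded(doc_id: str, *args, **kwargs) -> str:
        if doc_id not in allowed:
            # Return an error string instead of raising, so an agent loop can recover.
            return f"error: document {doc_id!r} is outside the query scope"
        return tool(doc_id, *args, **kwargs)

    return guarded


def get_page_content(doc_id: str, page: int) -> str:
    """Stand-in for a real retrieval tool (hypothetical name and signature)."""
    return f"{doc_id}:page-{page}"


# Only "doc-a" is in scope; any other id is rejected before the tool runs.
scoped = scope_tool(get_page_content, allowed_doc_ids=["doc-a"])
print(scoped("doc-a", 1))
print(scoped("doc-b", 1))
```

Enforcing the check in a wrapper, rather than relying on the system prompt alone, means a model that names an out-of-scope document still cannot read it.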
parent cbea31d1a2
commit d7b36aaf3f
8 changed files with 348 additions and 25 deletions
72
README.md
@@ -139,6 +139,78 @@ You can generate the PageIndex tree structure with this open-source repo, or use

---

# 🚀 SDK Usage

A unified `PageIndexClient` powers both local self-hosted and cloud-managed modes. Mode is auto-detected by whether you pass an `api_key`.

### Install

```bash
pip install pageindex
```

### Quick start

```python
from pageindex import PageIndexClient

# Local mode: uses your LLM key (e.g. OPENAI_API_KEY in env).
client = PageIndexClient(model="gpt-4o-2024-11-20")

col = client.collection()
doc_id = col.add("path/to/your.pdf")

print(col.query("What is the main contribution?", doc_ids=[doc_id]))

# Cloud mode: fully managed, no LLM key needed.
# client = PageIndexClient(api_key="your-pageindex-api-key")
```

`col.query(...)` returns the answer string by default. Always pass `doc_ids` for reliable single-document QA; omitting it queries the entire collection, which is experimental (see below).

### Streaming queries

```python
import asyncio

async def main():
    async for ev in col.query("Explain multi-head attention", stream=True):
        if ev.type == "answer_delta":
            print(ev.data, end="", flush=True)
        elif ev.type == "tool_call":
            print(f"\n[tool] {ev.data['name']}")

asyncio.run(main())
```

`ev.type` is one of: `tool_call`, `tool_result`, `answer_delta`, `answer_done`.
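
As a rough illustration of how these event types fit together, the sketch below folds a stream of events into the final answer text and the list of tools that were called. The `Event` dataclass, `fake_stream`, and `collect` are stand-ins invented for this example, and the payload shapes (a string for `answer_delta`, a dict with a `name` key for `tool_call`) are inferred from the snippet above, not taken from the SDK's documentation.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator


@dataclass
class Event:
    """Stand-in for the SDK's stream event (hypothetical shape)."""
    type: str
    data: Any = None


async def fake_stream() -> AsyncIterator[Event]:
    """Simulated stream; a real one would come from col.query(..., stream=True)."""
    for ev in [
        Event("tool_call", {"name": "get_page_content"}),  # hypothetical tool name
        Event("tool_result", "..."),
        Event("answer_delta", "Multi-head "),
        Event("answer_delta", "attention."),
        Event("answer_done"),
    ]:
        yield ev


async def collect(stream: AsyncIterator[Event]) -> tuple[str, list[str]]:
    """Accumulate answer deltas and tool names until answer_done arrives."""
    parts: list[str] = []
    tools: list[str] = []
    async for ev in stream:
        if ev.type == "answer_delta":
            parts.append(ev.data)
        elif ev.type == "tool_call":
            tools.append(ev.data["name"])
        elif ev.type == "answer_done":
            break
    return "".join(parts), tools


answer, tools = asyncio.run(collect(fake_stream()))
print(answer)
print(tools)
```

Collecting deltas into a list and joining once at the end avoids repeated string concatenation on long answers.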

### Multi-document collections (experimental)

Passing `doc_ids` scopes the query to a specific subset of documents. This is the recommended path:

```python
col.query("Compare these two papers", doc_ids=[doc1, doc2])
```

Omitting `doc_ids` queries the **entire collection** and lets the agent pick which docs to read. This is an **experimental** feature with a naive first implementation; we're actively working on better cross-document retrieval. A `UserWarning` is emitted when the collection holds more than one document; set `PAGEINDEX_EXPERIMENTAL_MULTIDOC=1` to silence it.
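
The warning rules can be summarized in a few lines. The following is a sketch of the behavior described here and in the commit message, not the library's source: `check_query_scope` is a hypothetical helper, the warning text is made up, and reading the environment variable at call time is an assumption.

```python
import os
import warnings
from typing import Optional, Sequence


def check_query_scope(doc_ids: Optional[Sequence[str]], doc_count: int) -> None:
    """Hypothetical helper mirroring the documented warning rules."""
    if doc_count == 0:
        # Per the commit message, empty collections raise ValueError.
        raise ValueError("cannot query an empty collection")
    if doc_ids is not None or doc_count == 1:
        return  # scoped query or single-document collection: no warning
    if os.environ.get("PAGEINDEX_EXPERIMENTAL_MULTIDOC") == "1":
        return  # user opted in to the experimental path
    warnings.warn(
        "Querying a multi-document collection without doc_ids is experimental.",
        UserWarning,
        stacklevel=2,
    )


# Make sure the opt-in variable is unset for this demonstration.
os.environ.pop("PAGEINDEX_EXPERIMENTAL_MULTIDOC", None)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_query_scope(None, doc_count=3)       # open query over 3 docs: warns
    check_query_scope(["doc-a"], doc_count=3)  # scoped query: silent
    check_query_scope(None, doc_count=1)       # single document: silent
print(len(caught))
```

Only the first call triggers the warning, so the recorded list holds a single `UserWarning`.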

### Environment variables

| Variable | Effect |
|---|---|
| `OPENAI_API_KEY` (or any LiteLLM `<PROVIDER>_API_KEY`) | LLM provider key (local mode) |
| `PAGEINDEX_API_KEY` | PageIndex cloud key (cloud mode) |
| `PAGEINDEX_EXPERIMENTAL_MULTIDOC` | Set to `1` to silence the warning when calling `col.query(...)` without `doc_ids` |

### Runnable examples

- [`examples/local_demo.py`](examples/local_demo.py): local mode end-to-end (index a PDF + streaming QA)
- [`examples/cloud_demo.py`](examples/cloud_demo.py): cloud mode end-to-end
- [`examples/agentic_vectorless_rag_demo.py`](examples/agentic_vectorless_rag_demo.py): lower-level integration with the OpenAI Agents SDK

---

# ⚙️ Package Usage

You can follow these steps to generate a PageIndex tree from a PDF document.