mirror of https://github.com/VectifyAI/PageIndex.git synced 2026-05-19 18:35:16 +02:00

mountain d7b36aaf3f feat(collection): scoped query mode and experimental multi-doc warning

- get_agent_tools branches on doc_ids:
  - scoped (doc_ids=[...]): drops list_documents and hard-enforces a
    whitelist on the remaining tools; system prompt switches to
    SCOPED_SYSTEM_PROMPT (no list_documents instruction); doc list +
    summaries are prepended to the user message via wrap_with_doc_context.
  - open (doc_ids=None): unchanged 4-tool agent loop.
- list_documents now exposes doc_description (sqlite + cloud).
- Collection.query emits UserWarning when doc_ids is None and the
  collection holds >1 documents; PAGEINDEX_EXPERIMENTAL_MULTIDOC=1
  silences it. Single-doc collections skip the warning; empty
  collections raise ValueError.
- Agents SDK tracing upload disabled by default (avoids SSL timeouts);
  PAGEINDEX_AGENTS_TRACING=1 re-enables it.
- README: new SDK Usage section covering local/cloud quick start,
  streaming, multi-doc as experimental, and runnable examples.

2026-05-15 11:14:12 +08:00

PageIndex: Vectorless, Reasoning-based RAG

Reasoning-based RAG ◦ No Vector DB ◦ No Chunking ◦ Human-like Retrieval

🌐 Homepage • 🖥️ Chat Platform • 🔌 MCP & API • 📖 Docs • 💬 Discord • ✉️ Contact

📢 Updates

🔥 Agentic Vectorless RAG — A simple agentic, vectorless RAG example with self-hosted PageIndex, using OpenAI Agents SDK.
PageIndex Chat — Human-like document analysis agent platform for professional long documents. Also available via MCP or API.
PageIndex Framework — Deep dive into PageIndex: an agentic, in-context tree index that enables LLMs to perform reasoning-based, human-like retrieval over long documents.

📑 Introduction to PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance — what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

Inspired by AlphaGo, we propose PageIndex — a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval. It simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. PageIndex performs retrieval in two steps:

Generate a “Table-of-Contents” tree structure index of documents
Perform reasoning-based retrieval through tree search
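The two steps above can be illustrated with a toy sketch. This is not PageIndex's implementation: the tree is hand-written illustrative data (the `0000` and `0001` nodes and all summaries are made up), and `keyword_score` is a hypothetical stand-in for the LLM's relevance judgment. It only shows the shape of a greedy descent through a table-of-contents tree.

```python
from typing import Callable

Node = dict  # a node has "title", "node_id", optional "summary" and "nodes"


def keyword_score(query: str, node: Node) -> float:
    """Hypothetical stand-in for LLM reasoning: share of query words found in the node text."""
    words = set(query.lower().split())
    text = (node.get("title", "") + " " + node.get("summary", "")).lower()
    return sum(w in text for w in words) / max(len(words), 1)


def tree_search(root: Node, query: str,
                score: Callable[[str, Node], float] = keyword_score) -> list:
    """Greedy descent: at each level follow the best-scoring child.

    Returns the node_id path from the root to the selected leaf.
    """
    path = [root["node_id"]]
    node = root
    while node.get("nodes"):
        node = max(node["nodes"], key=lambda child: score(query, child))
        path.append(node["node_id"])
    return path


# Step 1 (normally generated by PageIndex): a tiny hand-written tree.
toc = {
    "title": "Annual Report", "node_id": "0000", "summary": "",
    "nodes": [
        {"title": "Monetary Policy", "node_id": "0001", "summary": "interest rates"},
        {"title": "Financial Stability", "node_id": "0006",
         "summary": "stability of the financial system",
         "nodes": [
             {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
              "summary": "monitoring risks"},
             {"title": "Domestic and International Cooperation", "node_id": "0008",
              "summary": "cooperation abroad"},
         ]},
    ],
}

# Step 2: search the tree for the most relevant section.
print(tree_search(toc, "monitoring financial vulnerabilities"))
```

In a real system the `score` callable would be replaced by a model call that reasons over titles and summaries; the traversal logic stays the same.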

🎯 Core Features

Compared to traditional vector-based RAG, PageIndex features:

No Vector DB: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
No Chunking: Documents are organized into natural sections, not artificial chunks.
Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
Better Explainability and Traceability: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).

PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis. See our blog post for details.

📍 Explore PageIndex

To learn more, please see a detailed introduction to the PageIndex framework. Check out this GitHub repo for open-source code, and the cookbooks, tutorials, and blog for additional usage guides and examples.

The PageIndex service is available as a ChatGPT-style chat platform, or can be integrated via MCP or API.

🛠️ Deployment Options

Self-host — run locally with this open-source repo.
Cloud Service — try instantly with our Chat Platform, or integrate via MCP or API.
Enterprise — private or on-prem deployment. Contact us or book a demo for more details.

🧪 Quick Hands-on

🔥 Agentic Vectorless RAG (latest) — a simple but complete agentic vectorless RAG example with self-hosted PageIndex, using OpenAI Agents SDK.
Try the Vectorless RAG notebook — a minimal, hands-on example of reasoning-based RAG using PageIndex.
Check out Vision-based Vectorless RAG — no OCR; a minimal, vision-based & reasoning-native RAG pipeline that works directly over page images.

🌲 PageIndex Tree Structure

PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

Below is an example PageIndex tree structure. Also see more example documents and generated tree structures.

...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
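As a rough illustration of how such a tree can be consumed, the sketch below loads the example node above, prints an outline, and looks up which nodes cover a given page. The helper names `iter_nodes` and `nodes_covering_page` are hypothetical, and treating `start_index` and `end_index` as inclusive page bounds is an assumption, not something this README states.

```python
import json

# The example node from above, with the truncated summaries kept as-is.
tree_json = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def nodes_covering_page(root, page):
    """Return node_ids whose page range contains `page` (bounds assumed inclusive)."""
    return [n["node_id"] for _, n in iter_nodes(root)
            if n["start_index"] <= page <= n["end_index"]]


tree = json.loads(tree_json)

# Print an indented outline of the tree.
for depth, n in iter_nodes(tree):
    print("  " * depth + f"{n['node_id']} {n['title']} "
          f"(pages {n['start_index']}-{n['end_index']})")

print(nodes_covering_page(tree, 28))
```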

You can generate the PageIndex tree structure with this open-source repo, or use our API.

🚀 SDK Usage

A unified PageIndexClient powers both local self-hosted and cloud-managed modes. The mode is detected automatically from whether you pass an api_key.

Install

pip install pageindex

Quick start

from pageindex import PageIndexClient

# Local mode — uses your LLM key (e.g. OPENAI_API_KEY in env).
client = PageIndexClient(model="gpt-4o-2024-11-20")

col = client.collection()
doc_id = col.add("path/to/your.pdf")

print(col.query("What is the main contribution?", doc_ids=[doc_id]))

# Cloud mode — fully managed, no LLM key needed:
# client = PageIndexClient(api_key="your-pageindex-api-key")

col.query(...) returns the answer string by default. Always pass doc_ids for reliable single-document QA — omitting it queries the entire collection, which is experimental (see below).

Streaming queries

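import asyncio

# Reuses `col` and `doc_id` from the quick start above.
async def main():
    async for ev in col.query("Explain multi-head attention", doc_ids=[doc_id], stream=True):
        if ev.type == "answer_delta":
            # Incremental piece of the answer text.
            print(ev.data, end="", flush=True)
        elif ev.type == "tool_call":
            # The agent invoked one of its tools.
            print(f"\n[tool] {ev.data['name']}")

asyncio.run(main())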

ev.type is one of: tool_call, tool_result, answer_delta, answer_done.
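If you want the full answer as a string while still recording which tools ran, the events can be folded into a small accumulator. The sketch below runs without PageIndex: `FakeEvent` is a hypothetical stand-in that mirrors only the `.type` and `.data` attributes used above, the tool name is made up, and the payloads of `tool_result` and `answer_done` are not documented here, so they are ignored.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a streamed event; only .type and .data are assumed."""
    type: str
    data: Any = None


def collect(events: Iterable[FakeEvent]) -> Tuple[str, List[str]]:
    """Accumulate answer text and tool names from a sequence of events."""
    answer_parts: List[str] = []
    tools: List[str] = []
    for ev in events:
        if ev.type == "answer_delta":
            answer_parts.append(ev.data)
        elif ev.type == "tool_call":
            tools.append(ev.data["name"])
        elif ev.type == "answer_done":
            break  # nothing more to read after the answer completes
        # tool_result events are skipped: their payload shape is not assumed here
    return "".join(answer_parts), tools


# Simulated stream; "example_tool" is a made-up tool name.
events = [
    FakeEvent("tool_call", {"name": "example_tool"}),
    FakeEvent("tool_result"),
    FakeEvent("answer_delta", "Multi-head "),
    FakeEvent("answer_delta", "attention ..."),
    FakeEvent("answer_done"),
]

answer, tools = collect(events)
print(tools)
print(answer)
```

The real stream is asynchronous, so in practice the same branching would sit inside the `async for` loop shown above.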

Multi-document collections (experimental)

Passing doc_ids scopes the query to a specific subset of documents — this is the recommended path:

col.query("Compare these two papers", doc_ids=[doc1, doc2])

Omitting doc_ids queries the entire collection and lets the agent pick which docs to read. This is an experimental feature with a naive first implementation — we're actively working on better cross-document retrieval. A UserWarning is emitted; set PAGEINDEX_EXPERIMENTAL_MULTIDOC=1 to silence it.
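Because the notice is a standard Python `UserWarning`, the `warnings` module can record it, silence it, or escalate it in tests. The sketch below does not call PageIndex: `query_stub` is a hypothetical stand-in that mimics the behavior described here and in the commit notes at the top of this page (warn only when `doc_ids` is omitted and the collection holds more than one document, raise `ValueError` for an empty collection).

```python
import os
import warnings

ENV_FLAG = "PAGEINDEX_EXPERIMENTAL_MULTIDOC"


def query_stub(doc_count, doc_ids=None):
    """Hypothetical stand-in mimicking the documented warning rules; not PageIndex code."""
    if doc_count == 0:
        raise ValueError("collection is empty")
    if doc_ids is None and doc_count > 1 and os.environ.get(ENV_FLAG) != "1":
        warnings.warn("multi-document query is experimental", UserWarning, stacklevel=2)
    return "answer"


def warns(doc_count, doc_ids=None):
    """Return True if the call emits a UserWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are recorded
        query_stub(doc_count, doc_ids)
    return any(issubclass(w.category, UserWarning) for w in caught)


os.environ.pop(ENV_FLAG, None)
print(warns(3))             # unscoped query over several documents
print(warns(3, ["doc-a"]))  # scoped query
print(warns(1))             # single-document collection

os.environ[ENV_FLAG] = "1"  # opt in to the experimental mode
print(warns(3))
os.environ.pop(ENV_FLAG, None)  # leave the environment as we found it
```

In a test suite, `warnings.simplefilter("error", UserWarning)` would turn an accidental unscoped query into a failure instead of a log line.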

Environment variables

| Variable | Effect |
| --- | --- |
| `OPENAI_API_KEY` (or any LiteLLM `<PROVIDER>_API_KEY`) | LLM provider key (local mode) |
| `PAGEINDEX_API_KEY` | PageIndex cloud key (cloud mode) |
| `PAGEINDEX_EXPERIMENTAL_MULTIDOC` | Set to `1` to silence the warning when calling `col.query(...)` without `doc_ids` |
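Before constructing a client, it can help to check which of these credentials are actually set. The helper below is a hypothetical convenience, not part of the SDK. It only reports which variables are present and makes no claim about how PageIndexClient itself resolves them.

```python
from typing import Dict, List


def credential_report(env: Dict[str, str]) -> Dict[str, List[str]]:
    """List non-empty credential variables, grouped by the mode they serve."""
    provider_keys = sorted(
        name for name, value in env.items()
        if name.endswith("_API_KEY") and name != "PAGEINDEX_API_KEY" and value
    )
    cloud_keys = ["PAGEINDEX_API_KEY"] if env.get("PAGEINDEX_API_KEY") else []
    return {"local": provider_keys, "cloud": cloud_keys}


# Example with placeholder values; pass dict(os.environ) to inspect the real environment.
example_env = {
    "OPENAI_API_KEY": "sk-example",
    "ANTHROPIC_API_KEY": "",          # empty values are treated as unset
    "PAGEINDEX_API_KEY": "pi-example",
    "HOME": "/home/user",
}
print(credential_report(example_env))
```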

Runnable examples

examples/local_demo.py — local mode end-to-end (index a PDF + streaming QA)
examples/cloud_demo.py — cloud mode end-to-end
examples/agentic_vectorless_rag_demo.py — lower-level integration with the OpenAI Agents SDK

⚙️ Package Usage

You can follow these steps to generate a PageIndex tree from a PDF document.

1. Install dependencies

pip3 install --upgrade -r requirements.txt

2. Set your LLM API key

Create a .env file in the root directory containing your LLM API key. Other LLM providers are supported through LiteLLM:

OPENAI_API_KEY=your_openai_key_here

3. Generate PageIndex structure for your PDF

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

Optional parameters

You can customize the processing with additional optional arguments:

--model                   LLM model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Pages to check for table of contents (default: 20)
--max-pages-per-node      Max pages per node (default: 10)
--max-tokens-per-node     Max tokens per node (default: 20000)
--if-add-node-id          Add node ID (yes/no, default: yes)
--if-add-node-summary     Add node summary (yes/no, default: yes)
--if-add-doc-description  Add doc description (yes/no, default: yes)
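When scripting runs over many files, the command line can be assembled programmatically. The sketch below only builds and prints the command using the flags listed above, spelled exactly as in this README. `build_command` is a hypothetical helper, and the option values are arbitrary examples.

```python
import shlex
from typing import Dict, List, Optional


def build_command(pdf_path: str, options: Optional[Dict[str, object]] = None) -> List[str]:
    """Build the run_pageindex.py invocation as an argument list."""
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for flag, value in (options or {}).items():
        cmd += [flag, str(value)]
    return cmd


cmd = build_command(
    "/path/to/your/document.pdf",
    {"--max-pages-per-node": 5, "--if-add-node-summary": "no"},
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the repository root would execute it; using a list rather than a single string avoids shell-quoting problems with unusual file names.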

Markdown support

We also provide markdown support for PageIndex. You can use the `--md_path` flag to generate a tree structure for a markdown file.

python3 run_pageindex.py --md_path /path/to/your/document.md

Note: in this mode, the number of "#" characters at the start of a heading determines the node's level. For example, "#" is level 1, "##" is level 2, "###" is level 3, and so on. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this mode, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our PageIndex OCR, which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this mode.
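To make the heading rule concrete, here is a minimal sketch of turning "#" headings into a nested tree with a stack. It is not the parser used by run_pageindex.py. The function name and the output shape (`title`, `level`, `nodes`) are assumptions for illustration, and it does not handle headings inside fenced code blocks.

```python
import re

# One to six "#" characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_heading_tree(markdown: str) -> list:
    """Nest headings by level: each heading becomes a child of the nearest shallower one."""
    roots = []
    stack = []  # (level, node) pairs for the currently open ancestors
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue  # body text is ignored in this sketch
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level, "nodes": []}
        # Close headings at the same or a deeper level before attaching.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent_children = stack[-1][1]["nodes"] if stack else roots
        parent_children.append(node)
        stack.append((level, node))
    return roots


doc = """# Report
Intro text.
## Financial Stability
### Monitoring
## Outlook
"""

tree = build_heading_tree(doc)
print([child["title"] for child in tree[0]["nodes"]])
```

A file whose heading levels were flattened by a converter (for example, every heading emitted as "##") would produce a single flat list here, which is why the note above cautions against converted Markdown.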

Agentic Vectorless RAG: An Example

For a simple, end-to-end agentic vectorless RAG example using PageIndex with OpenAI Agents SDK, see examples/agentic_vectorless_rag_demo.py.

# Install optional dependency
pip3 install openai-agents

# Run the demo
python3 examples/agentic_vectorless_rag_demo.py

📈 Case Study: PageIndex Leads Finance QA Benchmark

Mafin 2.5 is a reasoning-based RAG system for financial document analysis, powered by PageIndex. It achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.

Explore the full benchmark results and our blog post for detailed comparisons and performance metrics.

🧭 Resources

📝 Blog: technical articles, research insights, and product updates.
🔧 Developer: MCP setup, API docs, and integration guides.
🧪 Cookbooks: hands-on, runnable examples and advanced use cases.
📖 Tutorials: practical guides and strategies, including Document Search and Tree Search.

⭐ Support Us

Leave us a star 🌟 if you like our project. Thank you!

Please cite this work as:

Mingtian Zhang, Yu Tang and PageIndex Team,
"PageIndex: Next-Generation Vectorless, Reasoning-based RAG",
PageIndex Blog, Sep 2025.

Or use the BibTeX citation.

@article{zhang2025pageindex,
  author = {Mingtian Zhang and Yu Tang and PageIndex Team},
  title = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},
  journal = {PageIndex Blog},
  year = {2025},
  month = {September},
  note = {https://pageindex.ai/blog/pageindex-intro},
}
