mirror of https://github.com/VectifyAI/PageIndex.git synced 2026-04-24 23:56:21 +02:00

📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG https://pageindex.ai



PageIndex

Document Index System for Reasoning-Based RAG

Frustrated with vector database retrieval accuracy for long professional documents? You need a reasoning-based native index for your RAG system.

Traditional vector-based retrieval relies heavily on semantic similarity. But when working with professional documents that require domain expertise and multi-step reasoning, similarity search often falls short.

Reasoning-based RAG offers a better alternative: it lets LLMs think and reason their way to the most relevant document sections. Inspired by AlphaGo, we use tree search to perform structured document retrieval.

PageIndex is an indexing system that builds search trees from long documents, making them ready for reasoning-based RAG.

Built by Vectify AI

🔍 What is PageIndex?

PageIndex transforms lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It’s ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any other document that exceeds LLM context limits.

✅ Key Features

Scales to Massive Documents: Designed to handle documents of hundreds or even thousands of pages.
Hierarchical Tree Structure: Enables LLMs to traverse documents logically, like an intelligent, LLM-optimized table of contents.
Precise Page Referencing: Every node contains its own summary and the physical page indices where it starts and ends, allowing pinpoint retrieval.
Chunk-Free Segmentation: No arbitrary chunking. Nodes follow the natural structure of the document.
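
As a rough illustration of page referencing, the sketch below slices a document's pages by a node's start and end index. It assumes the pages are already available as a list of strings and that the indices are 1-based and inclusive. Both are assumptions made for illustration, not confirmed PageIndex behavior, and `pages_for_node` is a hypothetical helper rather than part of PageIndex.

```python
from typing import Dict, List


def pages_for_node(pages: List[str], node: Dict) -> List[str]:
    """Return the page texts covered by a node.

    Assumes start_index/end_index are 1-based, inclusive page numbers.
    """
    start = node["start_index"]
    end = node["end_index"]
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {start}-{end}")
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return pages[start - 1:end]


# Hypothetical 30-page document where each page is just a label.
pages = [f"text of page {n}" for n in range(1, 31)]
node = {
    "title": "Monitoring Financial Vulnerabilities",
    "start_index": 22,
    "end_index": 28,
}
selected = pages_for_node(pages, node)
```

Because a node maps to a contiguous page range, only those pages need to be passed to the LLM when answering a question.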

📦 PageIndex Format

Here is an example output. See more example documents and generated trees.

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

Note: node_id and summary generation are controlled by the --if-add-node-id and --if-add-node-summary options described under Usage.
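
The nested "nodes" lists can be traversed with a short recursive generator. The sketch below is a minimal example, assuming a single root node shaped like the example above; `walk`, `by_id`, and `outline` are illustrative names, not part of PageIndex.

```python
from typing import Dict, Iterator, Tuple

# Trimmed copy of the example tree above (summaries omitted).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
        },
    ],
}


def walk(node: Dict, depth: int = 0) -> Iterator[Tuple[int, Dict]]:
    """Yield (depth, node) pairs in depth-first, document order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


# Lookup table from node_id to node, useful for resolving ids picked by an LLM.
by_id = {node["node_id"]: node for _, node in walk(tree)}

# Indented outline, similar to a table of contents.
outline = [
    f"{'  ' * depth}{node['node_id']} {node['title']} "
    f"(pp. {node['start_index']}-{node['end_index']})"
    for depth, node in walk(tree)
]
print("\n".join(outline))
```

The `by_id` table makes it cheap to turn a list of node ids back into titles and page ranges.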

🧠 Reasoning-Based RAG with PageIndex

Use PageIndex to build reasoning-based retrieval systems without relying on semantic similarity. Great for domain-specific tasks where nuance matters.

🛠️ Example Prompt

prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
  "thinking": <reasoning about where to look>,
  "node_list": [node_id1, node_id2, ...]
}}
"""
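
One way to wire this prompt into a retrieval step is sketched below. It is a minimal sketch, not the PageIndex implementation: `build_prompt`, `select_nodes`, and `fake_llm` are hypothetical names, the model call is replaced by a stub that returns a canned reply, and the code assumes the model answers with valid JSON.

```python
import json
from typing import Callable, Dict, List


def build_prompt(question: str, structure: Dict) -> str:
    """Fill the example prompt with a question and a document tree."""
    return f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {json.dumps(structure, indent=2)}

Reply in the following JSON format:
{{
  "thinking": <reasoning about where to look>,
  "node_list": [node_id1, node_id2, ...]
}}
"""


def select_nodes(
    question: str, structure: Dict, ask_llm: Callable[[str], str]
) -> List[str]:
    """Send the prompt through ask_llm and return the node ids it picked."""
    reply = ask_llm(build_prompt(question, structure))
    parsed = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    return [str(node_id) for node_id in parsed.get("node_list", [])]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns a canned reply."""
    return json.dumps(
        {
            "thinking": "Vulnerability monitoring is covered in node 0007.",
            "node_list": ["0007"],
        }
    )


structure = {
    "title": "Financial Stability",
    "node_id": "0006",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"}
    ],
}
picked = select_nodes("How does the Fed monitor vulnerabilities?", structure, fake_llm)
```

In a real system, `ask_llm` would wrap a call to the chosen model, and the returned ids would be resolved to page ranges through the tree.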

🚀 Usage

Follow these steps to generate a PageIndex tree from a PDF document.

1. Install dependencies

pip3 install -r requirements.txt

2. Set your OpenAI API key

Create a .env file in the root directory and add your API key:

CHATGPT_API_KEY=your_openai_key_here
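
Before running PageIndex, it can help to confirm that the key is visible. The standard-library sketch below is an independent sanity check, not how PageIndex itself reads the file (that is not shown here). `read_env_file` and `get_api_key` are hypothetical helpers, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(path: str = ".env") -> str:
    """Return CHATGPT_API_KEY from the environment, else from the .env file."""
    key = os.environ.get("CHATGPT_API_KEY") or read_env_file(path).get(
        "CHATGPT_API_KEY", ""
    )
    if not key:
        raise RuntimeError("CHATGPT_API_KEY is not set; add it to .env")
    return key
```

Calling `get_api_key()` from the repository root raises a clear error if the key is missing, instead of failing later during an API request.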

3. Run PageIndex on your PDF

python3 page_index.py --pdf_path /path/to/your/document.pdf

You can customize the processing with additional optional arguments:

--model                   OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Pages to check for table of contents (default: 20)
--max-pages-per-node      Max pages per node (default: 10)
--max-tokens-per-node     Max tokens per node (default: 20000)
--if-add-node-id          Add node ID (yes/no, default: yes)
--if-add-node-summary     Add node summary (yes/no, default: no)
--if-add-doc-description  Add doc description (yes/no, default: yes)
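
When PageIndex is run from a script, the command line can be assembled programmatically. The sketch below is a minimal example, assuming it is executed from the repository root. `build_command` is a hypothetical helper, and it uses the current interpreter (`sys.executable`) in place of `python3`. Only flags listed above should be passed.

```python
import subprocess
import sys
from typing import List


def build_command(pdf_path: str, **options: object) -> List[str]:
    """Build the page_index.py command line from keyword options.

    Keyword names use underscores and are turned into the hyphenated
    flags listed above, e.g. max_pages_per_node -> --max-pages-per-node.
    """
    command = [sys.executable, "page_index.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


command = build_command(
    "/path/to/your/document.pdf",
    max_pages_per_node=5,
    if_add_node_summary="yes",
)
print(" ".join(command))

# Uncomment to actually run PageIndex (needs the repository and an API key).
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with PDF paths that contain spaces.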

🛤 Roadmap

Document-level retrieval
Technical report on PageIndex design
Efficient tree search algorithms for large documents
Integration with vector-based semantic retrieval

📈 Case Study: Mafin 2.5

Mafin 2.5 is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of PageIndex, it achieved 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex’s hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 See full benchmark results for detailed comparisons and performance metrics.

📬 Contact Us

Need customized support for your documents or reasoning-based RAG system?

📢 Join our Discord

✉️ Leave us a Message
