
<div align="center">
  
<a href="https://vectify.ai/pageindex" target="_blank">
  <img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
</a>

<br/>
<br/>

<p align="center">
  <a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

<p align="center"><b>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>

<h4 align="center">
  <a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
  <a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
  <a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
  <a href="https://docs.pageindex.ai/quickstart">📚 API Docs</a>&nbsp; • &nbsp;
  <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
  <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>&nbsp;
</h4>
  
</div>

---

<details open>
<summary><h2>📢 Recent Updates</h2></summary>

 **🔥 New Releases:**
- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent platform built for long professional documents. It can also be integrated via [MCP](https://pageindex.ai/mcp) or the [API](https://docs.pageindex.ai/quickstart) (beta).
<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex’s advanced long-document intelligence directly into your applications and workflows. -->
<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. -->
 
 **✍️ Articles:**
- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context* *tree index* that enables LLMs to perform *reasoning-based*, *human-like retrieval* over long documents, without vector DB or chunking.
<!-- - [Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr): Explores how vision-based, reasoning-native RAG challenges the traditional OCR pipeline, and why the future of document AI might be *vectorless* and *vision-based*. -->

 **🧪 Cookbooks:**
- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex — no vectors, no chunking, and human-like retrieval.
- [Vision-based vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Experience OCR-free document understanding through PageIndex’s visual retrieval workflow that retrieves and reasons directly over PDF page images.
</details>

# 📑 Introduction to PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a **_vectorless_**, **reasoning-based RAG** system that builds a *hierarchical tree index* for long documents and *reasons* over that index for *retrieval*. It simulates how **human experts** navigate and extract knowledge from complex documents through **tree search**, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:

1. Generate a "Table-of-Contents" **tree structure index** of documents
2. Perform reasoning-based retrieval through **tree search**

<div align="center">
    <img src="https://docs.pageindex.ai/images/cookbook/vectorless-rag.png" width="70%">
</div>
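
The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PageIndex's actual implementation: the `is_relevant` callback stands in for an LLM judgment over a node's title and summary, and the `keyword_judge` helper is a hypothetical placeholder so the example runs offline. The node fields follow the example tree output shown later in this README.

```python
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]


def tree_search(nodes: List[Node], is_relevant: Callable[[Node], bool]) -> List[Node]:
    """Descend only into branches judged relevant; return the deepest relevant nodes."""
    hits: List[Node] = []
    for node in nodes:
        if not is_relevant(node):
            continue  # prune this node and its whole subtree
        deeper = tree_search(node.get("nodes", []), is_relevant)
        # Prefer more specific child sections; fall back to the node itself.
        hits.extend(deeper if deeper else [node])
    return hits


def keyword_judge(query: str) -> Callable[[Node], bool]:
    """Hypothetical stand-in for an LLM relevance call (an assumption, not a PageIndex API)."""
    terms = [term.lower() for term in query.split()]

    def judge(node: Node) -> bool:
        text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
        return any(term in text for term in terms)

    return judge


# Step 1 (already done here): a "table-of-contents" tree for the document.
toc_tree = [
    {
        "title": "Financial Stability",
        "node_id": "0006",
        "start_index": 21,
        "end_index": 22,
        "summary": "The Federal Reserve ...",
        "nodes": [
            {
                "title": "Monitoring Financial Vulnerabilities",
                "node_id": "0007",
                "start_index": 22,
                "end_index": 28,
                "summary": "The Federal Reserve's monitoring ...",
            },
            {
                "title": "Domestic and International Cooperation and Coordination",
                "node_id": "0008",
                "start_index": 28,
                "end_index": 31,
                "summary": "In 2023, the Federal Reserve collaborated ...",
            },
        ],
    }
]

# Step 2: search the tree, keeping only the branches judged relevant.
selected = tree_search(toc_tree, keyword_judge("financial vulnerabilities"))
for node in selected:
    print(node["node_id"], node["title"], node["start_index"], node["end_index"])
```

In a real pipeline the judge would be an LLM call that reasons over the query and the node summaries; the pruning structure stays the same.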

### 🧩 Features 

Compared to traditional *vector-based RAG*, **PageIndex** features:
- **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
- **No Chunking**: Documents are organized into natural sections, not artificial chunks.
- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search ("vibe retrieval").

PageIndex powers a reasoning-based RAG system that achieved [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating **state-of-the-art** performance in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).

### 📍 Explore PageIndex

See the detailed introduction to the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out our [GitHub repo](https://github.com/VectifyAI/PageIndex) for the open-source code, and the [cookbooks](https://docs.pageindex.ai/cookbook) and [tutorials](https://docs.pageindex.ai/tutorials) for additional usage guides and examples. The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), and can also be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).

### ⚙️ Deployment Options
- 🛠️ **Self-host**: run locally with this open-source repo.
- ☁️ **Cloud Service**: try instantly with our 🖥️ [Chat Platform](https://chat.pageindex.ai/), 🔌 [MCP](https://pageindex.ai/mcp) or 📚 [API](https://docs.pageindex.ai/quickstart).

### 🧪 Quick Hands-on

- Try the [_**Vectorless RAG Notebook**_](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
- Experiment with the [*Vision-based vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
  
<div align="center">
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vectorless RAG" />
  </a>
  &nbsp;&nbsp;
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vision RAG" />
  </a>
</div>

---

# 🌲 PageIndex Tree Structure
PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _"table of contents"_ but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).

```jsonc
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
```

 You can either generate the PageIndex tree structure with this open-source repo, or try our [API](https://docs.pageindex.ai/quickstart) service.
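
The nested `nodes` field in the example output above makes the tree easy to walk. The sketch below prints an indented outline with each node's index range. It assumes the tree has been loaded into Python dicts with the field names shown above and that `start_index` and `end_index` are page indices; the helper name `iter_nodes` is hypothetical and not part of the repo.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

Node = Dict[str, Any]


def iter_nodes(nodes: List[Node], depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    for node in nodes:
        yield depth, node
        yield from iter_nodes(node.get("nodes", []), depth + 1)


# The example node from above, as it would look after json.load().
example = json.loads("""
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
""")

outline = [(depth, node["node_id"]) for depth, node in iter_nodes([example])]

for depth, node in iter_nodes([example]):
    indent = "  " * depth
    print(f"{indent}[{node['node_id']}] {node['title']} "
          f"(indices {node['start_index']}-{node['end_index']})")
```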

---

# 📦 Package Usage

You can follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install --upgrade -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

<details>
<summary><strong>Optional parameters</strong></summary>
<br>
You can customize the processing with additional optional arguments:

```
--model                 OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages       Pages to check for table of contents (default: 20)
--max-pages-per-node    Max pages per node (default: 10)
--max-tokens-per-node   Max tokens per node (default: 20000)
--if-add-node-id        Add node ID (yes/no, default: yes)
--if-add-node-summary   Add node summary (yes/no, default: yes)
--if-add-doc-description Add doc description (yes/no, default: yes)
```
</details>
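
If you drive the script from Python (for example, to process several PDFs), the command line can be assembled from the flags listed above. This is only a sketch: the helper name `build_pageindex_command` is hypothetical, it uses only the flags documented in this README, and the example builds the command without running it.

```python
import shlex
from typing import Dict, List, Optional


def build_pageindex_command(
    pdf_path: str, options: Optional[Dict[str, object]] = None
) -> List[str]:
    """Build the argument list for run_pageindex.py from documented flags."""
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for flag, value in (options or {}).items():
        # Flag names are passed without the leading dashes, e.g. "max-pages-per-node".
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_pageindex_command(
    "/path/to/your/document.pdf",
    {"max-pages-per-node": 5, "if-add-node-summary": "no"},
)
print(shlex.join(cmd))

# To execute it from the repo root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```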

<details>
<summary><strong>Markdown support</strong></summary>
<br>
We also provide Markdown support for PageIndex. Use the `--md_path` flag to generate a tree structure for a Markdown file.

```bash
python3 run_pageindex.py --md_path /path/to/your/document.md
```

> Note: this function uses "#" markers to determine node headings and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your Markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a Markdown file, and then use this function.
</details>
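
To check how a Markdown file's headings will map to levels before running the script, you can count the "#" markers yourself. The sketch below illustrates the rule described in the note; it is not the repo's parser, and the `heading_outline` helper is hypothetical.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) for each '#' heading, skipping fenced code blocks."""
    outline: List[Tuple[int, str]] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # '#' inside code (e.g. shell comments) is not a heading
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


sample = """# Annual Report
## Financial Stability
### Monitoring Financial Vulnerabilities
## Supervision
"""

for level, title in heading_outline(sample):
    print("  " * (level - 1) + title)
```

If the printed outline skips levels or looks flat, the file's hierarchy was probably lost during conversion.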

---

<!-- # ☁️ Improved Tree Generation with PageIndex OCR

This repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse by classic Python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy.

To address this, we introduced PageIndex OCR — the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.

- Experience next-level OCR quality with PageIndex OCR at our [Dashboard](https://dash.pageindex.ai/).
- Integrate PageIndex OCR seamlessly into your stack via our [API](https://docs.pageindex.ai/quickstart).

<p align="center">
  <img src="https://github.com/user-attachments/assets/eb35d8ae-865c-4e60-a33b-ebbd00c41732" width="80%">
</p>

--- -->

# 📈 Case Study: PageIndex Leads Finance QA Benchmark

[Mafin 2.5](https://vectify.ai/mafin) is a reasoning-based RAG system for financial document analysis, powered by **PageIndex**. It achieved a state-of-the-art [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark — significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.

<div align="center">
  <a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench">
    <img src="https://github.com/user-attachments/assets/571aa074-d803-43c7-80c4-a04254b782a3" width="70%">
  </a>
</div>

---

# 🧭 Resources

* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.
* 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates.
* ⚙️ [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.

---

### ⭐ Support Us

Leave a star if you like our project. Thank you!  

<p>
  <img src="https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794" width="80%">
</p>

### Connect with Us

[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/VectifyAI)&nbsp;
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&nbsp;
[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&nbsp;
[![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge&logo=envelope&logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)

---

© 2025 [Vectify AI](https://vectify.ai)
-												Update README.md
											
										
										
											2025-04-30 20:51:59 +07:00
+								<div align="center">
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
 								<a href="https://vectify.ai/pageindex" target="_blank">
 								  <img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
 								</a>
 								<br/>
 								<br/>
 								<p align="center">
 								  <a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 								</p>
-												Update README.md
											
										
										
											2025-11-21 01:25:46 +08:00
+								<p align="center"><b>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
-												Update README.md
											
										
										
											2025-10-14 23:11:39 +08:00
+								<h4 align="center">
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								  <a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								  <a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
-												Update README.md
											
										
										
											2025-10-14 23:11:39 +08:00
+								  <a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								  <a href="https://docs.pageindex.ai/quickstart">📚 API Docs</a>&nbsp; • &nbsp;
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								  <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
 								  <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>&nbsp;
-												Update README.md
											
										
										
											2025-10-14 23:11:39 +08:00
+								</h4>
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
-												Update README.md
											
										
										
											2025-08-07 23:05:02 +01:00
+								</div>
-												Update README.md
											
										
										
											2025-06-17 12:25:25 +01:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								---
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Update README.md
											
										
										
											2025-11-13 03:01:09 +08:00
+								<details open>
-												Update README.md
											
										
										
											2025-11-13 03:08:26 +08:00
+								<summary><h2>📢 Recent Updates</h2></summary>
-												Update README.md
											
										
										
											2025-09-20 01:10:25 +08:00
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								 **🔥 New Releases:**
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent platform built for professional long documents. It can also be integrated via the [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).
-												Update README.md

											
										
										
											2025-11-21 01:30:22 +08:00
+								<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex’s advanced long-document intelligence directly into your applications and workflows. -->
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. -->
 								 **✍️ Articles:**
 								- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context* *tree index* that enables LLMs to perform *reasoning-based*, *human-like retrieval* over long documents, without vector DB or chunking.
 								<!-- - [Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr): Explores how vision-based, reasoning-native RAG challenges the traditional OCR pipeline, and why the future of document AI might be *vectorless* and *vision-based*. -->
-												Update README.md

											
										
										
											2025-11-05 23:07:32 +08:00
-												Update README.md
											
										
										
											2025-11-13 03:01:09 +08:00
+								 **🧪 Cookbooks:**
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex — no vectors, no chunking, and human-like retrieval.
 								- [Vision-based vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Experience OCR-free document understanding through PageIndex’s visual retrieval workflow that retrieves and reasons directly over PDF page images.
-												Update README.md
											
										
										
											2025-11-13 03:01:09 +08:00
+								</details>
-												Update README.md

											
										
										
											2025-11-05 23:07:32 +08:00
-												Update README.md
											
										
										
											2025-11-05 01:20:46 +08:00
+								# 📑 Introduction to PageIndex
-												Update README.md
											
										
										
											2025-08-20 01:52:48 +08:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
-												Update README.md
											
										
										
											2025-05-20 12:31:27 +01:00
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a **_vectorless_**, **reasoning-based RAG** system that builds a *hierarchical tree index* for long documents and *reasons* over that index for *retrieval*. It simulates how **human experts** navigate and extract knowledge from complex documents through **tree search**, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:
-												Update README.md
											
										
										
											2025-08-24 17:30:21 +08:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+. Generate a "Table-of-Contents" **tree structure index** of documents
 . Perform reasoning-based retrieval through **tree search**
-												Update README.md
											
										
										
											2025-08-24 17:30:21 +08:00
-												update README

											
										
										
											2025-08-25 17:50:53 +08:00
+								<div align="center">
-												Update README.md
											
										
										
											2025-11-05 01:20:46 +08:00
+								    <img src="https://docs.pageindex.ai/images/cookbook/vectorless-rag.png" width="70%">
-												update README

											
										
										
											2025-08-25 17:50:53 +08:00
+								</div>
-												Update README.md
											
										
										
											2025-08-24 17:30:21 +08:00
-												Update README.md
											
										
										
											2025-11-05 01:20:46 +08:00
+								### 🧩 Features
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
-												Update README.md
											
										
										
											2025-11-05 04:57:50 +08:00
+								Compared to traditional *vector-based RAG*, **PageIndex** features:
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								- **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
-												Update README

											
										
										
											2025-11-07 15:05:57 +08:00
+								- **No Chunking**: Documents are organized into natural sections, not artificial chunks.
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search ("vibe retrieval").
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Update README.md
											
										
										
											2025-11-07 03:06:26 +08:00
+								PageIndex powers a reasoning-based RAG system that achieved [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating **state-of-the-art** performance in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								### 📍 Explore PageIndex
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								Please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out our [GitHub repo](https://github.com/VectifyAI/PageIndex) for open-source code, and [cookbooks](https://docs.pageindex.ai/cookbook) and [tutorials](https://docs.pageindex.ai/tutorials) for additional usage guides and examples. The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or could be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
-												Update README.md
											
										
										
											2025-11-02 05:00:58 +08:00
+								### ⚙️ Deployment Options
-												Update README.md
											
										
										
											2025-11-05 04:57:50 +08:00
+								- 🛠️ Self-host — run locally with this open-source repo.
-												Update README.md
											
										
										
											2025-12-19 03:50:44 +08:00
+								- ☁️ **Cloud Service** — try instantly with our 🖥️ [Chat Platform](https://chat.pageindex.ai/), 🔌 [MCP](https://pageindex.ai/mcp) or 📚 [API](https://docs.pageindex.ai/quickstart).
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
-												Update README.md
											
										
										
											2025-11-05 01:02:46 +08:00
+								### 🧪 Quick Hands-on
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
-												Update README.md
											
										
										
											2025-12-19 05:06:46 +08:00
+								- Try the [_**Vectorless RAG Notebook**_](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
 								- Experiment with the [*Vision-based vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
-												Update README.md
											
										
										
											2025-11-05 01:20:46 +08:00
-												Update README.md
											
										
										
											2025-11-06 00:06:26 +08:00
+								<div align="center">
 								  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
 								    <img src="https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vectorless RAG" />
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								  </a>
-												Update README.md
											
										
										
											2025-11-06 00:06:26 +08:00
+								  &nbsp;&nbsp;
 								  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb" target="_blank" rel="noopener">
 								    <img src="https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vision RAG" />
 								  </a>
 								</div>
-												Update README.md
											
										
										
											2025-05-06 05:21:08 +08:00
-												update README

											
										
										
											2025-08-25 17:50:53 +08:00
+								---
-												Update README.md

											
										
										
											2025-04-12 01:27:24 +08:00
-												Update README.md
											
										
										
											2025-11-05 01:33:30 +08:00
+								# 🌲 PageIndex Tree Structure
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _"table of contents"_ but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Update README.md
											
										
										
											2025-11-07 03:06:26 +08:00
+								```jsonc
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								...
 								{
 								  "title": "Financial Stability",
 								  "node_id": "0006",
 								  "start_index": 21,
 								  "end_index": 22,
 								  "summary": "The Federal Reserve ...",
 								  "nodes": [
 								    {
 								      "title": "Monitoring Financial Vulnerabilities",
 								      "node_id": "0007",
 								      "start_index": 22,
 								      "end_index": 28,
 								      "summary": "The Federal Reserve's monitoring ..."
 								    },
 								    {
 								      "title": "Domestic and International Cooperation and Coordination",
 								      "node_id": "0008",
 								      "start_index": 28,
 								      "end_index": 31,
 								      "summary": "In 2023, the Federal Reserve collaborated ..."
 								    }
 								  ]
 								}
 								...
 								```
-												Update README.md

											
										
										
											2025-04-12 01:27:24 +08:00
-												Revise cloud service links and clean up README

Updated cloud service references and removed redundant sections.
											
										
										
											2025-11-19 23:55:24 +08:00
+								 You can either generate the PageIndex tree structure with this open-source repo, or try our [API](https://docs.pageindex.ai/quickstart) service.
-												update README

											
										
										
											2025-08-25 17:50:53 +08:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								---
-												Update README.md

											
										
										
											2025-04-12 01:27:24 +08:00
-												Update README.md
											
										
										
											2025-11-05 01:33:30 +08:00
+								# 📦 Package Usage
-												first commit

											
										
										
											2025-04-01 18:54:08 +08:00
-												Revise README for PageIndex branding and features
											
										
										
											2025-09-18 00:59:36 +01:00
+								You can follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install --upgrade -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```
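
Before running the tool, it can help to confirm that the key is actually present in the file. The following is a minimal standard-library sketch, assuming a plain `KEY=value` layout in `.env`; the helper names `read_env_file` and `has_api_key` are hypothetical, and this is not how the repo itself loads the key.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(path=".env", name="CHATGPT_API_KEY"):
    """Return True when the named key exists in the file and is non-empty."""
    return bool(read_env_file(path).get(name))


print("API key found:", has_api_key())
```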

### 3. Run PageIndex on your PDF

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

<details>
<summary><strong>Optional parameters</strong></summary>
<br>

You can customize the processing with additional optional arguments:

```
--model                  OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages        Pages to check for table of contents (default: 20)
--max-pages-per-node     Max pages per node (default: 10)
--max-tokens-per-node    Max tokens per node (default: 20000)
--if-add-node-id         Add node ID (yes/no, default: yes)
--if-add-node-summary    Add node summary (yes/no, default: yes)
--if-add-doc-description Add doc description (yes/no, default: yes)
```
</details>
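
If you call the script from your own Python code, the optional arguments can be assembled programmatically. The sketch below is only an illustration: the helper `build_pageindex_command` is hypothetical, and the flag spellings are copied from the list above, so check them against your copy of `run_pageindex.py` before relying on them.

```python
import shlex


def build_pageindex_command(pdf_path, **options):
    """Build an argument list for run_pageindex.py.

    Keyword names use underscores and are mapped to the hyphenated flags
    listed above, e.g. max_pages_per_node -> --max-pages-per-node.
    """
    command = ["python3", "run_pageindex.py", "--pdf_path", str(pdf_path)]
    for name, value in options.items():
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


command = build_pageindex_command(
    "/path/to/your/document.pdf",
    model="gpt-4o-2024-11-20",
    max_pages_per_node=5,
    if_add_node_summary="no",
)
# Print the shell-quoted command; pass the list to subprocess.run(command, check=True) to execute it.
print(shlex.join(command))
```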

<details>
<summary><strong>Markdown support</strong></summary>
<br>

PageIndex also supports Markdown. Use the `--md_path` flag to generate a tree structure for a Markdown file.

```bash
python3 run_pageindex.py --md_path /path/to/your/document.md
```

> Note: this mode uses "#" characters to determine node headings and their levels. For example, "##" is level 2, "###" is level 3, and so on. Make sure your Markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this mode, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a Markdown file, and then use this mode.
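
To make the heading rule in the note concrete, here is a small sketch of how "#" counts can be turned into a nested tree. It is an illustration under stated assumptions, not the repo's implementation: the function name `markdown_to_tree` and the output keys `title`, `level`, and `nodes` are chosen for this example only.

```python
import re

# One to six "#" characters, at least one space, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker for fenced code blocks, whose "#" lines are not headings


def markdown_to_tree(text):
    """Nest headings by level: "##" becomes a child of the nearest preceding "#"."""
    root = {"title": None, "level": 0, "nodes": []}
    stack = [root]
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level, "nodes": []}
        # Pop back to the closest ancestor with a strictly smaller level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]


doc = "# Intro\n## Background\n### Details\n## Scope\n# Methods\n"
for top in markdown_to_tree(doc):
    print(top["title"], [child["title"] for child in top["nodes"]])
```

A file whose heading levels were flattened during conversion (for example, every heading emitted as "#") would produce a flat list here, which is why the original hierarchy matters.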

</details>

---

<!-- # ☁️ Improved Tree Generation with PageIndex OCR

This repo is designed to generate PageIndex tree structures for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse with classic Python tools. Extracting high-quality text from PDF documents remains a non-trivial challenge: most OCR tools only extract page-level content, losing the broader document context and hierarchy.

To address this, we introduced PageIndex OCR, the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.

- Experience next-level OCR quality with PageIndex OCR at our [Dashboard](https://dash.pageindex.ai/).
- Integrate PageIndex OCR seamlessly into your stack via our [API](https://docs.pageindex.ai/quickstart).

<p align="center">
  <img src="https://github.com/user-attachments/assets/eb35d8ae-865c-4e60-a33b-ebbd00c41732" width="80%">
</p>

--- -->

# 📈 Case Study: PageIndex Leads Finance QA Benchmark

[Mafin 2.5](https://vectify.ai/mafin) is a reasoning-based RAG system for financial document analysis, powered by **PageIndex**. It achieved a state-of-the-art [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.

<div align="center">
  <a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench">
    <img src="https://github.com/user-attachments/assets/571aa074-d803-43c7-80c4-a04254b782a3" width="70%">
  </a>
</div>

---

# 🧭 Resources

* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.
* 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates.
* ⚙️ [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.

---

### ⭐ Support Us

Leave a star if you like our project. Thank you!

<p>
  <img src="https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794" width="80%">
</p>

### Connect with Us

[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/VectifyAI)&nbsp;
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&nbsp;
[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&nbsp;
[![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge&logo=envelope&logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)

---

© 2025 [Vectify AI](https://vectify.ai)