mirror of https://github.com/VectifyAI/PageIndex.git, synced 2026-04-24 23:56:21 +02:00
Update README.md
parent e59f04a6b3
commit b365d6dcd2
1 changed file with 26 additions and 23 deletions
49
README.md
|
|
@@ -24,12 +24,11 @@
|
|||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
<details open>
|
||||
<summary><h2>📢 Recent Updates</h2></summary>
|
||||
<summary><h2>📢 Latest Updates</h2></summary>
|
||||
|
||||
**🔥 New Releases:**
|
||||
**🔥 Releases:**
|
||||
- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent platform built for professional long documents. It can also be integrated via the [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).
|
||||
<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex’s advanced long-document intelligence directly into your applications and workflows. -->
|
||||
<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. -->
|
||||
|
|
@@ -40,16 +39,18 @@
|
|||
|
||||
**🧪 Cookbooks:**
|
||||
- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex — no vectors, no chunking, and human-like retrieval.
|
||||
- [Vision-based vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Experience OCR-free document understanding through PageIndex’s visual retrieval workflow that retrieves and reasons directly over PDF page images.
|
||||
- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Experience OCR-free document understanding through PageIndex’s visual retrieval workflow that retrieves and reasons directly over PDF page images.
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
# 📑 Introduction to PageIndex
|
||||
|
||||
Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
|
||||
|
||||
Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a **_vectorless_**, **reasoning-based RAG** system that builds a *hierarchical tree index* for long documents and *reasons* over that index for *retrieval*. It simulates how **human experts** navigate and extract knowledge from complex documents through **tree search**, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:
|
||||
Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a ***vectorless***, **reasoning-based RAG** system that builds a **hierarchical tree index** from long documents and uses LLMs to **reason over that index** for retrieval. It simulates how *human experts* navigate and extract knowledge from complex documents through *tree search*, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:
|
||||
|
||||
1. Generate a "Table-of-Contents" **tree structure index** of documents
|
||||
1. Generate a “Table-of-Contents” **tree structure index** of documents
|
||||
2. Perform reasoning-based retrieval through **tree search**
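The two steps listed above can be sketched roughly as follows. This is an illustration only, not the PageIndex implementation: `Node`, `tree_search`, and `keyword_selector` are hypothetical names, the node fields are an assumed schema, and the keyword-overlap selector is a toy stand-in for the LLM reasoning step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One document section in a table-of-contents style tree (assumed schema)."""
    title: str
    summary: str = ""
    start_page: int = 0
    end_page: int = 0
    children: List["Node"] = field(default_factory=list)


# A selector decides which child sections are worth descending into.
# In a reasoning-based system this would be an LLM call; here it is any callable.
Selector = Callable[[str, List[Node]], List[Node]]


def tree_search(root: Node, query: str, pick_children: Selector) -> List[Node]:
    """Walk the tree top-down, following only the children the selector keeps."""
    if not root.children:
        return [root]  # leaf section: treat it as a retrieval result
    hits: List[Node] = []
    for child in pick_children(query, root.children):
        hits.extend(tree_search(child, query, pick_children))
    return hits


def keyword_selector(query: str, candidates: List[Node]) -> List[Node]:
    """Toy stand-in for LLM reasoning: keep children sharing a word with the query."""
    words = set(query.lower().split())
    return [
        n for n in candidates
        if words & set((n.title + " " + n.summary).lower().split())
    ]


# Step 1: a hand-built tree index (normally generated from the document).
doc = Node("Annual Report", children=[
    Node("Financial Statements", "balance sheet and revenue tables", 10, 40, children=[
        Node("Revenue", "revenue by segment", 12, 18),
        Node("Balance Sheet", "assets and liabilities", 19, 25),
    ]),
    Node("Risk Factors", "market and regulatory risk", 41, 60),
])

# Step 2: retrieval through tree search, guided by the selector.
results = tree_search(doc, "revenue by segment", keyword_selector)
```

Because each result is a tree node, it carries its own title and page range, which is what makes the retrieval path traceable back to sections and pages.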
|
||||
|
||||
<div align="center">
|
||||
|
|
@@ -62,22 +63,22 @@ Compared to traditional *vector-based RAG*, **PageIndex** features:
|
|||
- **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
|
||||
- **No Chunking**: Documents are organized into natural sections, not artificial chunks.
|
||||
- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
|
||||
- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search ("vibe retrieval").
|
||||
- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).
|
||||
|
||||
PageIndex powers a reasoning-based RAG system that achieved [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating **state-of-the-art** performance in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).
|
||||
PageIndex powers a reasoning-based RAG system that achieved **state-of-the-art** [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating superior performance over vector RAG solutions in professional document analysis (details in our [blog post](https://vectify.ai/blog/Mafin2.5)).
|
||||
|
||||
### 📍 Explore PageIndex
|
||||
|
||||
Please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out our [GitHub repo](https://github.com/VectifyAI/PageIndex) for open-source code, and [cookbooks](https://docs.pageindex.ai/cookbook) and [tutorials](https://docs.pageindex.ai/tutorials) for additional usage guides and examples. The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or could be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
|
||||
Please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out this GitHub repo for open-source code, and [cookbooks](https://docs.pageindex.ai/cookbook) and [tutorials](https://docs.pageindex.ai/tutorials) for additional usage guides and examples. The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or could be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
|
||||
|
||||
### ⚙️ Deployment Options
|
||||
- 🛠️ Self-host — run locally with this open-source repo.
|
||||
- ☁️ **Cloud Service** — try instantly with our 🖥️ [Chat Platform](https://chat.pageindex.ai/), 🔌 [MCP](https://pageindex.ai/mcp) or 📚 [API](https://docs.pageindex.ai/quickstart).
|
||||
- Self-host — run locally with this open-source repo.
|
||||
- Cloud Service — try instantly with our [Chat Platform](https://chat.pageindex.ai/), or integrate with [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
|
||||
|
||||
### 🧪 Quick Hands-on
|
||||
|
||||
- Try the [_**Vectorless RAG Notebook**_](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
|
||||
- Experiment with the [*Vision-based vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
|
||||
- Try the [**Vectorless RAG**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) notebook — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
|
||||
- Experiment with [*Vision-based Vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
|
||||
|
||||
<div align="center">
|
||||
<a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
|
||||
|
|
@@ -94,7 +95,7 @@ Please see a detailed introduction of the [PageIndex framework](https://pageinde
|
|||
# 🌲 PageIndex Tree Structure
|
||||
PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _"table of contents"_ but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
|
||||
|
||||
Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).
|
||||
Below is an example PageIndex tree structure. Also see more example [documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and generated [tree structures](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).
|
||||
|
||||
```jsonc
|
||||
...
|
||||
|
|
@@ -124,7 +125,7 @@ Here is an example output. See more [example documents](https://github.com/Vecti
|
|||
...
|
||||
```
|
||||
|
||||
You can either generate the PageIndex tree structure with this open-source repo, or try our [API](https://docs.pageindex.ai/quickstart) service.
|
||||
You can either generate the PageIndex tree structure with this open-source repo, or try our [API](https://docs.pageindex.ai/quickstart) service.
|
||||
|
||||
---
|
||||
|
||||
|
|
@@ -180,9 +181,8 @@ python3 run_pageindex.py --md_path /path/to/your/document.md
|
|||
> Note: in this function, we use "#" to determine node heading and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don’t recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this function.
|
||||
</details>
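The heading rule in the note above (the number of `#` characters sets the node level) can be illustrated with a small sketch. It is not the code behind `run_pageindex.py`: `build_heading_tree` and the node dictionary layout are hypothetical, and the sketch does not skip fenced code blocks, so a `#` comment inside a fence would be misread as a heading.

```python
import re
from typing import Dict, List

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_heading_tree(markdown: str) -> List[Dict]:
    """Nest headings by their number of leading '#' characters."""
    roots: List[Dict] = []
    stack: List[Dict] = []  # currently open headings, shallowest first
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue  # body text is ignored in this sketch
        node = {"level": len(match.group(1)), "title": match.group(2), "nodes": []}
        # Close every open heading at the same level or deeper.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


sample = """# Report
## Methods
### Data
## Results
"""
tree = build_heading_tree(sample)
```

This also shows why a flattened conversion is a problem: if every heading comes out as `#`, all nodes end up as siblings at the root and the hierarchy is lost.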
|
||||
|
||||
---
|
||||
|
||||
<!-- # ☁️ Improved Tree Generation with PageIndex OCR
|
||||
<!--
|
||||
# ☁️ Improved Tree Generation with PageIndex OCR
|
||||
|
||||
This repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse by classic Python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy.
|
||||
|
||||
|
|
@@ -195,15 +195,18 @@ To address this, we introduced PageIndex OCR — the first long-context OCR mode
|
|||
<img src="https://github.com/user-attachments/assets/eb35d8ae-865c-4e60-a33b-ebbd00c41732" width="80%">
|
||||
</p>
|
||||
|
||||
--- -->
|
||||
---
|
||||
-->
|
||||
|
||||
---
|
||||
|
||||
# 📈 Case Study: PageIndex Leads Finance QA Benchmark
|
||||
|
||||
[Mafin 2.5](https://vectify.ai/mafin) is a reasoning-based RAG system for financial document analysis, powered by **PageIndex**. It achieved a state-of-the-art [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark — significantly outperforming traditional vector-based RAG systems.
|
||||
|
||||
PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.
|
||||
PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.
|
||||
|
||||
👉 Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.
|
||||
Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.
|
||||
|
||||
<div align="center">
|
||||
<a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench">
|
||||
|
|
@@ -215,14 +218,14 @@ PageIndex's hierarchical indexing enabled precise navigation and extraction of r
|
|||
|
||||
# 🧭 Resources
|
||||
|
||||
* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
|
||||
* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.
|
||||
* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
|
||||
* 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates
|
||||
* ⚙️ [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.
|
||||
|
||||
---
|
||||
|
||||
### ⭐ Support Us
|
||||
# ⭐ Support Us
|
||||
|
||||
Leave a star if you like our project. Thank you!
|
||||
|
||||
|
|
|
|||