diff --git a/README.md b/README.md
index e3dc0d0..b821caf 100644
--- a/README.md
+++ b/README.md
@@ -5,19 +5,15 @@
-
-
-
 # 📄 PageIndex
-
-
 Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
 🧠 **Reasoning-based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by AlphaGo, we use *tree search* to perform structured document retrieval.
 **[PageIndex](https://vectify.ai/pageindex)** is a *document indexing system* that builds *search tree structures* from long documents, making them ready for reasoning-based RAG. It has been used to develop a RAG system that achieved 98.7% accuracy on [FinanceBench](https://vectify.ai/blog/Mafin2.5), demonstrating state-of-the-art performance in document analysis.
+Self-host it with this open-source repo, or try our ☁️ [Cloud service](https://dash.pageindex.ai/) - no setup required.
 ### PageIndex OCR (Updates On 2025/08/07)
 This repo is designed for generating PageIndex tree structure with text input, but many real-world use cases involve PDFs that require OCR to convert them into Markdown. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy.
@@ -29,7 +25,6 @@ To address this, we introduced PageIndex OCR — the first OCR system designed t
 image
-
 ---
 # **⭐ What is PageIndex**