PageIndex/cookbook/pageindex_RAG_simple.ipynb
2025-08-26 21:09:15 +08:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "TCh9BTedHJK1"
},
"source": [
"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nD0hb4TFHWTt"
},
"source": [
" <p align=\"center\"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p>\n",
"\n",
"<p align=\"center\">\n",
" <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
" <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
" <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
" <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
" <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
" <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
"</p>\n",
"\n",
"<p align=\"center\">\n",
" <a href=\"https://github.com/VectifyAI/PageIndex/stargazers\">\n",
" <img src=\"https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐%20Star%20Us\" alt=\"Star us on GitHub\" />\n",
" </a>\n",
"</p>\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ebvn5qfpcG1K"
},
"source": [
"# Simple Vectorless RAG with PageIndex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PageIndex Introduction\n",
"PageIndex is a new **reasoning-based**, **vectorless RAG** framework that performs retrieval in two steps: \n",
"1. Generate a tree structure index of documents \n",
"2. Perform reasoning-based retrieval through tree search \n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
"</div>\n",
"\n",
"Compared to traditional vector-based RAG, PageIndex features:\n",
"- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n",
"- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n",
"- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \n",
"- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search (\"vibe retrieval\")."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📝 Notebook Overview\n",
"\n",
"This notebook demonstrates a simple example of **vectorless RAG** with PageIndex. You will learn how to:\n",
"- [x] Build a PageIndex tree structure of a document\n",
"- [x] Perform reasoning-based retrieval with tree search\n",
"- [x] Generate answers based on the retrieved context\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7ziuTbbWcG1L"
},
"source": [
"## Step 0: Preparation\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "edTfrizMFK4c"
},
"source": [
"#### 0.1 Install PageIndex"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "LaoB58wQFNDh"
},
"outputs": [],
"source": [
"%pip install -q --upgrade pageindex"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WVEWzPKGcG1M"
},
"source": [
"#### 0.2 Set Up PageIndex"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "StvqfcK4cG1M"
},
"outputs": [],
"source": [
"from pageindex import PageIndexClient\n",
"import pageindex.utils as utils\n",
"\n",
"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
"PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n",
"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 0.3 Set Up the LLM\n",
"\n",
"Choose your preferred LLM for reasoning-based retrieval. In this example, we use OpenAI's GPT-4.1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n",
"\n",
"async def call_llm(prompt, model=\"gpt-4.1\", temperature=0):\n",
"    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n",
"    response = await client.chat.completions.create(\n",
"        model=model,\n",
"        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
"        temperature=temperature\n",
"    )\n",
"    return response.choices[0].message.content.strip()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "heGtIMOVcG1N"
},
"source": [
"## Step 1: PageIndex Tree Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mzd1VWjwMUJL"
},
"source": [
"#### 1.1 Submit a document to generate a PageIndex tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f6--eZPLcG1N",
"outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded https://arxiv.org/pdf/2501.12948.pdf\n",
"Document Submitted: pi-cmeseq08w00vt0bo3u6tr244g\n"
]
}
],
"source": [
"import os, requests\n",
"\n",
"# You can also use our GitHub repo to generate a PageIndex tree:\n",
"# https://github.com/VectifyAI/PageIndex\n",
"\n",
"pdf_url = \"https://arxiv.org/pdf/2501.12948.pdf\"\n",
"pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
"\n",
"response = requests.get(pdf_url, timeout=60)\n",
"response.raise_for_status()  # fail early instead of saving an error page as a PDF\n",
"with open(pdf_path, \"wb\") as f:\n",
"    f.write(response.content)\n",
"print(f\"Downloaded {pdf_url}\")\n",
"\n",
"doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
"print('Document Submitted:', doc_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4-Hrh0azcG1N"
},
"source": [
"#### 1.2 Get the generated PageIndex tree structure"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "b1Q1g6vrcG1O",
"outputId": "dc944660-38ad-47ea-d358-be422edbae53"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simplified Tree Structure of the Document:\n",
"[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',\n",
" 'node_id': '0000',\n",
" 'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',\n",
" 'nodes': [{'title': 'Abstract',\n",
" 'node_id': '0001',\n",
" 'summary': 'The partial document introduces two reas...'},\n",
" {'title': 'Contents',\n",
" 'node_id': '0002',\n",
" 'summary': 'This partial document provides a detaile...'},\n",
" {'title': '1. Introduction',\n",
" 'node_id': '0003',\n",
" 'prefix_summary': 'The partial document introduces recent a...',\n",
" 'nodes': [{'title': '1.1. Contributions',\n",
" 'node_id': '0004',\n",
" 'summary': 'This partial document outlines the main ...'},\n",
" {'title': '1.2. Summary of Evaluation Results',\n",
" 'node_id': '0005',\n",
" 'summary': 'The partial document provides a summary ...'}]},\n",
" {'title': '2. Approach',\n",
" 'node_id': '0006',\n",
" 'prefix_summary': '## 2. Approach\\n',\n",
" 'nodes': [{'title': '2.1. Overview',\n",
" 'node_id': '0007',\n",
" 'summary': '### 2.1. Overview\\n\\nPrevious work has hea...'},\n",
" {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Lea...',\n",
" 'node_id': '0008',\n",
" 'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement...',\n",
" 'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\n",
" 'node_id': '0009',\n",
" 'summary': 'The partial document describes the Group...'},\n",
" {'title': '2.2.2. Reward Modeling',\n",
" 'node_id': '0010',\n",
" 'summary': 'This partial document discusses the rewa...'},\n",
" {'title': '2.2.3. Training Template',\n",
" 'node_id': '0011',\n",
" 'summary': '#### 2.2.3. Training Template\\n\\nTo train ...'},\n",
" {'title': '2.2.4. Performance, Self-evolution Proce...',\n",
" 'node_id': '0012',\n",
" 'summary': 'This partial document discusses the perf...'}]},\n",
" {'title': '2.3. DeepSeek-R1: Reinforcement Learning...',\n",
" 'node_id': '0013',\n",
" 'summary': 'This partial document describes the trai...'},\n",
" {'title': '2.4. Distillation: Empower Small Models ...',\n",
" 'node_id': '0014',\n",
" 'summary': 'This partial document discusses the proc...'}]},\n",
" {'title': '3. Experiment',\n",
" 'node_id': '0015',\n",
" 'prefix_summary': 'The partial document describes the exper...',\n",
" 'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\n",
" 'node_id': '0016',\n",
" 'summary': 'This partial document presents a compreh...'},\n",
" {'title': '3.2. Distilled Model Evaluation',\n",
" 'node_id': '0017',\n",
" 'summary': 'This partial document presents an evalua...'}]},\n",
" {'title': '4. Discussion',\n",
" 'node_id': '0018',\n",
" 'summary': 'This partial document discusses the comp...'},\n",
" {'title': '5. Conclusion, Limitations, and Future W...',\n",
" 'node_id': '0019',\n",
" 'summary': 'This partial document presents the concl...'},\n",
" {'title': 'References',\n",
" 'node_id': '0020',\n",
" 'summary': 'This partial document consists of the re...'},\n",
" {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\n'},\n",
" {'title': 'A. Contributions and Acknowledgments',\n",
" 'node_id': '0022',\n",
" 'summary': 'This partial document section details th...'}]}]\n"
]
}
],
"source": [
"if pi_client.is_retrieval_ready(doc_id):\n",
"    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
"    print('Simplified Tree Structure of the Document:')\n",
"    utils.print_tree(tree)\n",
"else:\n",
"    print(\"Processing document, please try again later...\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "USoCLOiQcG1O"
},
"source": [
"## Step 2: Reasoning-Based Retrieval with Tree Search"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.1 Use an LLM for tree search to identify nodes that might contain relevant context"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "LLHNJAtTcG1O"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"query = \"What are the conclusions in this document?\"\n",
"\n",
"tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])\n",
"\n",
"search_prompt = f\"\"\"\n",
"You are given a question and a tree structure of a document.\n",
"Each node contains a node id, node title, and a corresponding summary.\n",
"Your task is to find all nodes that are likely to contain the answer to the question.\n",
"\n",
"Question: {query}\n",
"\n",
"Document tree structure:\n",
"{json.dumps(tree_without_text, indent=2)}\n",
"\n",
"Please reply in the following JSON format:\n",
"{{\n",
" \"thinking\": \"<Your thinking process on which nodes are relevant to the question>\",\n",
" \"node_list\": [\"node_id_1\", \"node_id_2\", ..., \"node_id_n\"]\n",
"}}\n",
"Directly return the final JSON structure. Do not output anything else.\n",
"\"\"\"\n",
"\n",
"tree_search_result = await call_llm(search_prompt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2 Print retrieved nodes and reasoning process"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "P8DVUOuAen5u",
"outputId": "6bb6d052-ef30-4716-f88e-be98bcb7ebdb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reasoning Process:\n",
"The question asks for the conclusions in the document. Typically, conclusions are found in sections\n",
"explicitly titled 'Conclusion' or in sections summarizing the findings and implications of the work.\n",
"In this document tree, node 0019 ('5. Conclusion, Limitations, and Future Work') is the most\n",
"directly relevant, as it is dedicated to the conclusion and related topics. Additionally, the\n",
"'Abstract' (node 0001) may contain a high-level summary that sometimes includes concluding remarks,\n",
"but it is less likely to contain the full conclusions. Other sections like 'Discussion' (node 0018)\n",
"may discuss implications but are not explicitly conclusions. Therefore, the primary node is 0019.\n",
"\n",
"Retrieved Nodes:\n",
"Node ID: 0019\t Page: 16\t Title: 5. Conclusion, Limitations, and Future Work\n"
]
}
],
"source": [
"node_map = utils.create_node_mapping(tree)\n",
"tree_search_result_json = json.loads(tree_search_result)\n",
"\n",
"print('Reasoning Process:')\n",
"utils.print_wrapped(tree_search_result_json['thinking'])\n",
"\n",
"print('\\nRetrieved Nodes:')\n",
"for node_id in tree_search_result_json[\"node_list\"]:\n",
"    node = node_map[node_id]\n",
"    print(f\"Node ID: {node['node_id']}\\t Page: {node['page_index']}\\t Title: {node['title']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "10wOZDG_cG1O"
},
"source": [
"## Step 3: Answer Generation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.1 Extract relevant context from retrieved nodes"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 279
},
"id": "a7UCBnXlcG1O",
"outputId": "8a026ea3-4ef3-473a-a57b-b4565409749e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retrieved Context:\n",
"\n",
"## 5. Conclusion, Limitations, and Future Work\n",
"\n",
"In this work, we share our journey in enhancing model reasoning abilities through reinforcement\n",
"learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data,\n",
"achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-\n",
"start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance\n",
"comparable to OpenAI-o1-1217 on a range of tasks.\n",
"\n",
"We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1\n",
"as the teacher model to generate 800K training samples, and fine-tune several small dense models.\n",
"The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on\n",
"math benchmarks with $28.9 \\%$ on AIME and $83.9 \\%$ on MATH. Other dense models also achieve\n",
"impressive results, significantly outperforming other instructiontuned models based on the same\n",
"underlying checkpoints.\n",
"\n",
"In the fut...\n"
]
}
],
"source": [
"node_list = json.loads(tree_search_result)[\"node_list\"]\n",
"relevant_content = \"\\n\\n\".join(node_map[node_id][\"text\"] for node_id in node_list)\n",
"\n",
"print('Retrieved Context:\\n')\n",
"utils.print_wrapped(relevant_content[:1000] + '...')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.2 Generate answer based on retrieved context"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 210
},
"id": "tcp_PhHzcG1O",
"outputId": "187ff116-9bb0-4ab4-bacb-13944460b5ff"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Generated Answer:\n",
"\n",
"The conclusions in this document are:\n",
"\n",
"- DeepSeek-R1-Zero, a pure reinforcement learning (RL) approach without cold-start data, achieves\n",
"strong performance across various tasks.\n",
"- DeepSeek-R1, which combines cold-start data with iterative RL fine-tuning, is more powerful and\n",
"achieves performance comparable to OpenAI-o1-1217 on a range of tasks.\n",
"- Distilling DeepSeek-R1's reasoning capabilities into smaller dense models is promising; for\n",
"example, DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks,\n",
"and other dense models also show significant improvements over similar instruction-tuned models.\n",
"\n",
"These results demonstrate the effectiveness of the RL-based approach and the potential for\n",
"distilling reasoning abilities into smaller models.\n"
]
}
],
"source": [
"answer_prompt = f\"\"\"\n",
"Answer the question based on the context:\n",
"\n",
"Question: {query}\n",
"Context: {relevant_content}\n",
"\n",
"Provide a clear, concise answer based only on the context provided.\n",
"\"\"\"\n",
"\n",
"print('Generated Answer:\\n')\n",
"answer = await call_llm(answer_prompt)\n",
"utils.print_wrapped(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_1kaGD3GcG1O"
},
"source": [
"---\n",
"\n",
"## 🎯 What's Next\n",
"\n",
"This notebook demonstrated a minimal example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
"> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\n",
"\n",
"While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. In upcoming tutorials, we will introduce:\n",
"* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\n",
"* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
"* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
"* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🔎 Learn More About PageIndex\n",
" <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
" <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
" <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
" <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
" <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
" <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>\n",
"\n",
"<br>\n",
"\n",
"© 2025 [Vectify AI](https://vectify.ai)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}