
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "TCh9BTedHJK1"
},
"source": [
"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nD0hb4TFHWTt"
},
"source": [
" <p align=\"center\"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p>\n",
"\n",
"<p align=\"center\">\n",
" <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
" <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
" <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
" <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
" <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
" <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ebvn5qfpcG1K"
},
"source": [
"# 🧠 Simple Vectorless RAG with PageIndex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PageIndex generates a searchable tree structure of documents, enabling reasoning-based retrieval through tree search — without vectors.\n",
"\n",
"- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n",
"- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n",
"- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \n",
"- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search (\"vibe retrieval\").\n",
"\n",
"## 📝 About this Notebook\n",
"\n",
"This notebook demonstrates a simple example of **vectorless RAG** with PageIndex. You will learn:\n",
"- [x] Build a PageIndex tree structure of a document.\n",
"- [x] Perform reasoning-based retrieval with tree search.\n",
"- [x] Generate answers based on the retrieved context."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7ziuTbbWcG1L"
},
"source": [
"## Preparation\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "edTfrizMFK4c"
},
"source": [
"### Install dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "LaoB58wQFNDh"
},
"outputs": [],
"source": [
"%pip install -q --upgrade pageindex openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WVEWzPKGcG1M"
},
"source": [
"### Setup environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "StvqfcK4cG1M"
},
"outputs": [],
"source": [
"import os, json, openai, requests, textwrap\n",
"from pageindex import PageIndexClient\n",
"from pprint import pprint\n",
"\n",
"PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\" # Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
"OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n",
"\n",
"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AR7PLeVbcG1N"
},
"source": [
"### Define utility functions"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "hmj3POkDcG1N"
},
"outputs": [],
"source": [
"async def call_llm(prompt, model=\"gpt-4.1\", temperature=0):\n",
" client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n",
" response = await client.chat.completions.create(\n",
" model=model,\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=temperature\n",
" )\n",
" return response.choices[0].message.content.strip()\n",
"\n",
"def remove_fields(data, fields=['text'], max_len=50):\n",
" if isinstance(data, dict):\n",
" return {k: remove_fields(v, fields)\n",
" for k, v in data.items() if k not in fields}\n",
" elif isinstance(data, list):\n",
" return [remove_fields(item, fields) for item in data]\n",
" elif isinstance(data, str):\n",
" return (data[:max_len] + '...') if len(data) > max_len else data\n",
" return data\n",
"\n",
"def print_tree(tree, exclude_fields=['text', 'page_index']):\n",
" cleaned_tree = remove_fields(tree.copy(), exclude_fields)\n",
" pprint(cleaned_tree, sort_dicts=False, width=100)\n",
"\n",
"def show(text, width=100):\n",
" print(textwrap.fill(text, width=width))\n",
"\n",
"def create_node_mapping(tree):\n",
" \"\"\"Create a mapping of node_id to node for quick lookup\"\"\"\n",
" def get_all_nodes(tree):\n",
" if isinstance(tree, dict):\n",
" return [tree] + [node for child in tree.get('nodes', []) for node in get_all_nodes(child)]\n",
" elif isinstance(tree, list):\n",
" return [node for item in tree for node in get_all_nodes(item)]\n",
" return []\n",
" return {node[\"node_id\"]: node for node in get_all_nodes(tree) if node.get(\"node_id\")}"
]
},
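{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, `create_node_mapping` flattens a nested tree into a `node_id`-to-node dictionary. A minimal illustration on a toy tree (not part of the original workflow):\n",
"\n",
"```python\n",
"toy_tree = {\"node_id\": \"0000\", \"title\": \"Root\",\n",
"            \"nodes\": [{\"node_id\": \"0001\", \"title\": \"Child\"}]}\n",
"print(sorted(create_node_mapping(toy_tree)))  # ['0000', '0001']\n",
"```"
]
},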
{
"cell_type": "markdown",
"metadata": {
"id": "heGtIMOVcG1N"
},
"source": [
"## Step 1: PageIndex Tree Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mzd1VWjwMUJL"
},
"source": [
"### Submit a document with PageIndex SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f6--eZPLcG1N",
"outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded https://arxiv.org/pdf/2501.12948.pdf\n",
"Document Submitted: pi-cmek7luf400960ao3o0o8us4d\n"
]
}
],
"source": [
"# You can also use our GitHub repo to generate PageIndex structure\n",
"# https://github.com/VectifyAI/PageIndex\n",
"\n",
"pdf_url = \"https://arxiv.org/pdf/2501.12948.pdf\"\n",
"pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
"\n",
"response = requests.get(pdf_url)\n",
"with open(pdf_path, \"wb\") as f:\n",
" f.write(response.content)\n",
"print(f\"Downloaded {pdf_url}\")\n",
"\n",
"doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
"print('Document Submitted:', doc_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4-Hrh0azcG1N"
},
"source": [
"### Get the generated PageIndex tree structure"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "b1Q1g6vrcG1O",
"outputId": "dc944660-38ad-47ea-d358-be422edbae53"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simplified Tree Structure of the Document:\n",
"[{'title': 'DeepSeek-R1: Incentivizing Reasoning Capability in...',\n",
" 'node_id': '0000',\n",
" 'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning Capability ...',\n",
" 'nodes': [{'title': 'Abstract',\n",
" 'node_id': '0001',\n",
" 'summary': 'The partial document introduces two reasoning mode...'},\n",
" {'title': 'Contents',\n",
" 'node_id': '0002',\n",
" 'summary': 'This partial document provides a detailed table of...'},\n",
" {'title': '1. Introduction',\n",
" 'node_id': '0003',\n",
" 'prefix_summary': 'The partial document introduces recent advancement...',\n",
" 'nodes': [{'title': '1.1. Contributions',\n",
" 'node_id': '0004',\n",
" 'summary': 'This partial document outlines the main contributi...'},\n",
" {'title': '1.2. Summary of Evaluation Results',\n",
" 'node_id': '0005',\n",
" 'summary': 'The partial document provides a summary of evaluat...'}]},\n",
" {'title': '2. Approach',\n",
" 'node_id': '0006',\n",
" 'prefix_summary': '## 2. Approach\\n',\n",
" 'nodes': [{'title': '2.1. Overview',\n",
" 'node_id': '0007',\n",
" 'summary': '### 2.1. Overview\\n\\nPrevious work has heavily relie...'},\n",
" {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Learning on t...',\n",
" 'node_id': '0008',\n",
" 'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement Learning ...',\n",
" 'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\n",
" 'node_id': '0009',\n",
" 'summary': 'This partial document describes the Group '\n",
" 'Relative...'},\n",
" {'title': '2.2.2. Reward Modeling',\n",
" 'node_id': '0010',\n",
" 'summary': 'This partial document discusses the reward '\n",
" 'modelin...'},\n",
" {'title': '2.2.3. Training Template',\n",
" 'node_id': '0011',\n",
" 'summary': '#### 2.2.3. Training Template\\n'\n",
" '\\n'\n",
" 'To train DeepSeek-R...'},\n",
" {'title': '2.2.4. Performance, Self-evolution Process and Aha...',\n",
" 'node_id': '0012',\n",
" 'summary': 'This partial document discusses the performance, '\n",
" 's...'}]},\n",
" {'title': '2.3. DeepSeek-R1: Reinforcement Learning with Cold...',\n",
" 'node_id': '0013',\n",
" 'summary': 'This partial document describes the training pipel...'},\n",
" {'title': '2.4. Distillation: Empower Small Models with Reaso...',\n",
" 'node_id': '0014',\n",
" 'summary': 'This partial document discusses the process of dis...'}]},\n",
" {'title': '3. Experiment',\n",
" 'node_id': '0015',\n",
" 'prefix_summary': 'The partial document describes the experimental se...',\n",
" 'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\n",
" 'node_id': '0016',\n",
" 'summary': 'This partial document presents a comprehensive eva...'},\n",
" {'title': '3.2. Distilled Model Evaluation',\n",
" 'node_id': '0017',\n",
" 'summary': 'This partial document presents an evaluation of va...'}]},\n",
" {'title': '4. Discussion',\n",
" 'node_id': '0018',\n",
" 'summary': 'This partial document discusses the comparative ef...'},\n",
" {'title': '5. Conclusion, Limitations, and Future Work',\n",
" 'node_id': '0019',\n",
" 'summary': 'This partial document presents the conclusion, lim...'},\n",
" {'title': 'References',\n",
" 'node_id': '0020',\n",
" 'summary': 'The partial document consists of a comprehensive r...'},\n",
" {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\n'},\n",
" {'title': 'A. Contributions and Acknowledgments',\n",
" 'node_id': '0022',\n",
" 'summary': 'This partial document section details the contribu...'}]}]\n"
]
}
],
"source": [
"if pi_client.is_retrieval_ready(doc_id):\n",
" tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
" print('Simplified Tree Structure of the Document:')\n",
" print_tree(tree)\n",
"else:\n",
" print(\"Processing document, please try again later...\")"
]
},
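{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of re-running the cell above by hand, you can poll until processing finishes. A minimal sketch (the 5-second interval is arbitrary):\n",
"\n",
"```python\n",
"import time\n",
"\n",
"while not pi_client.is_retrieval_ready(doc_id):\n",
"    time.sleep(5)  # wait before checking again\n",
"tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
"```"
]
},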
{
"cell_type": "markdown",
"metadata": {
"id": "USoCLOiQcG1O"
},
"source": [
"## Step 2: Reasoning-Based Retrieval with Tree Search"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use LLM for tree search and identify nodes that might contain relevant context"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "LLHNJAtTcG1O"
},
"outputs": [],
"source": [
"query = \"What are the conclusions in this document?\"\n",
"\n",
"tree_without_text = remove_fields(tree.copy(), fields=['text'])\n",
"\n",
"search_prompt = f\"\"\"\n",
"You are given a question and a tree structure of a document.\n",
"Each node contains a node id, node title, and a corresponding summary.\n",
"Your task is to find all nodes that are likely to contain the answer to the question.\n",
"\n",
"Question: {query}\n",
"\n",
"Document tree structure:\n",
"{json.dumps(tree_without_text, indent=2)}\n",
"\n",
"Please reply in the following JSON format:\n",
"{{\n",
" \"thinking\": \"<Your thinking process on which nodes are relevant to the question>\",\n",
" \"node_list\": [\"node_id_1\", \"node_id_2\", ..., \"node_id_n\"]\n",
"}}\n",
"Directly return the final JSON structure. Do not output anything else.\n",
"\"\"\"\n",
"\n",
"tree_search_result = await call_llm(search_prompt)"
]
},
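{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `json.loads` in the next cell assumes the model returns bare JSON. If your model wraps its reply in a markdown code fence, a small defensive parser helps. A sketch (`parse_json_response` is illustrative, not part of the PageIndex SDK):\n",
"\n",
"```python\n",
"def parse_json_response(text):\n",
"    # Strip an optional markdown code fence before parsing\n",
"    text = text.strip()\n",
"    if text.startswith(\"```\"):\n",
"        text = text.split(\"\\n\", 1)[1].rsplit(\"```\", 1)[0]\n",
"    return json.loads(text)\n",
"```"
]
},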
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print retrieved nodes and reasoning process"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "P8DVUOuAen5u",
"outputId": "6bb6d052-ef30-4716-f88e-be98bcb7ebdb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reasoning Process:\n",
"The question asks for the conclusions in the document. Typically, conclusions are found in sections\n",
"explicitly titled 'Conclusion' or in sections summarizing the findings and implications of the work.\n",
"In this document tree, node 0019 ('5. Conclusion, Limitations, and Future Work') is the most\n",
"directly relevant, as it is dedicated to the conclusion and related topics. Additionally, the\n",
"'Abstract' (node 0001) may contain a high-level summary that sometimes includes concluding remarks,\n",
"but it is less likely to contain the full conclusions. Other sections like 'Discussion' (node 0018)\n",
"may discuss implications but are not explicitly conclusions. Therefore, the primary node is 0019.\n",
"\n",
"Retrieved Nodes:\n",
"Node ID: 0019\t Page: 16\t Title: 5. Conclusion, Limitations, and Future Work\n"
]
}
],
"source": [
"node_map = create_node_mapping(tree)\n",
"tree_search_result_json = json.loads(tree_search_result)\n",
"\n",
"print('Reasoning Process:')\n",
"show(tree_search_result_json['thinking'])\n",
"\n",
"print('\\nRetrieved Nodes:')\n",
"for node_id in tree_search_result_json[\"node_list\"]:\n",
" node = node_map[node_id]\n",
" print(f\"Node ID: {node['node_id']}\\t Page: {node['page_index']}\\t Title: {node['title']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "10wOZDG_cG1O"
},
"source": [
"## Step 3: Answer Generation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract relevant context from retrieved nodes"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 279
},
"id": "a7UCBnXlcG1O",
"outputId": "8a026ea3-4ef3-473a-a57b-b4565409749e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retrieved Context:\n",
"## 5. Conclusion, Limitations, and Future Work In this work, we share our journey in enhancing\n",
"model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL\n",
"approach without relying on cold-start data, achieving strong performance across various tasks.\n",
"DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning.\n",
"Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks. We\n",
"further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1 as\n",
"the teacher model to generate 800K training samples, and fine-tune several small dense models. The\n",
"results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on\n",
"math benchmarks with $28.9 \\%$ on AIME and $83.9 \\%$ on MATH. Other dense models also achieve\n",
"impressive results, significantly outperforming other instructiontuned models based on the same\n",
"underlying checkpoints. In the fut...\n"
]
}
],
"source": [
"node_list = json.loads(tree_search_result)[\"node_list\"]\n",
"relevant_content = \"\\n\\n\".join(node_map[node_id][\"text\"] for node_id in node_list)\n",
"\n",
"print('Retrieved Context:')\n",
"show(relevant_content[:1000] + '...')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate answer based on retrieved context"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 210
},
"id": "tcp_PhHzcG1O",
"outputId": "187ff116-9bb0-4ab4-bacb-13944460b5ff"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Generated Answer:\n",
"**Conclusions in this document:** - DeepSeek-R1-Zero, a pure reinforcement learning (RL) model\n",
"without cold-start data, achieves strong performance across various tasks. - DeepSeek-R1, which\n",
"combines cold-start data with iterative RL fine-tuning, is even more powerful and achieves\n",
"performance comparable to OpenAI-o1-1217 on a range of tasks. - Distilling DeepSeek-R1s reasoning\n",
"capabilities into smaller dense models is effective: DeepSeek-R1-Distill-Qwen-1.5B outperforms\n",
"GPT-4o and Claude-3.5-Sonnet on math benchmarks, and other dense models also show significant\n",
"improvements over similar instruction-tuned models. - Overall, the approaches described demonstrate\n",
"promising results in enhancing model reasoning abilities through RL and distillation.\n"
]
}
],
"source": [
"answer_prompt = f\"\"\"\n",
"Answer the question based on the context:\n",
"\n",
"Question: {query}\n",
"Context: {relevant_content}\n",
"\n",
"Provide a clear, concise answer based only on the context provided.\n",
"\"\"\"\n",
"\n",
"print('Generated Answer:')\n",
"answer = await call_llm(answer_prompt)\n",
"show(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_1kaGD3GcG1O"
},
"source": [
"# 🎯 What's Next\n",
"\n",
"This notebook has demonstrated a basic, minimal example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
"> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\n",
"\n",
"While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. In upcoming tutorials, we will introduce:\n",
"* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\n",
"* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
"* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
"* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🔎 Learn More About PageIndex\n",
" <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
" <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
" <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
" <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
" <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
" <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>\n",
"\n",
"<br>\n",
"\n",
"© 2025 [Vectify AI](https://vectify.ai)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}