{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "nD0hb4TFHWTt" }, "source": [ "
Reasoning-based RAG ✧ No Vector DB ✧ No Chunking ✧ Human-like Retrieval
\n", "\n", "\n", " 🏠 Homepage • \n", " 🖥️ Dashboard • \n", " 📚 API Docs • \n", " 📦 GitHub • \n", " 💬 Discord • \n", " ✉️ Contact \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "Ebvn5qfpcG1K" }, "source": [ "# 🧠 Simple Vectorless RAG with PageIndex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PageIndex generates a searchable tree structure of documents, enabling reasoning-based retrieval through tree search — without vectors.\n", "\n", "- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n", "- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n", "- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.\n", "- **Transparent Retrieval Process**: Retrieval is based on explicit reasoning — say goodbye to approximate semantic search (\"vibe retrieval\").\n", "\n", "## 📝 About this Notebook\n", "\n", "This notebook demonstrates a simple example of **vectorless RAG** with PageIndex. You will learn how to:\n", "- [x] Build a PageIndex tree structure of a document.\n", "- [x] Perform reasoning-based retrieval with tree search.\n", "- [x] Generate answers based on the retrieved context."
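, "\n", "\n", "As a rough outline, the three steps above can be sketched as follows (pseudocode, not runnable as-is: `search_prompt`, `answer_prompt`, and `collect_text` are hypothetical helper names, while `pi_client` and `call_llm` are defined in the Preparation section):\n", "\n", "```\n", "# Step 1: build the PageIndex tree of the document\n", "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n", "tree = pi_client.get_tree(doc_id, node_summary=True)[\"result\"]\n", "\n", "# Step 2: reasoning-based retrieval, i.e. LLM-guided tree search over node titles and summaries\n", "node_ids = call_llm(search_prompt(query, tree))\n", "\n", "# Step 3: generate an answer from the text of the selected nodes\n", "answer = call_llm(answer_prompt(query, collect_text(tree, node_ids)))\n", "```"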
] }, { "cell_type": "markdown", "metadata": { "id": "7ziuTbbWcG1L" }, "source": [ "## Preparation\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "edTfrizMFK4c" }, "source": [ "### Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "LaoB58wQFNDh" }, "outputs": [], "source": [ "%pip install -q --upgrade pageindex openai" ] }, { "cell_type": "markdown", "metadata": { "id": "WVEWzPKGcG1M" }, "source": [ "### Setup environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "StvqfcK4cG1M" }, "outputs": [], "source": [ "import os, json, openai, requests, textwrap\n", "from pageindex import PageIndexClient\n", "from pprint import pprint\n", "\n", "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n", "PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n", "OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n", "\n", "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)" ] }, { "cell_type": "markdown", "metadata": { "id": "AR7PLeVbcG1N" }, "source": [ "### Define utility functions" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "id": "hmj3POkDcG1N" }, "outputs": [], "source": [ "async def call_llm(prompt, model=\"gpt-4.1\", temperature=0):\n", " client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n", " response = await client.chat.completions.create(\n", " model=model,\n", " messages=[{\"role\": \"user\", \"content\": prompt}],\n", " temperature=temperature\n", " )\n", " return response.choices[0].message.content.strip()\n", "\n", "def remove_fields(data, fields=['text'], max_len=40):\n", " \"\"\"Drop excluded fields and truncate long strings for compact display.\"\"\"\n", " if isinstance(data, dict):\n", " return {k: remove_fields(v, fields, max_len)\n", " for k, v in data.items() if k not in fields}\n", " elif isinstance(data, list):\n", " return [remove_fields(item, fields, max_len) for item in data]\n", " elif isinstance(data, str):\n", " return (data[:max_len] + '...') if len(data) > max_len else data\n", " return data\n", "\n", "def 
print_tree(tree, exclude_fields=['text', 'page_index']):\n", " cleaned_tree = remove_fields(tree.copy(), exclude_fields)\n", " pprint(cleaned_tree, sort_dicts=False, width=100)\n", "\n", "def show(text, width=100):\n", " for line in text.splitlines():\n", " print(textwrap.fill(line, width=width))\n", "\n", "def create_node_mapping(tree):\n", " \"\"\"Create a mapping of node_id to node for quick lookup\"\"\"\n", " def get_all_nodes(tree):\n", " if isinstance(tree, dict):\n", " return [tree] + [node for child in tree.get('nodes', []) for node in get_all_nodes(child)]\n", " elif isinstance(tree, list):\n", " return [node for item in tree for node in get_all_nodes(item)]\n", " return []\n", " return {node[\"node_id\"]: node for node in get_all_nodes(tree) if node.get(\"node_id\")}" ] }, { "cell_type": "markdown", "metadata": { "id": "heGtIMOVcG1N" }, "source": [ "## Step 1: PageIndex Tree Generation" ] }, { "cell_type": "markdown", "metadata": { "id": "Mzd1VWjwMUJL" }, "source": [ "### Submit a document with PageIndex SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f6--eZPLcG1N", "outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded https://arxiv.org/pdf/2501.12948.pdf\n", "Document Submitted: pi-cmek7luf400960ao3o0o8us4d\n" ] } ], "source": [ "# You can also use our GitHub repo to generate PageIndex structure\n", "# https://github.com/VectifyAI/PageIndex\n", "\n", "pdf_url = \"https://arxiv.org/pdf/2501.12948.pdf\"\n", "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n", "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n", "\n", "response = requests.get(pdf_url)\n", "with open(pdf_path, \"wb\") as f:\n", " f.write(response.content)\n", "print(f\"Downloaded {pdf_url}\")\n", "\n", "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n", "print('Document Submitted:', doc_id)" ] }, { 
"cell_type": "markdown", "metadata": { "id": "4-Hrh0azcG1N" }, "source": [ "### Get the generated PageIndex tree structure" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "b1Q1g6vrcG1O", "outputId": "dc944660-38ad-47ea-d358-be422edbae53" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Simplified Tree Structure of the Document:\n", "[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',\n", " 'node_id': '0000',\n", " 'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',\n", " 'nodes': [{'title': 'Abstract',\n", " 'node_id': '0001',\n", " 'summary': 'The partial document introduces two reas...'},\n", " {'title': 'Contents',\n", " 'node_id': '0002',\n", " 'summary': 'This partial document provides a detaile...'},\n", " {'title': '1. Introduction',\n", " 'node_id': '0003',\n", " 'prefix_summary': 'The partial document introduces recent a...',\n", " 'nodes': [{'title': '1.1. Contributions',\n", " 'node_id': '0004',\n", " 'summary': 'This partial document outlines the main ...'},\n", " {'title': '1.2. Summary of Evaluation Results',\n", " 'node_id': '0005',\n", " 'summary': 'The partial document provides a summary ...'}]},\n", " {'title': '2. Approach',\n", " 'node_id': '0006',\n", " 'prefix_summary': '## 2. Approach\\n',\n", " 'nodes': [{'title': '2.1. Overview',\n", " 'node_id': '0007',\n", " 'summary': '### 2.1. Overview\\n\\nPrevious work has hea...'},\n", " {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Lea...',\n", " 'node_id': '0008',\n", " 'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement...',\n", " 'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\n", " 'node_id': '0009',\n", " 'summary': 'This partial document describes the Grou...'},\n", " {'title': '2.2.2. Reward Modeling',\n", " 'node_id': '0010',\n", " 'summary': 'This partial document discusses the rewa...'},\n", " {'title': '2.2.3. 
Training Template',\n", " 'node_id': '0011',\n", " 'summary': '#### 2.2.3. Training Template\\n\\nTo train ...'},\n", " {'title': '2.2.4. Performance, Self-evolution Proce...',\n", " 'node_id': '0012',\n", " 'summary': 'This partial document discusses the perf...'}]},\n", " {'title': '2.3. DeepSeek-R1: Reinforcement Learning...',\n", " 'node_id': '0013',\n", " 'summary': 'This partial document describes the trai...'},\n", " {'title': '2.4. Distillation: Empower Small Models ...',\n", " 'node_id': '0014',\n", " 'summary': 'This partial document discusses the proc...'}]},\n", " {'title': '3. Experiment',\n", " 'node_id': '0015',\n", " 'prefix_summary': 'The partial document describes the exper...',\n", " 'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\n", " 'node_id': '0016',\n", " 'summary': 'This partial document presents a compreh...'},\n", " {'title': '3.2. Distilled Model Evaluation',\n", " 'node_id': '0017',\n", " 'summary': 'This partial document presents an evalua...'}]},\n", " {'title': '4. Discussion',\n", " 'node_id': '0018',\n", " 'summary': 'This partial document discusses the comp...'},\n", " {'title': '5. Conclusion, Limitations, and Future W...',\n", " 'node_id': '0019',\n", " 'summary': 'This partial document presents the concl...'},\n", " {'title': 'References',\n", " 'node_id': '0020',\n", " 'summary': 'The partial document consists of a compr...'},\n", " {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\n'},\n", " {'title': 'A. 
Contributions and Acknowledgments',\n", " 'node_id': '0022',\n", " 'summary': 'This partial document section details th...'}]}]\n" ] } ], "source": [ "if pi_client.is_retrieval_ready(doc_id):\n", " tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n", " print('Simplified Tree Structure of the Document:')\n", " print_tree(tree)\n", "else:\n", " print(\"Processing document, please try again later...\")" ] }, { "cell_type": "markdown", "metadata": { "id": "USoCLOiQcG1O" }, "source": [ "## Step 2: Reasoning-Based Retrieval with Tree Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use LLM for tree search and identify nodes that might contain relevant context" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "LLHNJAtTcG1O" }, "outputs": [], "source": [ "query = \"What are the conclusions in this document?\"\n", "\n", "tree_without_text = remove_fields(tree.copy(), fields=['text'])\n", "\n", "search_prompt = f\"\"\"\n", "You are given a question and a tree structure of a document.\n", "Each node contains a node id, node title, and a corresponding summary.\n", "Your task is to find all nodes that are likely to contain the answer to the question.\n", "\n", "Question: {query}\n", "\n", "Document tree structure:\n", "{json.dumps(tree_without_text, indent=2)}\n", "\n", "Please reply in the following JSON format:\n", "{{\n", " \"thinking\": \"