PageIndex/cookbook/pageindex_RAG_simple.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TCh9BTedHJK1"
      },
      "source": [
        "![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nD0hb4TFHWTt"
      },
      "source": [
        "  <p align=\"center\"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p>\n",
        "\n",
        "<p align=\"center\">\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
        "</p>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ebvn5qfpcG1K"
      },
      "source": [
        "# 🧠 Simple Vectorless RAG with PageIndex\n",
        "\n",
        "PageIndex generates a searchable tree structure of documents, enabling reasoning-based retrieval through tree search — without vectors.\n",
        "\n",
        "- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n",
        "- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n",
        "- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \n",
        "- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search ('vibe retrieval')."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 📝 About this Notebook\n",
        "\n",
        "This notebook demonstrates a simple example of **vectorless RAG** with PageIndex. You will learn:\n",
        "- [x] How to build a PageIndex tree structure of a document.\n",
        "- [x] How to perform reasoning-based retrieval with tree search.\n",
        "- [x] How to generate the answer based on the retrieved context."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7ziuTbbWcG1L"
      },
      "source": [
        "## Preparation\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "edTfrizMFK4c"
      },
      "source": [
        "### Install dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "LaoB58wQFNDh"
      },
      "outputs": [],
      "source": [
        "%pip install -q --upgrade pageindex openai"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WVEWzPKGcG1M"
      },
      "source": [
        "### Setup environment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "StvqfcK4cG1M"
      },
      "outputs": [],
      "source": [
        "import os, json, openai, requests\n",
        "from pageindex import PageIndexClient\n",
        "from pprint import pprint\n",
        "from IPython.display import Markdown, display\n",
        "\n",
        "PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"  # Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
        "OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n",
        "\n",
        "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AR7PLeVbcG1N"
      },
      "source": [
        "### Define utility functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 137,
      "metadata": {
        "id": "hmj3POkDcG1N"
      },
      "outputs": [],
      "source": [
        "async def call_llm(prompt, model=\"gpt-4.1\", temperature=0):\n",
        "    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n",
        "    response = await client.chat.completions.create(\n",
        "        model=model,\n",
        "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
        "        temperature=temperature\n",
        "    )\n",
        "    return response.choices[0].message.content.strip()\n",
        "\n",
        "def remove_fields(data, fields=['text'], max_len=50):\n",
        "    if isinstance(data, dict):\n",
        "        return {k: remove_fields(v, fields)\n",
        "            for k, v in data.items() if k not in fields}\n",
        "    elif isinstance(data, list):\n",
        "        return [remove_fields(item, fields) for item in data]\n",
        "    elif isinstance(data, str):\n",
        "        return (data[:max_len] + '...') if len(data) > max_len else data\n",
        "    return data\n",
        "\n",
        "def print_tree(tree, exclude_fields=['text', 'page_index']):\n",
        "    cleaned_tree = remove_fields(tree.copy(), exclude_fields)\n",
        "    pprint(cleaned_tree, sort_dicts=False, width=150)\n",
        "\n",
        "def print_markdown(*lines):\n",
        "    text = \"\\n\".join(lines)\n",
        "    display(Markdown(text))\n",
        "\n",
        "def create_node_mapping(tree):\n",
        "    \"\"\"Create a mapping of node_id to node for quick lookup\"\"\"\n",
        "    def get_all_nodes(tree):\n",
        "        if isinstance(tree, dict):\n",
        "            return [tree] + [node for child in tree.get('nodes', []) for node in get_all_nodes(child)]\n",
        "        elif isinstance(tree, list):\n",
        "            return [node for item in tree for node in get_all_nodes(item)]\n",
        "        return []\n",
        "    return {node[\"node_id\"]: node for node in get_all_nodes(tree) if node.get(\"node_id\")}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "heGtIMOVcG1N"
      },
      "source": [
        "## Step 1: PageIndex Tree Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Mzd1VWjwMUJL"
      },
      "source": [
        "### Submit a document with PageIndex SDK"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "f6--eZPLcG1N",
        "outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloaded https://arxiv.org/pdf/2501.12948.pdf\n",
            "Document Submitted: pi-cmek7luf400960ao3o0o8us4d\n"
          ]
        }
      ],
      "source": [
        "# You can also use our GitHub repo to generate PageIndex structure\n",
        "# https://github.com/VectifyAI/PageIndex\n",
        "\n",
        "pdf_url = \"https://arxiv.org/pdf/2501.12948.pdf\"\n",
        "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
        "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
        "\n",
        "response = requests.get(pdf_url)\n",
        "with open(pdf_path, \"wb\") as f:\n",
        "    f.write(response.content)\n",
        "print(f\"Downloaded {pdf_url}\")\n",
        "\n",
        "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
        "print('Document Submitted:', doc_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4-Hrh0azcG1N"
      },
      "source": [
        "### Get the generated PageIndex tree structure"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 138,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "b1Q1g6vrcG1O",
        "outputId": "dc944660-38ad-47ea-d358-be422edbae53"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "## Simplified Tree Structure of the Document\n",
              "---"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[{'title': 'DeepSeek-R1: Incentivizing Reasoning Capability in...',\n",
            "  'node_id': '0000',\n",
            "  'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning Capability ...',\n",
            "  'nodes': [{'title': 'Abstract', 'node_id': '0001', 'summary': 'The partial document introduces two reasoning mode...'},\n",
            "            {'title': 'Contents', 'node_id': '0002', 'summary': 'This partial document provides a detailed table of...'},\n",
            "            {'title': '1. Introduction',\n",
            "             'node_id': '0003',\n",
            "             'prefix_summary': 'The partial document introduces recent advancement...',\n",
            "             'nodes': [{'title': '1.1. Contributions', 'node_id': '0004', 'summary': 'This partial document outlines the main contributi...'},\n",
            "                       {'title': '1.2. Summary of Evaluation Results',\n",
            "                        'node_id': '0005',\n",
            "                        'summary': 'The partial document provides a summary of evaluat...'}]},\n",
            "            {'title': '2. Approach',\n",
            "             'node_id': '0006',\n",
            "             'prefix_summary': '## 2. Approach\\n',\n",
            "             'nodes': [{'title': '2.1. Overview', 'node_id': '0007', 'summary': '### 2.1. Overview\\n\\nPrevious work has heavily relie...'},\n",
            "                       {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Learning on t...',\n",
            "                        'node_id': '0008',\n",
            "                        'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement Learning ...',\n",
            "                        'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\n",
            "                                   'node_id': '0009',\n",
            "                                   'summary': 'This partial document describes the Group Relative...'},\n",
            "                                  {'title': '2.2.2. Reward Modeling',\n",
            "                                   'node_id': '0010',\n",
            "                                   'summary': 'This partial document discusses the reward modelin...'},\n",
            "                                  {'title': '2.2.3. Training Template',\n",
            "                                   'node_id': '0011',\n",
            "                                   'summary': '#### 2.2.3. Training Template\\n\\nTo train DeepSeek-R...'},\n",
            "                                  {'title': '2.2.4. Performance, Self-evolution Process and Aha...',\n",
            "                                   'node_id': '0012',\n",
            "                                   'summary': 'This partial document discusses the performance, s...'}]},\n",
            "                       {'title': '2.3. DeepSeek-R1: Reinforcement Learning with Cold...',\n",
            "                        'node_id': '0013',\n",
            "                        'summary': 'This partial document describes the training pipel...'},\n",
            "                       {'title': '2.4. Distillation: Empower Small Models with Reaso...',\n",
            "                        'node_id': '0014',\n",
            "                        'summary': 'This partial document discusses the process of dis...'}]},\n",
            "            {'title': '3. Experiment',\n",
            "             'node_id': '0015',\n",
            "             'prefix_summary': 'The partial document describes the experimental se...',\n",
            "             'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\n",
            "                        'node_id': '0016',\n",
            "                        'summary': 'This partial document presents a comprehensive eva...'},\n",
            "                       {'title': '3.2. Distilled Model Evaluation',\n",
            "                        'node_id': '0017',\n",
            "                        'summary': 'This partial document presents an evaluation of va...'}]},\n",
            "            {'title': '4. Discussion', 'node_id': '0018', 'summary': 'This partial document discusses the comparative ef...'},\n",
            "            {'title': '5. Conclusion, Limitations, and Future Work',\n",
            "             'node_id': '0019',\n",
            "             'summary': 'This partial document presents the conclusion, lim...'},\n",
            "            {'title': 'References', 'node_id': '0020', 'summary': 'The partial document consists of a comprehensive r...'},\n",
            "            {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\n'},\n",
            "            {'title': 'A. Contributions and Acknowledgments',\n",
            "             'node_id': '0022',\n",
            "             'summary': 'This partial document section details the contribu...'}]}]\n"
          ]
        }
      ],
      "source": [
        "if pi_client.is_retrieval_ready(doc_id):\n",
        "    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
        "    print_markdown('## Simplified Tree Structure of the Document', '---')\n",
        "    print_tree(tree)\n",
        "else:\n",
        "    print(\"Processing document, please try again later...\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "USoCLOiQcG1O"
      },
      "source": [
        "## Step 2: Reasoning-Based Retrieval with Tree Search\n",
        "\n",
        "### Use LLM for tree search and decide which nodes might contain relevant context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LLHNJAtTcG1O"
      },
      "outputs": [],
      "source": [
        "query = \"What are the conclusions in this document?\"\n",
        "\n",
        "tree_without_text = remove_fields(tree.copy(), fields=['text'])\n",
        "\n",
        "search_prompt = f\"\"\"\n",
        "You are given a question and a tree structure of a document.\n",
        "Each node contains a node id, node title, and a corresponding summary.\n",
        "Your task is to find all nodes that are likely to contain the answer to the question.\n",
        "\n",
        "Question: {query}\n",
        "\n",
        "Document tree structure:\n",
        "{json.dumps(tree_without_text, indent=2)}\n",
        "\n",
        "Please reply in the following JSON format:\n",
        "{{\n",
        "    \"thinking\": \"<Your thinking process on which nodes are relevant to the question>\",\n",
        "    \"node_list\": [\"node_id_1\", \"node_id_2\", ..., \"node_id_n\"]\n",
        "}}\n",
        "Directly return the final JSON structure. Do not output anything else.\n",
        "\"\"\"\n",
        "\n",
        "tree_search_result = await call_llm(search_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Print retrieved nodes and reasoning process"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "P8DVUOuAen5u",
        "outputId": "6bb6d052-ef30-4716-f88e-be98bcb7ebdb"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "## Reasoning Process\n",
              "---"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/markdown": [
              "The question asks for the conclusions in the document. The most direct and relevant node is '5. Conclusion, Limitations, and Future Work' (node_id: 0019), as it explicitly contains the conclusion section. Additionally, the 'Abstract' (node_id: 0001) often summarizes the main findings and conclusions, and the 'Discussion' (node_id: 0018) may also contain concluding remarks or synthesis of results. However, the primary and most comprehensive source for conclusions is node 0019."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/markdown": [
              "## Retrieved Nodes\n",
              "---"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Node ID: 0019\t Page: 16\t Title: 5. Conclusion, Limitations, and Future Work\n"
          ]
        }
      ],
      "source": [
        "node_map = create_node_mapping(tree)\n",
        "tree_search_result_json = json.loads(tree_search_result)\n",
        "\n",
        "print_markdown('## Reasoning Process', '---')\n",
        "print_markdown(tree_search_result_json['thinking'])\n",
        "\n",
        "print_markdown('## Retrieved Nodes', '---')\n",
        "for node_id in tree_search_result_json[\"node_list\"]:\n",
        "    node = node_map[node_id]\n",
        "    print(f\"Node ID: {node['node_id']}\\t Page: {node['page_index']}\\t Title: {node['title']}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "10wOZDG_cG1O"
      },
      "source": [
        "## Step 3: Answer Generation\n",
        "\n",
        "### Extract relevant context from retrieved nodes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 279
        },
        "id": "a7UCBnXlcG1O",
        "outputId": "8a026ea3-4ef3-473a-a57b-b4565409749e"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "## Retrieved Context\n",
              "---"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/markdown": [
              "## 5. Conclusion, Limitations, and Future Work\n",
              "\n",
              "In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.\n",
              "\n",
              "We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with $28.9 \\%$ on AIME and $83.9 \\%$ on MATH. Other dense models also achieve impressive results, significantly outperforming other instructiontuned models based on the same underlying checkpoints.\n",
              "\n",
              "In the fut ..."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "node_list = json.loads(tree_search_result)[\"node_list\"]\n",
        "relevant_content = \"\\n\\n\".join(node_map[node_id][\"text\"] for node_id in node_list)\n",
        "\n",
        "print_markdown('## Retrieved Context', '---')\n",
        "print_markdown(relevant_content[:1000] + ' ...')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Generate answer based on retrieved context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 210
        },
        "id": "tcp_PhHzcG1O",
        "outputId": "187ff116-9bb0-4ab4-bacb-13944460b5ff"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "## Generated Answer\n",
              "---"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/markdown": [
              "**Conclusions in this document:**\n",
              "\n",
              "- DeepSeek-R1-Zero, a pure reinforcement learning (RL) model without cold-start data, achieves strong performance across various tasks.\n",
              "- DeepSeek-R1, which combines cold-start data with iterative RL fine-tuning, is even more powerful and achieves performance comparable to OpenAI-o1-1217 on a range of tasks.\n",
              "- The reasoning capabilities of DeepSeek-R1 can be successfully distilled into smaller dense models, with DeepSeek-R1-Distill-Qwen-1.5B outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks.\n",
              "- Other small dense models fine-tuned with DeepSeek-R1 data also significantly outperform other instruction-tuned models based on the same checkpoints."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "answer_prompt = f\"\"\"\n",
        "Answer the question based on the context:\n",
        "\n",
        "Question: {query}\n",
        "Context: {relevant_content}\n",
        "\n",
        "Provide a clear, concise answer based only on the context provided.\n",
        "\"\"\"\n",
        "\n",
        "print_markdown('## Generated Answer', '---')\n",
        "answer = await call_llm(answer_prompt)\n",
        "print_markdown(answer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_1kaGD3GcG1O"
      },
      "source": [
        "# 🎯 What's Next\n",
        "\n",
        "This notebook has demonstrated a basic, minimal example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
        "> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\n",
        "\n",
        "While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. In upcoming tutorials, we will introduce:\n",
        "* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\n",
        "* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
        "* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
        "* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 🔎 Learn More About PageIndex\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/vectifyai/pageindex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>\n",
        "\n",
        "<br>\n",
        "\n",
        "© 2025 [Vectify AI](https://vectify.ai)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}