diff --git a/cookbook/pageindex_RAG_simple.ipynb b/cookbook/pageindex_RAG_simple.ipynb
index fa6400d..fed785c 100644
--- a/cookbook/pageindex_RAG_simple.ipynb
+++ b/cookbook/pageindex_RAG_simple.ipynb
@@ -39,13 +39,19 @@
     "\n",
     "- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n",
     "- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n",
-    "- **No Top-K Needed**: The LLM decides how many nodes need to be retrieved.\n",
-    "- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search ('vibe retrieval').\n",
+    "- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \n",
+    "- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search ('vibe retrieval')."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 📝 About this Notebook\n",
     "\n",
-    "# 📝 About this Notebook\n",
     "This notebook demonstrates a simple example of **vectorless RAG** with PageIndex. You will learn:\n",
-    "- [x] How to generate PageIndex tree structure of a document.\n",
-    "- [x] How to perform retrieval with tree search.\n",
+    "- [x] How to build a PageIndex tree structure of a document.\n",
+    "- [x] How to perform reasoning-based retrieval with tree search.\n",
     "- [x] How to generate the answer based on the retrieved context."
    ]
   },
@@ -55,7 +61,7 @@
     "id": "7ziuTbbWcG1L"
    },
    "source": [
-    "# Preparation\n",
+    "## Preparation\n",
     "\n"
    ]
   },
@@ -65,7 +71,7 @@
     "id": "edTfrizMFK4c"
    },
    "source": [
-    "## Install Dependencies"
+    "### Install dependencies"
    ]
   },
   {
@@ -86,7 +92,7 @@
     "id": "WVEWzPKGcG1M"
    },
    "source": [
-    "## Setup Environment"
+    "### Setup environment"
    ]
   },
   {
@@ -114,7 +120,7 @@
     "id": "AR7PLeVbcG1N"
    },
    "source": [
-    "## Define Utility Functions"
+    "### Define utility functions"
    ]
   },
   {
@@ -169,7 +175,7 @@
     "id": "heGtIMOVcG1N"
    },
    "source": [
-    "# Step 1: PageIndex Tree Generation"
+    "## Step 1: PageIndex Tree Generation"
    ]
   },
   {
@@ -178,7 +184,7 @@
     "id": "Mzd1VWjwMUJL"
    },
    "source": [
-    "## Submit a document with PageIndex SDK"
+    "### Submit a document with PageIndex SDK"
    ]
   },
   {
@@ -224,7 +230,7 @@
     "id": "4-Hrh0azcG1N"
    },
    "source": [
-    "## Get the generated PageIndex tree structure"
+    "### Get the generated PageIndex tree structure"
    ]
   },
   {
@@ -329,9 +335,9 @@
     "id": "USoCLOiQcG1O"
    },
    "source": [
-    "# Step 2: Reasoning-Based Retrieval with Tree Search\n",
+    "## Step 2: Reasoning-Based Retrieval with Tree Search\n",
     "\n",
-    "#### Use LLM to search the PageIndex tree and decide which nodes may contain the relevant context."
+    "### Use LLM for tree search and decide which nodes might contain relevant context"
    ]
   },
   {
@@ -367,6 +373,13 @@
     "tree_search_result = await call_llm(search_prompt)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Print retrieved nodes and reasoning process"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -426,8 +439,6 @@
     }
    ],
    "source": [
-    "### Print retrieval nodes\n",
-    "\n",
     "node_map = create_node_mapping(tree)\n",
     "tree_search_result_json = json.loads(tree_search_result)\n",
     "\n",
@@ -446,9 +457,9 @@
     "id": "10wOZDG_cG1O"
    },
    "source": [
-    "# Step 3: Answer Generation\n",
+    "## Step 3: Answer Generation\n",
     "\n",
-    "#### Extract context from relevant nodes and generate the final answer."
+    "### Extract relevant context from retrieved nodes"
    ]
   },
   {
@@ -496,12 +507,18 @@
     }
    ],
    "source": [
-    "# Prepare Retrieved Context\n",
-    "\n",
     "node_list = json.loads(tree_search_result)[\"node_list\"]\n",
     "relevant_content = \"\\n\\n\".join(node_map[node_id][\"text\"] for node_id in node_list)\n",
+    "\n",
     "print_markdown('## Retrieved Context', '---')\n",
-    "print_markdown(f'{relevant_content[:1000]} ...')"
+    "print_markdown(relevant_content[:1000] + ' ...')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generate answer based on retrieved context"
    ]
   },
   {
@@ -548,8 +565,6 @@
     }
    ],
    "source": [
-    "# Generate Answer\n",
-    "\n",
     "answer_prompt = f\"\"\"\n",
     "Answer the question based on the context:\n",
     "\n",
@@ -572,15 +587,21 @@
    "source": [
     "# 🎯 What's Next\n",
     "\n",
-    "This notebook has demonstrated a basic example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
-    "> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context without relying on a vector database or top-k similarity search*.\n",
+    "This notebook has demonstrated a basic, minimal example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
+    "> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\n",
     "\n",
     "While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. 
In upcoming tutorials, we will introduce:\n",
-    "* **Multi-node reasoning for complex query** — Scale tree search to handle queries that require context from multiple nodes.\n",
-    "* **Multi-document search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
-    "* **Efficient Tree search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
+    "* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\n",
+    "* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
+    "* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
     "* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\n",
-    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "# 🔎 Learn More About PageIndex\n",
     " 🏠 Homepage  •  \n",
     " 🖥️ Dashboard  •  \n",
@@ -591,8 +612,7 @@
     "\n",
     "
\n",
     "\n",
-    "© 2025 [Vectify AI](https://vectify.ai)\n",
-    "\n"
+    "© 2025 [Vectify AI](https://vectify.ai)"
    ]
   }
  ],
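
Note on the "Extract relevant context from retrieved nodes" cell touched above: the step can be sketched standalone as follows. This is only an illustration under stated assumptions, not the notebook's actual helper code. The diff shows `create_node_mapping` being called but not its body, so the version here is a hypothetical stand-in, and the tree field names `node_id` and `nodes`, the `thinking` key, and the sample data are all assumed. Only `node_list`, `text`, and the join/slice expressions come from the diff itself.

```python
import json


def create_node_mapping(tree):
    """Hypothetical stand-in: flatten a nested tree into {node_id: node}.

    Assumes each node is a dict with a "node_id" key and an optional
    "nodes" list of children (field names are assumptions).
    """
    mapping = {}
    stack = list(tree)
    while stack:
        node = stack.pop()
        mapping[node["node_id"]] = node
        stack.extend(node.get("nodes", []))
    return mapping


# Made-up sample tree and LLM response, for illustration only.
tree = [
    {
        "node_id": "0001",
        "text": "Intro text.",
        "nodes": [{"node_id": "0002", "text": "Background text."}],
    },
    {"node_id": "0003", "text": "Results text."},
]
tree_search_result = json.dumps(
    {"thinking": "Sections 0002 and 0003 look relevant.", "node_list": ["0002", "0003"]}
)

# Same expressions as in the changed cell.
node_map = create_node_mapping(tree)
node_list = json.loads(tree_search_result)["node_list"]
relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

# The old f-string and the new concatenation build the same preview string.
preview = relevant_content[:1000] + " ..."
print(preview)
```

The switch from `f'{relevant_content[:1000]} ...'` to `relevant_content[:1000] + ' ...'` in the diff is behavior-preserving, so it reads as a style change only.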