mirror of
https://github.com/VectifyAI/PageIndex.git
synced 2026-04-24 23:56:21 +02:00
275 lines
10 KiB
Text
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XTboY7brzyp2"
   },
   "source": [
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EtjMbl9Pz3S-"
   },
   "source": [
    "<p align=\"center\">Reasoning-based RAG ◦ No Vector DB ◦ No Chunking ◦ Human-like Retrieval</p>\n",
    "\n",
    "<p align=\"center\">\n",
    "  <a href=\"https://vectify.ai\">🏠 Homepage</a> • \n",
    "  <a href=\"https://chat.pageindex.ai\">🖥️ Platform</a> • \n",
    "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a> • \n",
    "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a> • \n",
    "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a> • \n",
    "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a> \n",
    "</p>\n",
    "\n",
    "<div align=\"center\">\n",
    "\n",
    "[](https://github.com/VectifyAI/PageIndex) [](https://twitter.com/VectifyAI)\n",
    "\n",
    "</div>\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bbC9uLWCz8zl"
   },
   "source": [
    "# Document QA with PageIndex Chat API\n",
    "\n",
    "Similarity-based RAG built on vector databases has shown significant limitations in recent AI applications, and reasoning-based or agentic retrieval has become increasingly important.\n",
    "\n",
    "[PageIndex Chat](https://chat.pageindex.ai/) is an AI assistant that lets you chat with multiple very long documents without worrying about limited context windows or context rot. It is built on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless, reasoning-based RAG framework that aims to give more transparent and reliable results, much as a human expert would.\n",
    "<div align=\"center\">\n",
    "  <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
    "</div>\n",
    "\n",
    "You can now access PageIndex Chat through its API or SDK.\n",
    "\n",
    "## 📝 Notebook Overview\n",
    "\n",
    "This notebook demonstrates a minimal example of document analysis with the PageIndex Chat API, using the recently released [NVIDIA 10-Q report](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "77SQbPoe-LTN"
   },
   "source": [
    "### Install PageIndex SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "6Eiv_cHf0OXz"
   },
   "outputs": [],
   "source": [
    "%pip install -q --upgrade pageindex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UR9-qkdD-Om7"
   },
   "source": [
    "### Set up PageIndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "id": "AFzsW4gq0fjh"
   },
   "outputs": [],
   "source": [
    "from pageindex import PageIndexClient\n",
    "\n",
    "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
    "PAGEINDEX_API_KEY = \"Your API KEY\"\n",
    "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uvzf9oWL-Ts9"
   },
   "source": [
    "### Upload a document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "qf7sNRoL0hGw",
    "outputId": "e8c2f3c1-1d1e-4932-f8e9-3272daae6781"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloaded https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\n",
      "Document Submitted: pi-cmi73f7r7022y09nwn40paaom\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "import requests\n",
    "\n",
    "pdf_url = \"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\"\n",
    "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
    "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
    "\n",
    "# Download the PDF; fail early on an HTTP error instead of saving an error page\n",
    "response = requests.get(pdf_url, timeout=60)\n",
    "response.raise_for_status()\n",
    "with open(pdf_path, \"wb\") as f:\n",
    "    f.write(response.content)\n",
    "print(f\"Downloaded {pdf_url}\")\n",
    "\n",
    "# Submit the local file to PageIndex and keep the returned document ID\n",
    "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
    "print('Document Submitted:', doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "U4hpLB4T-fCt"
   },
   "source": [
    "### Check the processing status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "PB1S_CWd2n87",
    "outputId": "c1416161-a1d6-4f9e-873c-7f6e26c8fa5f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'createdAt': '2025-11-20T07:11:44.669000',\n",
      " 'description': \"This document is NVIDIA Corporation's Form 10-Q Quarterly \"\n",
      "                'Report for the period ending October 26, 2025, detailing its '\n",
      "                'financial performance, operational results, market risks, and '\n",
      "                'legal proceedings.',\n",
      " 'id': 'pi-cmi73f7r7022y09nwn40paaom',\n",
      " 'name': '13e6981b-95ed-4aac-a602-ebc5865d0590.pdf',\n",
      " 'pageNum': 48,\n",
      " 'status': 'completed'}\n",
      "\n",
      " Document ready! (48 pages)\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "# Fetch the document's metadata, including its processing status\n",
    "doc_info = pi_client.get_document(doc_id)\n",
    "pprint(doc_info)\n",
    "\n",
    "if doc_info['status'] == 'completed':\n",
    "    print(f\"\\n Document ready! ({doc_info['pageNum']} pages)\")\n",
    "elif doc_info['status'] == 'processing':\n",
    "    print(\"\\n Document is still processing. Please wait and check again.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z1C9FOvO-p1m"
   },
   "source": [
    "### Ask a question about this document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "X3RbQvy_0nt7",
    "outputId": "a8bb5468-3f5f-4225-a5fd-60004041a644"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I'll help you find the revenue information from NVIDIA's 10-Q report. Let me first check the document structure to locate the financial information.{\"doc_name\": \"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\"}Now let me get the content from pages that contain the revenue information. Based on the structure, revenue information should be in the financial statements section, which appears early in the document. Let me check pages 3-8 which should contain the Condensed Consolidated Statements of Income.{\"doc_name\": \"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\", \"pages\": \"3-5\"}Perfect! I found the revenue information. Let me provide you with the details:\n",
      "\n",
      "## Revenue Information from NVIDIA's 10-Q Report\n",
      "\n",
      "**You can find the revenue on Page 3** of the document in the **Condensed Consolidated Statements of Income**.\n",
      "\n",
      "### Revenue Details:\n",
      "\n",
      "**For the Three Months Ended October 26, 2025 (Q3 FY2026):**\n",
      "- **Revenue: $57,006 million** ($57.0 billion)\n",
      "- Compared to Q3 FY2025 (October 27, 2024): $35,082 million\n",
      "\n",
      "**For the Nine Months Ended October 26, 2025 (First 9 months of FY2026):**\n",
      "- **Revenue: $147,811 million** ($147.8 billion)\n",
      "- Compared to same period FY2025: $91,166 million\n",
      "\n",
      "### Key Highlights:\n",
      "- Q3 revenue increased by **62.5%** year-over-year ($21.9 billion increase)\n",
      "- Nine-month revenue increased by **62.1%** year-over-year ($56.6 billion increase)\n",
      "- This represents strong growth driven primarily by Data Center compute and networking platforms for AI and accelerated computing, with Blackwell architectures being a major contributor\n",
      "\n",
      "The revenue figures are clearly displayed at the top of the Condensed Consolidated Statements of Income on **Page 3** of the 10-Q report."
     ]
    }
   ],
   "source": [
    "query = \"what is the revenue? Also show me which page I can find it.\"\n",
    "\n",
    "# Stream the answer and print each chunk as it arrives\n",
    "for chunk in pi_client.chat_completions(\n",
    "    messages=[{\"role\": \"user\", \"content\": query}],\n",
    "    doc_id=doc_id,\n",
    "    stream=True\n",
    "):\n",
    "    print(chunk, end='', flush=True)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}