Mirror of https://github.com/VectifyAI/PageIndex.git, synced 2026-04-24 23:56:21 +02:00
Simplify root directory
parent d7d5aed668
commit e5ac754828
10 changed files with 4 additions and 20 deletions
17 examples/tutorials/doc-search/README.md (Normal file)
@@ -0,0 +1,17 @@
## Document Search Examples

PageIndex currently enables reasoning-based RAG within a single document by default.
For users who need to search across multiple documents, we provide three best-practice workflows for different scenarios below.

* [**Search by Metadata**:](metadata.md) for documents that can be distinguished by metadata.
* [**Search by Semantics**:](semantics.md) for documents that differ in semantic content or cover diverse topics.
* [**Search by Description**:](description.md) a lightweight strategy for a small number of documents.

## 💬 Support

* 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)
* 📨 [Contact Us](https://ii2abc2jejf.typeform.com/to/meB40zV0)
67 examples/tutorials/doc-search/description.md (Normal file)
@@ -0,0 +1,67 @@
## Document Search by Description

For documents that don't have metadata, you can use LLM-generated descriptions to help with document selection. This is a lightweight approach that works best with a small number of documents.

### Example Pipeline

#### PageIndex Tree Generation

Upload all documents into PageIndex to get their `doc_id` and tree structure.

#### Description Generation

Generate a description for each document based on its PageIndex tree structure and node summaries.

```python
# PageIndex_Tree holds the tree structure (with node summaries) of one document.
prompt = f"""
You are given a table of contents structure of a document.
Your task is to generate a one-sentence description for the document that makes it easy to distinguish from other documents.

Document tree structure: {PageIndex_Tree}

Directly return the description, do not include any other text.
"""
```
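
The description step can be sketched as a small helper that fills the prompt for each document and collects the replies. This is a minimal sketch, not the PageIndex implementation: the tree shape (`title`, `node_id`, `summary`, `nodes`), the helper names, and the `call_llm` callable are all assumptions for illustration, and a stub stands in for a real LLM client.

```python
import json

PROMPT_TEMPLATE = """
You are given a table of contents structure of a document.
Your task is to generate a one-sentence description for the document that makes it easy to distinguish from other documents.

Document tree structure: {tree}

Directly return the description, do not include any other text.
"""


def build_description_prompt(tree):
    """Fill the prompt with a JSON dump of one document tree."""
    return PROMPT_TEMPLATE.format(tree=json.dumps(tree, indent=2, ensure_ascii=False))


def describe_documents(trees, call_llm):
    """Return {doc_id: description}; call_llm is any callable prompt -> text."""
    return {
        doc_id: call_llm(build_description_prompt(tree)).strip()
        for doc_id, tree in trees.items()
    }


# Hypothetical tree shape; real PageIndex trees may carry different fields.
trees = {
    "doc-1": {
        "title": "Annual Report 2023",
        "node_id": "0000",
        "summary": "Financial results and outlook.",
        "nodes": [],
    }
}


def fake_llm(prompt):
    # Stub standing in for a real LLM call.
    return "  Acme's 2023 annual report covering financial results.  "


descriptions = describe_documents(trees, fake_llm)
print(descriptions)
```

Storing the result keyed by `doc_id` keeps the descriptions ready for the selection prompt in the next step.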

#### Search with LLM

Use an LLM to select relevant documents by comparing the user query against the generated descriptions.

Below is a sample prompt for selecting documents based on their descriptions:

```python
# query is the user question. Literal braces are doubled so the f-string
# does not treat them as replacement fields.
prompt = f"""
You are given a list of documents with their IDs, file names, and descriptions. Your task is to select documents that may contain information relevant to answering the user query.

Query: {query}

Documents: [
    {{
        "doc_id": "xxx",
        "doc_name": "xxx",
        "doc_description": "xxx"
    }}
]

Response Format:
{{
    "thinking": "<Your reasoning for document selection>",
    "answer": <Python list of relevant doc_ids>, e.g. ['doc_id1', 'doc_id2']. Return [] if no documents are relevant.
}}

Return only the JSON structure, with no additional output.
"""
```
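
The reply to a selection prompt like this still has to be parsed before retrieval. The sketch below is one possible way to do that, under the assumption that the model returns a single object with an `answer` list; the function name is hypothetical. Because the prompt asks for a Python list, the reply may use single quotes, so the parser falls back from `json.loads` to `ast.literal_eval`, and it drops any ID that is not in the known set.

```python
import ast
import json


def parse_selected_doc_ids(response, known_ids):
    """Extract the "answer" list from an LLM reply and keep only known doc_ids."""
    # Take the outermost {...} span so surrounding chatter is ignored.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end < start:
        return []
    payload = response[start:end + 1]
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        try:
            # Handles Python-style literals such as ['doc_id1'].
            data = ast.literal_eval(payload)
        except (ValueError, SyntaxError):
            return []
    if not isinstance(data, dict) or not isinstance(data.get("answer"), list):
        return []
    return [d for d in data["answer"] if isinstance(d, str) and d in known_ids]


known = {"doc_a", "doc_b"}
reply = '{"thinking": "doc_a covers the topic.", "answer": [\'doc_a\', \'doc_x\']}'
selected = parse_selected_doc_ids(reply, known)
print(selected)
```

Filtering against known IDs guards against hallucinated `doc_id` values reaching the retrieval step.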

#### Retrieve with PageIndex

Use the PageIndex `doc_id` of the retrieved documents to perform further retrieval via the PageIndex retrieval API.

## 💬 Help & Community

Contact us if you need any advice on conducting document searches for your use case.

- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)
- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)
37 examples/tutorials/doc-search/metadata.md (Normal file)
@@ -0,0 +1,37 @@
## Document Search by Metadata

<callout>PageIndex with metadata support is in closed beta. Fill out this form to request early access to this feature.</callout>

For documents that can be easily distinguished by metadata, we recommend using metadata to search the documents.
This method is ideal for the following document types:

- Financial reports categorized by company and time period
- Legal documents categorized by case type
- Medical records categorized by patient or condition
- And many others

In such cases, a popular method is "Query to SQL": the user's request is translated into a SQL query over the document metadata.

### Example Pipeline

#### PageIndex Tree Generation

Upload all documents into PageIndex to get their `doc_id`.

#### Set up SQL tables

Store documents along with their metadata and the PageIndex `doc_id` in a database table.
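
As an illustration of such a table, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, columns, and sample rows are assumptions chosen for a financial-report collection; any SQL database and schema that stores the `doc_id` next to the metadata would serve.

```python
import sqlite3

# In-memory database for the example; use a file path or a server in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        doc_id      TEXT PRIMARY KEY,  -- PageIndex doc_id
        file_name   TEXT NOT NULL,
        company     TEXT,
        doc_type    TEXT,
        fiscal_year INTEGER
    )
    """
)

# Hypothetical rows; the doc_id values would come from the upload step.
rows = [
    ("pi-001", "acme_10k_2023.pdf", "Acme", "10-K", 2023),
    ("pi-002", "acme_10k_2022.pdf", "Acme", "10-K", 2022),
    ("pi-003", "globex_10k_2023.pdf", "Globex", "10-K", 2023),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)
```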

#### Query to SQL

Use an LLM to transform a user’s retrieval request into a SQL query to fetch relevant documents.
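
A sketch of this step, assuming a SQLite table named `documents` and an LLM that returns plain SQL. The schema, the helper names, and the hard-coded `generated_sql` (standing in for the model's reply) are all illustrative. Since the SQL is model-generated, the sketch refuses anything other than a single `SELECT` before executing it.

```python
import sqlite3

SCHEMA = "documents(doc_id TEXT, company TEXT, doc_type TEXT, fiscal_year INTEGER)"


def build_sql_prompt(request):
    """Prompt asking the LLM for one SELECT that returns doc_id values."""
    return (
        f"Table schema: {SCHEMA}\n"
        f"Request: {request}\n"
        "Write a single SQLite SELECT statement that returns the doc_id "
        "of every matching document. Return only the SQL."
    )


def fetch_doc_ids(conn, generated_sql):
    """Run LLM-generated SQL, refusing anything but a single SELECT."""
    statement = generated_sql.strip().rstrip(";").strip()
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    return [row[0] for row in conn.execute(statement)]


conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {SCHEMA}")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [("pi-001", "Acme", "10-K", 2023), ("pi-002", "Acme", "10-K", 2022)],
)

prompt = build_sql_prompt("Acme's 10-K for fiscal year 2023")
# Stand-in for the LLM reply to the prompt above.
generated_sql = "SELECT doc_id FROM documents WHERE company = 'Acme' AND fiscal_year = 2023;"
doc_ids = fetch_doc_ids(conn, generated_sql)
print(doc_ids)
```

The prefix check is only a first line of defense; running the query through a read-only database connection is a stronger safeguard.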

#### Retrieve with PageIndex

Use the PageIndex `doc_id` of the retrieved documents to perform further retrieval via the PageIndex retrieval API.

## 💬 Help & Community

Contact us if you need any advice on conducting document searches for your use case.

- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)
- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)
41 examples/tutorials/doc-search/semantics.md (Normal file)
@@ -0,0 +1,41 @@
## Document Search by Semantics

For documents that cover diverse topics, you can also use vector-based semantic search to select documents. The procedure differs slightly from classic vector-search-based retrieval.

### Example Pipeline

#### Chunking and Embedding

Divide the documents into chunks, choose an embedding model to convert the chunks into vectors, and store each vector with its corresponding `doc_id` in a vector database.

#### Vector Search

For each query, conduct a vector-based search to get the top-K chunks along with their corresponding documents.
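
The chunking, embedding, and search steps can be sketched end to end with a toy setup. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, the in-memory list replaces a vector database, and the chunk size is arbitrary. The point is only to show each chunk carrying its `doc_id` through to the search results.

```python
import math
from collections import Counter


def chunk_text(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs):
    """Return (doc_id, chunk, vector) triples for every chunk of every document."""
    return [
        (doc_id, chunk, embed(chunk))
        for doc_id, text in docs.items()
        for chunk in chunk_text(text)
    ]


def search(index, query, k=3):
    """Return the top-k (score, doc_id, chunk) tuples for the query."""
    q = embed(query)
    scored = [(cosine(q, vec), doc_id, chunk) for doc_id, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


docs = {
    "doc_a": "revenue grew strongly in fiscal 2023 driven by cloud revenue",
    "doc_b": "the patient presented with mild fever and cough",
}
index = build_index(docs)
hits = search(index, "cloud revenue growth", k=2)
print(hits[0][1], hits[0][2])
```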

#### Compute Document Score

For each document, calculate a relevance score. Let N be the number of retrieved chunks associated with the document, and let **ChunkScore**(n) be the relevance score of chunk n. The document score is computed as:

$$
\text{DocScore}=\frac{1}{\sqrt{N+1}}\sum_{n=1}^N \text{ChunkScore}(n)
$$

- The sum aggregates relevance from all related chunks.
- The +1 inside the square root keeps the denominator nonzero, so the formula also handles documents with zero chunks.
- Using the square root in the denominator allows the score to increase with the number of relevant chunks, but with diminishing returns. This rewards documents with more relevant chunks, while preventing large documents from dominating due to quantity alone.
- This scoring favors documents with fewer, highly relevant chunks over those with many weakly relevant ones.
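
The formula translates directly into a few lines of Python. In this sketch the input is assumed to be a list of `(doc_id, chunk_score)` pairs from the vector search; the function name and the sample scores are made up for illustration.

```python
import math
from collections import defaultdict


def doc_scores(hits):
    """Compute DocScore = sum(ChunkScore) / sqrt(N + 1) per document.

    hits: iterable of (doc_id, chunk_score) pairs from the top-K search.
    """
    by_doc = defaultdict(list)
    for doc_id, score in hits:
        by_doc[doc_id].append(score)
    return {
        doc_id: sum(scores) / math.sqrt(len(scores) + 1)
        for doc_id, scores in by_doc.items()
    }


# doc_a has two strong chunks; doc_b has four weak ones.
hits = [
    ("doc_a", 0.9), ("doc_a", 0.8),
    ("doc_b", 0.5), ("doc_b", 0.4), ("doc_b", 0.3), ("doc_b", 0.2),
]
scores = doc_scores(hits)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here `doc_a` scores 1.7 / √3 and `doc_b` scores 1.4 / √5, so the document with fewer but stronger chunks ranks first.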

#### Retrieve with PageIndex

Select the documents with the highest DocScore, then use their `doc_id` to perform further retrieval via the PageIndex retrieval API.

## 💬 Help & Community

Contact us if you need any advice on conducting document searches for your use case.

- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)
- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)
70 examples/tutorials/tree-search/README.md (Normal file)
@@ -0,0 +1,70 @@
## Tree Search Examples

This tutorial provides a basic example of how to perform retrieval using the PageIndex tree.

### Basic LLM Tree Search Example

A simple strategy is to use an LLM agent to conduct tree search. Here is a basic tree search prompt.

```python
# query is the user question; PageIndex_Tree is the document's tree structure.
prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Query: {query}

Document tree structure: {PageIndex_Tree}

Reply in the following JSON format:
{{
    "thinking": <your reasoning about which nodes are relevant>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```
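
Once the model replies, the `node_list` has to be mapped back to nodes in the tree. The following is a minimal sketch of that lookup, assuming each node is a dict with `node_id`, `title`, and a nested `nodes` list, and that the reply is valid JSON with string IDs. These shapes and helper names are assumptions for illustration, not a documented PageIndex interface.

```python
import json


def index_nodes(tree):
    """Map node_id -> node for every node in a nested tree."""
    lookup = {}
    stack = list(tree) if isinstance(tree, list) else [tree]
    while stack:
        node = stack.pop()
        lookup[node["node_id"]] = node
        stack.extend(node.get("nodes", []))
    return lookup


def select_nodes(response, tree):
    """Parse the JSON reply and return the matching nodes, skipping unknown IDs."""
    node_ids = json.loads(response).get("node_list", [])
    lookup = index_nodes(tree)
    return [lookup[n] for n in node_ids if n in lookup]


# Hypothetical tree and reply.
tree = {
    "node_id": "0000",
    "title": "10-K",
    "nodes": [
        {"node_id": "0001", "title": "Item 7. MD&A", "nodes": []},
        {"node_id": "0002", "title": "Item 8. Financial Statements", "nodes": []},
    ],
}
reply = '{"thinking": "MD&A discusses adjustments.", "node_list": ["0001", "9999"]}'
titles = [node["title"] for node in select_nodes(reply, tree)]
print(titles)
```

Skipping IDs that are not in the tree keeps a hallucinated `node_id` from breaking the downstream retrieval.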

<callout>
In our dashboard and retrieval API, we use a combination of LLM tree search and value function-based Monte Carlo Tree Search ([MCTS](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search)). More details will be released soon.
</callout>

### Integrating User Preference or Expert Knowledge

Unlike vector-based RAG, where integrating expert knowledge or user preferences requires fine-tuning the embedding model, PageIndex lets you incorporate them by simply adding the knowledge to the LLM tree search prompt. Here is an example pipeline.

#### 1. Preference Retrieval

When a query is received, the system selects the most relevant user preference or expert knowledge snippets from a database or a set of domain-specific rules. This can be done using keyword matching, semantic similarity, or LLM-based relevance search.
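
Of the three options, keyword matching is the simplest to sketch. The example below assumes preferences are stored as rules with a `keywords` list and a `text` field; that storage format and the function name are hypothetical, and the second rule is made up for illustration.

```python
def retrieve_preferences(query, rules, top_k=2):
    """Return the texts of the rules that match the most keywords in the query."""
    q = query.lower()
    scored = []
    for rule in rules:
        hits = sum(1 for kw in rule["keywords"] if kw.lower() in q)
        if hits:
            scored.append((hits, rule["text"]))
    # Stable sort: rules with equal hit counts keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


rules = [
    {
        "keywords": ["ebitda", "adjustment"],
        "text": "If the query mentions EBITDA adjustments, prioritize Item 7 (MD&A) "
                "and footnotes in Item 8 (Financial Statements) in 10-K reports.",
    },
    {
        # Hypothetical rule, included only to show a non-matching entry.
        "keywords": ["risk"],
        "text": "For risk questions, prioritize Item 1A (Risk Factors).",
    },
]

preferences = retrieve_preferences("What EBITDA adjustments were made in 2023?", rules)
print(preferences)
```

The returned snippets can then be joined into a single string and passed to the tree search prompt.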

#### 2. Tree Search with Preference

Integrate the retrieved preference into the tree search prompt.

**Enhanced Tree Search with Expert Preference Example**

```python
# Preference holds the snippets selected in the preference retrieval step.
prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Query: {query}

Document tree structure: {PageIndex_Tree}

Expert Knowledge of relevant sections: {Preference}

Reply in the following JSON format:
{{
    "thinking": <your reasoning about which nodes are relevant>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```

**Example Expert Preference**

> If the query mentions EBITDA adjustments, prioritize Item 7 (MD&A) and footnotes in Item 8 (Financial Statements) in 10-K reports.

By integrating user or expert preferences, node search becomes more targeted and effective, leveraging both the document structure and domain-specific insights.

## 💬 Help & Community

Contact us if you need any advice on conducting document searches for your use case.

- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)
- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)