first commit

mingtian 2025-04-01 18:54:08 +08:00
commit 6f43b477d3
17 changed files with 4529 additions and 0 deletions

15
.gitignore vendored Normal file

@ -0,0 +1,15 @@
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*

21
LICENSE Normal file

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Vectify AI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

136
README.md Normal file

@ -0,0 +1,136 @@
# PageIndex
### **Document Index System for Reasoning-Based RAG**
Traditional vector-based retrieval relies heavily on semantic similarity. But when working with professional documents that require domain expertise and multi-step reasoning, similarity search often falls short.
**Reasoning-Based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by **AlphaGo**, we leverage **tree search** to perform structured document retrieval.
**PageIndex** is an indexing system that builds search trees from long documents, making them ready for reasoning-based RAG.
Built by [Vectify AI](https://vectify.ai/pageindex)
---
## 🔍 What is PageIndex?
**PageIndex** transforms lengthy PDF documents into a semantic **tree structure**, similar to a "table of contents" but optimized for use with Large Language Models (LLMs).
It's ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any other document that exceeds LLM context limits.
### ✅ Key Features
- **Scales to Massive Documents**
Designed to handle hundreds or even thousands of pages with ease.
- **Hierarchical Tree Structure**
Enables LLMs to traverse documents logically—like an intelligent, LLM-optimized table of contents.
- **Precise Page Referencing**
Every node carries its own summary and the physical page indices where it starts and ends, allowing pinpoint retrieval.
- **Chunk-Free Segmentation**
No arbitrary chunking. Nodes follow the natural structure of the document.
---
## 📦 PageIndex Format
Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).
```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "child_nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
```
Note: generation of the `node_id` and `summary` fields will be added soon.
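As a rough illustration of how such a tree can be consumed, the sketch below walks the example node in document order and looks up which nodes cover a given page. The helper names `iter_nodes` and `nodes_covering_page` are hypothetical (they are not part of PageIndex), and treating `start_index` and `end_index` as inclusive is an assumption.

```python
from typing import Iterator, List, Tuple

# Node copied from the example output above; summaries omitted for brevity.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "child_nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
        },
    ],
}


def iter_nodes(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("child_nodes", []):
        yield from iter_nodes(child, depth + 1)


def nodes_covering_page(node: dict, page: int) -> List[str]:
    """Return the node_ids whose page range contains `page`.

    Assumption: start_index and end_index are both inclusive.
    """
    return [
        n["node_id"]
        for _, n in iter_nodes(node)
        if n["start_index"] <= page <= n["end_index"]
    ]


# Print an indented outline, then look up a page.
for depth, n in iter_nodes(tree):
    indent = "  " * depth
    print(f'{indent}{n["node_id"]} {n["title"]} (pages {n["start_index"]}-{n["end_index"]})')

print(nodes_covering_page(tree, 28))
```

Because adjacent sections can share a boundary page, a single page may map to more than one node.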
## 🧠 Reasoning-Based RAG with PageIndex
Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. Great for domain-specific tasks where nuance matters.
### 🛠️ Example Prompt
```python
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Question: {question}
Document tree structure: {structure}
Reply in the following JSON format:
{{
"thinking": <reasoning about where to look>,
"node_list": [node_id1, node_id2, ...]
}}
"""
```
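The prompt above can be wrapped in a pair of small helpers: one that fills in the question and tree, and one that pulls `node_list` out of the model's reply. This is a minimal sketch, not the project's implementation. `build_prompt` and `parse_node_list` are hypothetical names, the canned reply is made up for illustration, and no model is actually called.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def build_prompt(question: str, structure: list) -> str:
    """Fill the example prompt with a question and a serialized tree."""
    return f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Question: {question}
Document tree structure: {json.dumps(structure, indent=2)}
Reply in the following JSON format:
{{
"thinking": <reasoning about where to look>,
"node_list": [node_id1, node_id2, ...]
}}
"""


def parse_node_list(reply: str) -> list:
    """Extract node ids from a reply, tolerating a fenced JSON wrapper."""
    text = reply.strip()
    start = text.find(FENCE + "json")
    if start != -1:
        text = text[start + len(FENCE + "json"):]
        end = text.rfind(FENCE)
        if end != -1:
            text = text[:end]
    data = json.loads(text)
    return [str(node_id) for node_id in data.get("node_list", [])]


structure = [{"title": "Financial Stability", "node_id": "0006"}]
prompt = build_prompt("How does the Fed monitor vulnerabilities?", structure)

# Made-up reply standing in for a real model response.
canned_reply = (
    FENCE + "json\n"
    + '{"thinking": "Look under Financial Stability.", "node_list": ["0007", "0008"]}\n'
    + FENCE
)
print(parse_node_list(canned_reply))
```

In practice the string returned by `build_prompt` would be sent to whichever chat model is in use, and the returned ids would then be mapped back to page ranges in the tree.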
## 🚀 Usage
Follow these steps to generate a PageIndex tree from a PDF document.
### 1. Install dependencies
```bash
pip3 install -r requirements.txt
```
### 2. Set your OpenAI API key
Create a `.env` file in the root directory and add your API key:
```bash
CHATGPT_API_KEY=your_openai_key_here
```
### 3. Run PageIndex on your PDF
```bash
python3 page_index.py --pdf_path /path/to/your/document.pdf
```
The results will be saved in the `./results/` directory.
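A generated tree may contain a node whose `end_index` is smaller than its `start_index`, so a quick sanity check on the saved output can be worthwhile. The sketch below is one way to do that. `find_inverted_ranges` is a hypothetical helper, and the sample data only mimics the shape of a generated tree, with placeholder titles. To check a real result, load the JSON file from `./results/` with `json.load` and pass the resulting list in.

```python
def find_inverted_ranges(nodes, path=()):
    """Return (title path, start_index, end_index) for every node whose
    end_index precedes its start_index, searching child_nodes recursively."""
    problems = []
    for node in nodes:
        here = path + (node["title"],)
        if node["end_index"] < node["start_index"]:
            problems.append((" > ".join(here), node["start_index"], node["end_index"]))
        problems.extend(find_inverted_ranges(node.get("child_nodes", []), here))
    return problems


# Illustrative data in the shape of a generated tree; titles are placeholders.
sample = [
    {
        "title": "Statistical Tables",
        "start_index": 187,
        "end_index": 187,
        "child_nodes": [
            {"title": "Table A", "start_index": 187, "end_index": 188},
            {"title": "Table B", "start_index": 189, "end_index": 188},
        ],
    }
]

for title_path, start, end in find_inverted_ranges(sample):
    print(f"{title_path}: start_index {start} > end_index {end}")
```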
## 🛤 Roadmap
- [ ] Add node summary and document selection
- [ ] Technical report on PageIndex design
- [ ] Efficient tree search algorithms for large documents
- [ ] Integration with vector-based semantic retrieval
## 📈 Case Study: Mafin 2.5
[Mafin 2.5](https://vectify.ai/blog/Mafin2.5) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of **PageIndex**, it achieved an impressive **98.7% accuracy** on the [FinanceBench](https://github.com/VectifyAI/Mafin2.5-FinanceBench) benchmark—significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.
👉 See full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) for detailed comparisons and performance metrics.
## 📬 Contact Us
Need customized support for your documents or reasoning-based RAG system?
👉 [Contact us here](https://ii2abc2jejf.typeform.com/to/meB40zV0)

0
__init__.py Normal file

BIN
docs/2023-annual-report.pdf Normal file

Binary file not shown.

BIN
docs/PRML.pdf Normal file

Binary file not shown.

Binary file not shown.

Binary file not shown.

BIN
docs/q1-fy25-earnings.pdf Normal file

Binary file not shown.

1073
page_index.py Normal file

File diff suppressed because it is too large

5
requirements.txt Normal file

@ -0,0 +1,5 @@
openai==1.70.0
pymupdf==1.25.5
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.7.0


@ -0,0 +1,460 @@
[
{
"title": "Preface",
"start_index": 1,
"end_index": 4
},
{
"title": "About the Federal Reserve",
"start_index": 5,
"end_index": 7
},
{
"title": "Overview",
"start_index": 7,
"end_index": 8
},
{
"title": "Monetary Policy and Economic Developments",
"start_index": 9,
"end_index": 9,
"child_nodes": [
{
"title": "March 2024 Summary",
"start_index": 9,
"end_index": 14
},
{
"title": "June 2023 Summary",
"start_index": 15,
"end_index": 20
}
]
},
{
"title": "Financial Stability",
"start_index": 21,
"end_index": 21,
"child_nodes": [
{
"title": "Monitoring Financial Vulnerabilities",
"start_index": 22,
"end_index": 28
},
{
"title": "Domestic and International Cooperation and Coordination",
"start_index": 28,
"end_index": 31
}
]
},
{
"title": "Supervision and Regulation",
"start_index": 31,
"end_index": 31,
"child_nodes": [
{
"title": "Supervised and Regulated Institutions",
"start_index": 32,
"end_index": 35
},
{
"title": "Supervisory Developments",
"start_index": 35,
"end_index": 54
},
{
"title": "Regulatory Developments",
"start_index": 55,
"end_index": 59
}
]
},
{
"title": "Payment System and Reserve Bank Oversight",
"start_index": 59,
"end_index": 59,
"child_nodes": [
{
"title": "Payment Services to Depository and Other Institutions",
"start_index": 60,
"end_index": 65
},
{
"title": "Currency and Coin",
"start_index": 66,
"end_index": 68
},
{
"title": "Fiscal Agency and Government Depository Services",
"start_index": 69,
"end_index": 72
},
{
"title": "Evolutions and Improvements to the System",
"start_index": 72,
"end_index": 75
},
{
"title": "Oversight of Federal Reserve Banks",
"start_index": 75,
"end_index": 81
},
{
"title": "Pro Forma Financial Statements for Federal Reserve Priced Services",
"start_index": 82,
"end_index": 88
}
]
},
{
"title": "Consumer and Community Affairs",
"start_index": 89,
"end_index": 89,
"child_nodes": [
{
"title": "Consumer Compliance Supervision",
"start_index": 89,
"end_index": 101
},
{
"title": "Consumer Laws and Regulations",
"start_index": 101,
"end_index": 102
},
{
"title": "Consumer Research and Analysis of Emerging Issues and Policy",
"start_index": 102,
"end_index": 105
},
{
"title": "Community Development",
"start_index": 105,
"end_index": 106
}
]
},
{
"title": "Appendixes",
"start_index": 107,
"end_index": 108
},
{
"title": "Federal Reserve System Organization",
"start_index": 109,
"end_index": 109,
"child_nodes": [
{
"title": "Board of Governors",
"start_index": 109,
"end_index": 116
},
{
"title": "Federal Open Market Committee",
"start_index": 117,
"end_index": 118
},
{
"title": "Board of Governors Advisory Councils",
"start_index": 119,
"end_index": 122
},
{
"title": "Federal Reserve Banks and Branches",
"start_index": 123,
"end_index": 146
}
]
},
{
"title": "Minutes of Federal Open Market Committee Meetings",
"start_index": 147,
"end_index": 147,
"child_nodes": [
{
"title": "Meeting Minutes",
"start_index": 147,
"end_index": 149
}
]
},
{
"title": "Federal Reserve System Audits",
"start_index": 149,
"end_index": 149,
"child_nodes": [
{
"title": "Office of Inspector General Activities",
"start_index": 149,
"end_index": 151
},
{
"title": "Government Accountability Office Reviews",
"start_index": 151,
"end_index": 153
}
]
},
{
"title": "Federal Reserve System Budgets",
"start_index": 153,
"end_index": 153,
"child_nodes": [
{
"title": "System Budgets Overview",
"start_index": 153,
"end_index": 157
},
{
"title": "Board of Governors Budgets",
"start_index": 157,
"end_index": 163
},
{
"title": "Federal Reserve Banks Budgets",
"start_index": 163,
"end_index": 169
},
{
"title": "Currency Budget",
"start_index": 169,
"end_index": 174
}
]
},
{
"title": "Record of Policy Actions of the Board of Governors",
"start_index": 175,
"end_index": 175,
"child_nodes": [
{
"title": "Rules and Regulations",
"start_index": 175,
"end_index": 176
},
{
"title": "Policy Statements and Other Actions",
"start_index": 177,
"end_index": 181
},
{
"title": "Discount Rates for Depository Institutions in 2023",
"start_index": 181,
"end_index": 183
},
{
"title": "The Board of Governors and the Government Performance and Results Act",
"start_index": 184,
"end_index": 184
}
]
},
{
"title": "Litigation",
"start_index": 185,
"end_index": 185,
"child_nodes": [
{
"title": "Pending",
"start_index": 185,
"end_index": 186
},
{
"title": "Resolved",
"start_index": 186,
"end_index": 186
}
]
},
{
"title": "Statistical Tables",
"start_index": 187,
"end_index": 187,
"child_nodes": [
{
"title": "Federal Reserve open market transactions, 2023",
"start_index": 187,
"end_index": 187,
"child_nodes": [
{
"title": "Federal Reserve open market transactions, 2023\u2014continued",
"start_index": 187,
"end_index": 188
}
]
},
{
"title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323",
"start_index": 189,
"end_index": 188,
"child_nodes": [
{
"title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323\u2014continued",
"start_index": 189,
"end_index": 190
}
]
},
{
"title": "Reserve requirements of depository institutions, December 31, 2023",
"start_index": 191,
"end_index": 191
},
{
"title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023",
"start_index": 192,
"end_index": 192
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023",
"start_index": 193,
"end_index": 194,
"child_nodes": [
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
"start_index": 194,
"end_index": 194
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
"start_index": 195,
"end_index": 196
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
"start_index": 196,
"end_index": 196
}
]
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983",
"start_index": 197,
"end_index": 198,
"child_nodes": [
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
"start_index": 199,
"end_index": 198
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
"start_index": 199,
"end_index": 198
},
{
"title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
"start_index": 199,
"end_index": 200
}
]
},
{
"title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022",
"start_index": 201,
"end_index": 201
},
{
"title": "Initial margin requirements under Regulations T, U, and X",
"start_index": 202,
"end_index": 203
},
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022",
"start_index": 203,
"end_index": 206,
"child_nodes": [
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
"start_index": 206,
"end_index": 206
},
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
"start_index": 206,
"end_index": 206
},
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
"start_index": 206,
"end_index": 206
},
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
"start_index": 206,
"end_index": 206
},
{
"title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
"start_index": 206,
"end_index": 209
}
]
},
{
"title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022",
"start_index": 209,
"end_index": 210
},
{
"title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023",
"start_index": 210,
"end_index": 211,
"child_nodes": [
{
"title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
"start_index": 211,
"end_index": 212
},
{
"title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
"start_index": 212,
"end_index": 212
},
{
"title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
"start_index": 212,
"end_index": 214
}
]
},
{
"title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023",
"start_index": 214,
"end_index": 214,
"child_nodes": [
{
"title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
"start_index": 214,
"end_index": 214
},
{
"title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
"start_index": 214,
"end_index": 217
},
{
"title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
"start_index": 217,
"end_index": 217
}
]
},
{
"title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323",
"start_index": 218,
"end_index": 218
},
{
"title": "Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023",
"start_index": 219,
"end_index": 220
},
{
"title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023",
"start_index": 220,
"end_index": 222
}
]
}
]

1558
results/PRML_structure.json Normal file

File diff suppressed because it is too large


@ -0,0 +1,51 @@
[
{
"title": "Preface",
"start_index": 1,
"end_index": 2
},
{
"title": "Introduction",
"start_index": 2,
"end_index": 6
},
{
"title": "Interpretation and Application",
"start_index": 6,
"end_index": 8,
"child_nodes": [
{
"title": "Historical Context and Legislative History",
"start_index": 8,
"end_index": 10
},
{
"title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion",
"start_index": 10,
"end_index": 14
},
{
"title": "Guidance on Applying the Interpretation of the Solely Incidental Prong",
"start_index": 14,
"end_index": 22
}
]
},
{
"title": "Economic Considerations",
"start_index": 22,
"end_index": 22,
"child_nodes": [
{
"title": "Background",
"start_index": 22,
"end_index": 23
},
{
"title": "Potential Economic Effects",
"start_index": 23,
"end_index": 28
}
]
}
]


@ -0,0 +1,466 @@
[
{
"title": "Preface",
"start_index": 1,
"end_index": 6
},
{
"title": "INTRODUCTION",
"start_index": 6,
"end_index": 12,
"child_nodes": [
{
"title": "Background",
"start_index": 12,
"end_index": 22,
"child_nodes": [
{
"title": "Evaluation of Standards of Conduct Applicable to Investment Advice",
"start_index": 22,
"end_index": 26
},
{
"title": "DOL Rulemaking",
"start_index": 26,
"end_index": 32
},
{
"title": "Statement by Chairman Clayton",
"start_index": 32,
"end_index": 36
}
]
},
{
"title": "General Objectives of Proposed Approach",
"start_index": 36,
"end_index": 44
}
]
},
{
"title": "DISCUSSION OF REGULATION BEST INTEREST",
"start_index": 44,
"end_index": 44,
"child_nodes": [
{
"title": "Overview of Regulation Best Interest",
"start_index": 44,
"end_index": 50
},
{
"title": "Best Interest, Generally",
"start_index": 50,
"end_index": 58,
"child_nodes": [
{
"title": "Consistency with Other Approaches",
"start_index": 58,
"end_index": 66
},
{
"title": "Request for Comment on the Best Interest Obligation",
"start_index": 66,
"end_index": 71
}
]
},
{
"title": "Key Terms and Scope of Best Interest Obligation",
"start_index": 71,
"end_index": 71,
"child_nodes": [
{
"title": "Natural Person who is an Associated Person",
"start_index": 71,
"end_index": 72
},
{
"title": "When Making a Recommendation, At Time Recommendation is Made",
"start_index": 72,
"end_index": 82
},
{
"title": "Any Securities Transaction or Investment Strategy",
"start_index": 82,
"end_index": 83
},
{
"title": "Retail Customer",
"start_index": 83,
"end_index": 90
},
{
"title": "Request for Comment on Key Terms and Scope of Best Interest Obligation",
"start_index": 90,
"end_index": 96
}
]
},
{
"title": "Components of Regulation Best Interest",
"start_index": 96,
"end_index": 97,
"child_nodes": [
{
"title": "Disclosure Obligation",
"start_index": 97,
"end_index": 133
},
{
"title": "Care Obligation",
"start_index": 133,
"end_index": 166
},
{
"title": "Conflict of Interest Obligations",
"start_index": 166,
"end_index": 196
}
]
},
{
"title": "Recordkeeping and Retention",
"start_index": 196,
"end_index": 199
},
{
"title": "Whether the Exercise of Investment Discretion Should be Viewed as Solely Incidental to the Business of a Broker or Dealer",
"start_index": 199,
"end_index": 209
}
]
},
{
"title": "REQUEST FOR COMMENT",
"start_index": 209,
"end_index": 210,
"child_nodes": [
{
"title": "Generally",
"start_index": 210,
"end_index": 212
},
{
"title": "Interactions with Other Standards of Conduct",
"start_index": 212,
"end_index": 214
}
]
},
{
"title": "ECONOMIC ANALYSIS",
"start_index": 214,
"end_index": 214,
"child_nodes": [
{
"title": "Introduction, Primary Goals of Proposed Regulations and Broad Economic Considerations",
"start_index": 214,
"end_index": 214,
"child_nodes": [
{
"title": "Introduction and Primary Goals of Proposed Regulation",
"start_index": 214,
"end_index": 215
},
{
"title": "Broad Economic Considerations",
"start_index": 215,
"end_index": 225
}
]
},
{
"title": "Economic Baseline",
"start_index": 225,
"end_index": 225,
"child_nodes": [
{
"title": "Market for Advice Services",
"start_index": 225,
"end_index": 246
},
{
"title": "Regulatory Baseline",
"start_index": 246,
"end_index": 255
}
]
},
{
"title": "Benefits, Costs, and Effects on Efficiency, Competition, and Capital Formation",
"start_index": 255,
"end_index": 258,
"child_nodes": [
{
"title": "Benefits",
"start_index": 258,
"end_index": 272
},
{
"title": "Costs",
"start_index": 272,
"end_index": 275,
"child_nodes": [
{
"title": "Standard of Conduct Defined as Best Interest",
"start_index": 275,
"end_index": 275,
"child_nodes": [
{
"title": "Operational Costs",
"start_index": 275,
"end_index": 277
},
{
"title": "Programmatic Costs",
"start_index": 278,
"end_index": 280
}
]
},
{
"title": "Disclosure Obligation",
"start_index": 280,
"end_index": 286
},
{
"title": "Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence in Making a Recommendation",
"start_index": 286,
"end_index": 290
},
{
"title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and at a Minimum Disclose, or Eliminate, All Material Conflicts of Interest Associated with a Recommendation",
"start_index": 290,
"end_index": 295,
"child_nodes": [
{
"title": "Eliminate Material Conflicts of Interest Associated with a Recommendation",
"start_index": 295,
"end_index": 297
},
{
"title": "At a Minimum Disclose Material Conflicts of Interest Associated with a Recommendation",
"start_index": 297,
"end_index": 299
}
]
},
{
"title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and Disclose and Mitigate, or Eliminate, Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
"start_index": 299,
"end_index": 300,
"child_nodes": [
{
"title": "Eliminate Material Conflicts Arising from Financial Incentives Associated with a Recommendation",
"start_index": 300,
"end_index": 304
},
{
"title": "Disclose and Mitigate Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
"start_index": 304,
"end_index": 316
}
]
}
]
}
]
},
{
"title": "Effects on Efficiency, Competition, and Capital Formation",
"start_index": 316,
"end_index": 324
},
{
"title": "Reasonable Alternatives",
"start_index": 324,
"end_index": 325,
"child_nodes": [
{
"title": "Disclosure-Only Alternative",
"start_index": 325,
"end_index": 327
},
{
"title": "Principles-Based Standard of Conduct Obligation",
"start_index": 327,
"end_index": 328
},
{
"title": "A Fiduciary Standard for Broker-Dealers",
"start_index": 328,
"end_index": 332
},
{
"title": "Enhanced Standards Akin to Conditions of the BIC Exemption",
"start_index": 332,
"end_index": 335
}
]
},
{
"title": "Request for Comment",
"start_index": 335,
"end_index": 338
}
]
},
{
"title": "PAPERWORK REDUCTION ACT ANALYSIS",
"start_index": 338,
"end_index": 340,
"child_nodes": [
{
"title": "Respondents Subject to Proposed Regulation Best Interest and Proposed Amendments to Rule 17a-3(a)(25), Rule 17a-4(e)(5)",
"start_index": 340,
"end_index": 340,
"child_nodes": [
{
"title": "Broker-Dealers",
"start_index": 340,
"end_index": 340
},
{
"title": "Natural Persons Who Are Associated Persons of Broker-Dealers",
"start_index": 340,
"end_index": 341
}
]
},
{
"title": "Summary of Collections of Information",
"start_index": 341,
"end_index": 342,
"child_nodes": [
{
"title": "Conflict of Interest Obligations",
"start_index": 342,
"end_index": 353
},
{
"title": "Disclosure Obligation",
"start_index": 353,
"end_index": 370
},
{
"title": "Care Obligation",
"start_index": 370,
"end_index": 370
},
{
"title": "Record-Making and Recordkeeping Obligations",
"start_index": 370,
"end_index": 375
}
]
},
{
"title": "Collection of Information is Mandatory",
"start_index": 375,
"end_index": 375
},
{
"title": "Confidentiality",
"start_index": 375,
"end_index": 376
},
{
"title": "Request for Comment",
"start_index": 376,
"end_index": 377
}
]
},
{
"title": "SMALL BUSINESS REGULATORY ENFORCEMENT FAIRNESS ACT",
"start_index": 377,
"end_index": 378
},
{
"title": "INITIAL REGULATORY FLEXIBILITY ACT ANALYSIS",
"start_index": 378,
"end_index": 379,
"child_nodes": [
{
"title": "Reasons for and Objectives of the Proposed Action",
"start_index": 379,
"end_index": 381
},
{
"title": "Legal Basis",
"start_index": 381,
"end_index": 381
},
{
"title": "Small Entities Subject to the Proposed Rule",
"start_index": 381,
"end_index": 382
},
{
"title": "Projected Compliance Requirements of the Proposed Rule for Small Entities",
"start_index": 382,
"end_index": 383,
"child_nodes": [
{
"title": "Conflict of Interest Obligations",
"start_index": 383,
"end_index": 386
},
{
"title": "Disclosure Obligations",
"start_index": 387,
"end_index": 394
},
{
"title": "Obligation to Exercise Reasonable Diligence, Care, Skill and Prudence",
"start_index": 394,
"end_index": 394
},
{
"title": "Record-Making and Recordkeeping Obligations",
"start_index": 394,
"end_index": 397
}
]
},
{
"title": "Duplicative, Overlapping, or Conflicting Federal Rules",
"start_index": 397,
"end_index": 398
},
{
"title": "Significant Alternatives",
"start_index": 398,
"end_index": 401,
"child_nodes": [
{
"title": "Disclosure-Only Alternative",
"start_index": 401,
"end_index": 401
},
{
"title": "Principles-Based Alternative",
"start_index": 401,
"end_index": 402
},
{
"title": "Enhanced Standards Akin to BIC Exemption",
"start_index": 402,
"end_index": 403
}
]
},
{
"title": "General Request for Comment",
"start_index": 403,
"end_index": 403
}
]
},
{
"title": "STATUTORY AUTHORITY AND TEXT OF PROPOSED RULE",
"start_index": 403,
"end_index": 408
}
]


@ -0,0 +1,220 @@
[
{
"title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025",
"start_index": 1,
"end_index": 1,
"child_nodes": [
{
"title": "Financial Results for the Quarter",
"start_index": 1,
"end_index": 1,
"child_nodes": [
{
"title": "Key Points",
"start_index": 1,
"end_index": 1
}
]
},
{
"title": "Guidance and Outlook",
"start_index": 2,
"end_index": 2,
"child_nodes": [
{
"title": "Star India deconsolidated in Q1",
"start_index": 2,
"end_index": 2
},
{
"title": "Q2 Fiscal 2025",
"start_index": 2,
"end_index": 2
},
{
"title": "Fiscal Year 2025",
"start_index": 2,
"end_index": 2
}
]
},
{
"title": "Message From Our CEO",
"start_index": 2,
"end_index": 2
},
{
"title": "SUMMARIZED FINANCIAL RESULTS",
"start_index": 3,
"end_index": 3,
"child_nodes": [
{
"title": "SUMMARIZED SEGMENT FINANCIAL RESULTS",
"start_index": 3,
"end_index": 3
}
]
},
{
"title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS",
"start_index": 4,
"end_index": 4,
"child_nodes": [
{
"title": "Star India",
"start_index": 4,
"end_index": 4
},
{
"title": "Entertainment",
"start_index": 4,
"end_index": 4,
"child_nodes": [
{
"title": "Linear Networks",
"start_index": 5,
"end_index": 5
},
{
"title": "Direct-to-Consumer",
"start_index": 5,
"end_index": 7
},
{
"title": "Content Sales/Licensing and Other",
"start_index": 7,
"end_index": 7
}
]
},
{
"title": "Sports",
"start_index": 7,
"end_index": 7,
"child_nodes": [
{
"title": "Domestic ESPN",
"start_index": 8,
"end_index": 8
},
{
"title": "International ESPN",
"start_index": 8,
"end_index": 8
},
{
"title": "Star India",
"start_index": 8,
"end_index": 8
}
]
},
{
"title": "Experiences",
"start_index": 9,
"end_index": 9,
"child_nodes": [
{
"title": "Domestic Parks and Experiences",
"start_index": 9,
"end_index": 9
},
{
"title": "International Parks and Experiences",
"start_index": 9,
"end_index": 9
}
]
}
]
},
{
"title": "OTHER FINANCIAL INFORMATION",
"start_index": 9,
"end_index": 9,
"child_nodes": [
{
"title": "Corporate and Unallocated Shared Expenses",
"start_index": 9,
"end_index": 9
},
{
"title": "Restructuring and Impairment Charges",
"start_index": 9,
"end_index": 9
},
{
"title": "Interest Expense, net",
"start_index": 10,
"end_index": 10
},
{
"title": "Equity in the Income of Investees",
"start_index": 10,
"end_index": 10
},
{
"title": "Income Taxes",
"start_index": 10,
"end_index": 10
},
{
"title": "Noncontrolling Interests",
"start_index": 11,
"end_index": 11
},
{
"title": "Cash from Operations",
"start_index": 11,
"end_index": 11
},
{
"title": "Capital Expenditures",
"start_index": 12,
"end_index": 12
},
{
"title": "Depreciation Expense",
"start_index": 12,
"end_index": 12
}
]
},
{
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME",
"start_index": 13,
"end_index": 13
},
{
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS",
"start_index": 14,
"end_index": 14
},
{
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS",
"start_index": 15,
"end_index": 15
},
{
"title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS",
"start_index": 16,
"end_index": 16
},
{
"title": "NON-GAAP FINANCIAL MEASURES",
"start_index": 17,
"end_index": 20
},
{
"title": "FORWARD-LOOKING STATEMENTS",
"start_index": 21,
"end_index": 21
},
{
"title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION",
"start_index": 22,
"end_index": 22
}
]
}
]

524
utils.py Normal file

@ -0,0 +1,524 @@
import tiktoken
import openai
import logging
import os
from datetime import datetime
import time
import json
import PyPDF2
import copy
import asyncio
import pymupdf
from io import BytesIO
def count_tokens(text, model):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)
def ChatGPT_API_with_finish_reason(model, prompt, api_key, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            if chat_history:
                # Note: this appends to the caller's chat_history list in place
                messages = chat_history
                messages.append({"role": "user", "content": prompt})
            else:
                messages = [{"role": "user", "content": prompt}]
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            if response.choices[0].finish_reason == "length":
                return response.choices[0].message.content, "max_output_reached"
            else:
                return response.choices[0].message.content, "finished"
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                # Note: a bare string here, not a (content, status) tuple
                return "Error"
def ChatGPT_API(model, prompt, api_key, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            if chat_history:
                # Note: this appends to the caller's chat_history list in place
                messages = chat_history
                messages.append({"role": "user", "content": prompt})
            else:
                messages = [{"role": "user", "content": prompt}]
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"
async def ChatGPT_API_async(model, prompt, api_key):
    max_retries = 10
    client = openai.AsyncOpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            messages = [{"role": "user", "content": prompt}]
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                await asyncio.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"
def get_json_content(response):
    start_idx = response.find("```json")
    if start_idx != -1:
        start_idx += 7
        response = response[start_idx:]
    end_idx = response.rfind("```")
    if end_idx != -1:
        response = response[:end_idx]
    json_content = response.strip()
    return json_content
def extract_json(content):
    try:
        # First, try to extract JSON enclosed within ```json and ```
        start_idx = content.find("```json")
        if start_idx != -1:
            start_idx += 7  # Adjust index to start after the delimiter
            end_idx = content.rfind("```")
            if end_idx < start_idx:
                # No closing delimiter (e.g. truncated output): take the rest
                end_idx = len(content)
            json_content = content[start_idx:end_idx].strip()
        else:
            # If no delimiters, assume entire content could be JSON
            json_content = content.strip()
        # Clean up common issues that might cause parsing errors.
        # Note: this blind replacement also rewrites 'None' inside string values.
        json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null
        json_content = ' '.join(json_content.split())  # Collapse newlines and repeated whitespace
        # Attempt to parse and return the JSON object
        return json.loads(json_content)
    except json.JSONDecodeError as e:
        logging.error(f"Failed to extract JSON: {e}")
        # Try to clean up the content further if initial parsing fails
        try:
            # Remove trailing commas before closing brackets/braces,
            # including commas followed by whitespace
            json_content = re.sub(r',\s*([\]}])', r'\1', json_content)
            return json.loads(json_content)
        except json.JSONDecodeError:
            logging.error("Failed to parse JSON even after cleanup")
            return {}
    except Exception as e:
        logging.error(f"Unexpected error while extracting JSON: {e}")
        return {}
def write_node_id(data, node_id=0):
    if isinstance(data, dict):
        data['node_id'] = str(node_id).zfill(4)
        node_id += 1
        for key in list(data.keys()):
            if 'child_nodes' in key:
                node_id = write_node_id(data[key], node_id)
    elif isinstance(data, list):
        for index in range(len(data)):
            node_id = write_node_id(data[index], node_id)
    return node_id

def get_nodes(structure):
    if isinstance(structure, dict):
        structure_node = copy.deepcopy(structure)
        structure_node.pop('child_nodes', None)
        nodes = [structure_node]
        for key in list(structure.keys()):
            if 'child_nodes' in key:
                nodes.extend(get_nodes(structure[key]))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(get_nodes(item))
        return nodes

def structure_to_list(structure):
    if isinstance(structure, dict):
        nodes = []
        nodes.append(structure)
        if 'child_nodes' in structure:
            nodes.extend(structure_to_list(structure['child_nodes']))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(structure_to_list(item))
        return nodes
def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        # Use .get(): list_to_tree deletes empty 'child_nodes' keys, so a leaf
        # may not have the key at all
        if not structure.get('child_nodes'):
            structure_node = copy.deepcopy(structure)
            structure_node.pop('child_nodes', None)
            return [structure_node]
        else:
            leaf_nodes = []
            for key in list(structure.keys()):
                if 'child_nodes' in key:
                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
            return leaf_nodes
    elif isinstance(structure, list):
        leaf_nodes = []
        for item in structure:
            leaf_nodes.extend(get_leaf_nodes(item))
        return leaf_nodes
def is_leaf_node(data, node_id):
    # Helper function to find the node by its node_id
    def find_node(data, node_id):
        if isinstance(data, dict):
            if data.get('node_id') == node_id:
                return data
            for key in data.keys():
                if 'child_nodes' in key:
                    result = find_node(data[key], node_id)
                    if result:
                        return result
        elif isinstance(data, list):
            for item in data:
                result = find_node(item, node_id)
                if result:
                    return result
        return None

    # Find the node with the given node_id
    node = find_node(data, node_id)
    # Check if the node is a leaf node
    if node and not node.get('child_nodes'):
        return True
    return False

def get_last_node(structure):
    return structure[-1]
def extract_text_from_pdf(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    # Return the text as a single string, not a list of pages
    text = ""
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()
    return text

def get_pdf_title(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    meta = pdf_reader.metadata
    # metadata is None when the PDF has no document information dictionary
    title = meta.title if meta else None
    return title
def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    text = ""
    for page_num in range(start_page-1, end_page):
        page = pdf_reader.pages[page_num]
        page_text = page.extract_text()
        if tag:
            text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n"
        else:
            text += page_text
    return text

def get_first_start_page_from_text(text):
    start_page = -1
    start_page_match = re.search(r'<start_index_(\d+)>', text)
    if start_page_match:
        start_page = int(start_page_match.group(1))
    return start_page

def get_last_start_page_from_text(text):
    start_page = -1
    # Find all matches of start_index tags
    start_page_matches = re.finditer(r'<start_index_(\d+)>', text)
    # Convert iterator to list and get the last match if any exist
    matches_list = list(start_page_matches)
    if matches_list:
        start_page = int(matches_list[-1].group(1))
    return start_page
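# A minimal sketch, assuming the <start_index_N> tag format written by
# get_text_of_pages above. `get_all_start_pages` is a hypothetical helper (not
# in the original module) that returns every tagged page number in order; the
# two functions above return the first and the last of these.
import re

def get_all_start_pages(text):
    return [int(n) for n in re.findall(r'<start_index_(\d+)>', text)]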
def sanitize_filename(filename, replacement='-'):
    # In Linux, only '/' and '\0' (null) are invalid in filenames.
    # Null can't be represented in strings, so we only handle '/'.
    return filename.replace('/', replacement)

class JsonLogger:
    def __init__(self, file_path):
        # Derive the log name from the PDF file name, or from the PDF title
        # when the document is passed as an in-memory stream
        if isinstance(file_path, str):
            pdf_name = os.path.splitext(os.path.basename(file_path))[0]
        elif isinstance(file_path, BytesIO):
            pdf_reader = PyPDF2.PdfReader(file_path)
            meta = pdf_reader.metadata
            pdf_name = meta.title if meta and meta.title else 'Untitled'
        else:
            # Fall back to a generic name for any other input type
            pdf_name = 'Untitled'
        pdf_name = sanitize_filename(pdf_name)
        current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.filename = f"{pdf_name}_{current_time}.json"
        os.makedirs("./logs", exist_ok=True)
        # Initialize empty list to store all messages
        self.log_data = []

    def log(self, level, message, **kwargs):
        # Add the new message to the log data (level and kwargs are not recorded)
        if isinstance(message, dict):
            self.log_data.append(message)
        else:
            self.log_data.append({'message': message})
        # Write entire log data to file
        with open(self._filepath(), "w") as f:
            json.dump(self.log_data, f, indent=2)

    def info(self, message, **kwargs):
        self.log("INFO", message, **kwargs)

    def error(self, message, **kwargs):
        self.log("ERROR", message, **kwargs)

    def debug(self, message, **kwargs):
        self.log("DEBUG", message, **kwargs)

    def exception(self, message, **kwargs):
        kwargs["exception"] = True
        self.log("ERROR", message, **kwargs)

    def _filepath(self):
        return os.path.join("logs", self.filename)
def list_to_tree(data):
    def get_parent_structure(structure):
        """Helper function to get the parent structure code"""
        if not structure:
            return None
        parts = str(structure).split('.')
        return '.'.join(parts[:-1]) if len(parts) > 1 else None

    # First pass: Create nodes and track parent-child relationships
    nodes = {}
    root_nodes = []
    for item in data:
        structure = item.get('structure')
        node = {
            'title': item.get('title'),
            'start_index': item.get('start_index'),
            'end_index': item.get('end_index'),
            'child_nodes': []
        }
        nodes[structure] = node
        # Find parent
        parent_structure = get_parent_structure(structure)
        if parent_structure:
            # Add as child to parent if parent exists
            if parent_structure in nodes:
                nodes[parent_structure]['child_nodes'].append(node)
            else:
                root_nodes.append(node)
        else:
            # No parent, this is a root node
            root_nodes.append(node)

    # Helper function to remove empty child_nodes lists
    def clean_node(node):
        if not node['child_nodes']:
            del node['child_nodes']
        else:
            for child in node['child_nodes']:
                clean_node(child)
        return node

    # Clean and return the tree
    return [clean_node(node) for node in root_nodes]

def add_preface_if_needed(data):
    if not isinstance(data, list) or not data:
        return data
    if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
        preface_node = {
            "structure": "0",
            "title": "Preface",
            "physical_index": 1,
        }
        data.insert(0, preface_node)
    return data
def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
    if pdf_parser == "PyPDF2":
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        page_texts = [page.extract_text() for page in pdf_reader.pages]
    elif pdf_parser == "PyMuPDF":
        # A PyMuPDF document is iterated directly and its pages expose
        # get_text(); it has no PyPDF2-style .pages list or extract_text()
        doc = pymupdf.open(pdf_path)
        page_texts = [page.get_text() for page in doc]
    else:
        raise ValueError(f"Unsupported PDF parser: {pdf_parser}")
    enc = tiktoken.encoding_for_model(model)
    # One (page_text, token_length) tuple per page
    page_list = [(page_text, len(enc.encode(page_text))) for page_text in page_texts]
    return page_list
def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
    text = ""
    for page_num in range(start_page-1, end_page):
        text += pdf_pages[page_num]
    return text

def get_number_of_pages(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    num = len(pdf_reader.pages)
    return num
def post_processing(structure, end_physical_index):
    # First derive start_index and end_index for every item in the flat list
    for i, item in enumerate(structure):
        item['start_index'] = item.get('physical_index')
        if i < len(structure) - 1:
            if structure[i + 1].get('appear_start') == 'yes':
                # Next item is flagged appear_start: end on the page before it
                item['end_index'] = structure[i + 1]['physical_index'] - 1
            else:
                item['end_index'] = structure[i + 1]['physical_index']
        else:
            item['end_index'] = end_physical_index
    tree = list_to_tree(structure)
    if len(tree) != 0:
        return tree
    else:
        # No tree could be built: return the flat list without the helper fields
        for node in structure:
            node.pop('appear_start', None)
            node.pop('physical_index', None)
        return structure
def clean_structure_post(data):
    if isinstance(data, dict):
        data.pop('page_number', None)
        data.pop('start_index', None)
        data.pop('end_index', None)
        if 'child_nodes' in data:
            clean_structure_post(data['child_nodes'])
    elif isinstance(data, list):
        for section in data:
            clean_structure_post(section)
    return data

def remove_structure_text(data):
    if isinstance(data, dict):
        data.pop('text', None)
        if 'child_nodes' in data:
            remove_structure_text(data['child_nodes'])
    elif isinstance(data, list):
        for item in data:
            remove_structure_text(item)
    return data
def check_token_limit(structure, limit=110000):
    # Named `nodes` to avoid shadowing the built-in `list`
    nodes = structure_to_list(structure)
    for node in nodes:
        num_tokens = count_tokens(node['text'], model='gpt-4o')
        if num_tokens > limit:
            print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
            print("Start Index:", node['start_index'])
            print("End Index:", node['end_index'])
            print("Title:", node['title'])
            # print(node['text'])
            print("\n")
def convert_physical_index_to_int(data):
    if isinstance(data, list):
        for i in range(len(data)):
            if isinstance(data[i]['physical_index'], str):
                if data[i]['physical_index'].startswith('<physical_index_'):
                    data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
                elif data[i]['physical_index'].startswith('physical_index_'):
                    data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
    elif isinstance(data, str):
        if data.startswith('<physical_index_'):
            data = int(data.split('_')[-1].rstrip('>').strip())
        elif data.startswith('physical_index_'):
            data = int(data.split('_')[-1].strip())
        # Check that data is now an int
        if isinstance(data, int):
            return data
        else:
            return None
    return data
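# A minimal sketch, assuming tags of the form '<physical_index_N>' or
# 'physical_index_N' as handled above. `parse_physical_index` is a hypothetical
# helper (not in the original module) that converts one string with a single
# regular expression. Unlike the code above, it returns None instead of raising
# when the number part is malformed.
import re

def parse_physical_index(tag):
    match = re.fullmatch(r'<?physical_index_(\d+)>?', tag.strip())
    return int(match.group(1)) if match else None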
def convert_page_to_int(data):
    for item in data:
        if 'page' in item and isinstance(item['page'], str):
            try:
                item['page'] = int(item['page'])
            except ValueError:
                # Keep original value if conversion fails
                pass
    return data