mirror of https://github.com/VectifyAI/PageIndex.git
synced 2026-04-24 23:56:21 +02:00
first commit
commit 6f43b477d3
17 changed files with 4529 additions and 0 deletions
15
.gitignore
vendored
Normal file
@@ -0,0 +1,15 @@
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*
21
LICENSE
Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Vectify AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
136
README.md
Normal file
@@ -0,0 +1,136 @@
# PageIndex

### **Document Index System for Reasoning-Based RAG**

Traditional vector-based retrieval relies heavily on semantic similarity. But when working with professional documents that require domain expertise and multi-step reasoning, similarity search often falls short.

**Reasoning-Based RAG** offers a better alternative: it lets LLMs *think* and *reason* their way to the most relevant document sections. Inspired by **AlphaGo**, we leverage **tree search** to perform structured document retrieval.

**PageIndex** is an indexing system that builds search trees from long documents, making them ready for reasoning-based RAG.

Built by [Vectify AI](https://vectify.ai/pageindex)

---

## 🔍 What is PageIndex?

**PageIndex** transforms lengthy PDF documents into a semantic **tree structure**, similar to a "table of contents" but optimized for use with Large Language Models (LLMs).
It’s ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any other document that exceeds LLM context limits.

### ✅ Key Features

- **Scales to Massive Documents**
  Designed to handle hundreds or even thousands of pages with ease.

- **Hierarchical Tree Structure**
  Enables LLMs to traverse documents logically, like an intelligent, LLM-optimized table of contents.

- **Precise Page Referencing**
  Every node contains its own summary and the physical page indices of its start and end pages, allowing pinpoint retrieval.

- **Chunk-Free Segmentation**
  No arbitrary chunking. Nodes follow the natural structure of the document.

---

## 📦 PageIndex Format

Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).

```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "child_nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
```
Note: generation of the `node_id` and `summary` fields is not implemented yet and will be added soon.
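
As a rough illustration of how a tree in this format can be consumed, the sketch below walks the example node depth-first and prints an indented outline with page ranges. The helper names `walk` and `outline` are hypothetical and are not part of `page_index.py`; the only assumption about the data is the field layout shown above, with `child_nodes` omitted on leaf nodes.

```python
from typing import Iterator

# Sample tree copied from the example above (summaries omitted for brevity).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    # `child_nodes` is optional: leaf nodes simply omit the key.
    for child in node.get("child_nodes", []):
        yield from walk(child, depth + 1)


def outline(node: dict) -> list[str]:
    """Render one line per node: indented title plus its page range."""
    return [
        f"{'  ' * depth}{n['title']} (pp. {n['start_index']}-{n['end_index']})"
        for depth, n in walk(node)
    ]


print("\n".join(outline(tree)))
```

A pre-order walk keeps nodes in reading order, which is usually what you want when handing the outline to an LLM or a human reviewer.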

## 🧠 Reasoning-Based RAG with PageIndex

Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. This works well for domain-specific tasks where nuance matters.

### 🛠️ Example Prompt

```python
# Assumes `question` (str) and `structure` (the PageIndex tree, e.g. as a JSON string)
# are already defined.
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
    "thinking": <reasoning about where to look>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```
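
Once the model replies, the returned node ids still have to be mapped back to page ranges. The sketch below shows one way to do that, under several assumptions that the README does not confirm: the reply is valid JSON, `node_list` holds string ids matching `node_id`, and ids that are not in the tree should be skipped. `index_by_id` and `resolve_reply` are hypothetical helper names, and the reply text is made up for illustration.

```python
import json


def index_by_id(nodes: list[dict]) -> dict[str, dict]:
    """Map node_id -> node for every node in the tree (iterative DFS)."""
    index, stack = {}, list(nodes)
    while stack:
        node = stack.pop()
        if "node_id" in node:
            index[node["node_id"]] = node
        stack.extend(node.get("child_nodes", []))
    return index


def resolve_reply(reply_text: str, nodes: list[dict]) -> list[tuple[str, int, int]]:
    """Turn the model's JSON reply into (title, start_index, end_index) tuples.

    Ids that do not exist in the tree are skipped instead of raising,
    since a model may return ids it invented.
    """
    reply = json.loads(reply_text)
    index = index_by_id(nodes)
    return [
        (index[i]["title"], index[i]["start_index"], index[i]["end_index"])
        for i in reply.get("node_list", [])
        if i in index
    ]


# Hypothetical reply; a real response may need cleanup before json.loads.
reply_text = '{"thinking": "Look at vulnerability monitoring.", "node_list": ["0007", "9999"]}'
nodes = [{
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
    ],
}]
pages = resolve_reply(reply_text, nodes)
print(pages)
```

The resulting page ranges can then be used to pull the matching pages from the PDF and pass them to the model as context.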

## 🚀 Usage

Follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```
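
A small preflight check can catch a missing or placeholder key before any API call is made. This is a sketch only: it assumes the `.env` values have already been loaded into the process environment (for example by python-dotenv, which `requirements.txt` pins), and `require_api_key` is a hypothetical helper, not a function from `page_index.py`.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "CHATGPT_API_KEY") -> str:
    """Return the API key from `env`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    # The README's placeholder value is treated as "not set".
    if not key or key == "your_openai_key_here":
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key


# Typical use at program start (commented out so the sketch has no side effects):
# api_key = require_api_key()
```

Passing the environment as a parameter keeps the check easy to test without touching real credentials.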

### 3. Run PageIndex on your PDF

```bash
python3 page_index.py --pdf_path /path/to/your/document.pdf
```

The results will be saved in the `./results/` directory.
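
Before feeding a generated tree into retrieval, it can help to sanity-check the page ranges. The sketch below flags nodes whose `end_index` is smaller than their `start_index`. It assumes a result file is a JSON list of top-level nodes, as in the files under `results/` in this commit; `load_tree` and `inverted_ranges` are hypothetical helpers, and the sample data is a shortened fragment shaped like `results/2023-annual-report_structure.json`. Whether such ranges are errors or an intended convention is not stated in the README.

```python
import json
from pathlib import Path


def load_tree(path: str) -> list[dict]:
    """Read a generated structure file (a JSON list of top-level nodes)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def inverted_ranges(nodes: list[dict], trail: tuple[str, ...] = ()) -> list[str]:
    """Return 'A > B > C' paths of nodes whose end_index < start_index."""
    found = []
    for node in nodes:
        path = trail + (node["title"],)
        if node["end_index"] < node["start_index"]:
            found.append(" > ".join(path))
        # Recurse into children; leaf nodes have no `child_nodes` key.
        found.extend(inverted_ranges(node.get("child_nodes", []), path))
    return found


# In-memory fragment (titles shortened for illustration).
sample = [{
    "title": "Statistical Tables", "start_index": 187, "end_index": 187,
    "child_nodes": [
        {"title": "Open market transactions", "start_index": 187, "end_index": 187},
        {"title": "Bank holdings", "start_index": 189, "end_index": 188},
    ],
}]
print(inverted_ranges(sample))
```

To check a real output file, pass the result of `load_tree("results/2023-annual-report_structure.json")` to `inverted_ranges` instead of the sample.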

## 🛤 Roadmap

- [ ] Add node summary and document selection
- [ ] Technical report on PageIndex design
- [ ] Efficient tree search algorithms for large documents
- [ ] Integration with vector-based semantic retrieval

## 📈 Case Study: Mafin 2.5

[Mafin 2.5](https://vectify.ai/blog/Mafin2.5) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of **PageIndex**, it achieved **98.7% accuracy** on the [FinanceBench](https://github.com/VectifyAI/Mafin2.5-FinanceBench) benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex’s hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 See full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) for detailed comparisons and performance metrics.

## 📬 Contact Us

Need customized support for your documents or reasoning-based RAG system?

👉 [Contact us here](https://ii2abc2jejf.typeform.com/to/meB40zV0)
0
__init__.py
Normal file
BIN
docs/2023-annual-report.pdf
Normal file
Binary file not shown.
BIN
docs/PRML.pdf
Normal file
Binary file not shown.
BIN
docs/Regulation Best Interest_Interpretive release.pdf
Normal file
Binary file not shown.
BIN
docs/Regulation Best Interest_proposed rule.pdf
Normal file
Binary file not shown.
BIN
docs/q1-fy25-earnings.pdf
Normal file
Binary file not shown.
1073
page_index.py
Normal file
File diff suppressed because it is too large
5
requirements.txt
Normal file
@@ -0,0 +1,5 @@
openai==1.70.0
pymupdf==1.25.5
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.7.0
460
results/2023-annual-report_structure.json
Normal file
@@ -0,0 +1,460 @@
[
  {
    "title": "Preface",
    "start_index": 1,
    "end_index": 4
  },
  {
    "title": "About the Federal Reserve",
    "start_index": 5,
    "end_index": 7
  },
  {
    "title": "Overview",
    "start_index": 7,
    "end_index": 8
  },
  {
    "title": "Monetary Policy and Economic Developments",
    "start_index": 9,
    "end_index": 9,
    "child_nodes": [
      {
        "title": "March 2024 Summary",
        "start_index": 9,
        "end_index": 14
      },
      {
        "title": "June 2023 Summary",
        "start_index": 15,
        "end_index": 20
      }
    ]
  },
  {
    "title": "Financial Stability",
    "start_index": 21,
    "end_index": 21,
    "child_nodes": [
      {
        "title": "Monitoring Financial Vulnerabilities",
        "start_index": 22,
        "end_index": 28
      },
      {
        "title": "Domestic and International Cooperation and Coordination",
        "start_index": 28,
        "end_index": 31
      }
    ]
  },
  {
    "title": "Supervision and Regulation",
    "start_index": 31,
    "end_index": 31,
    "child_nodes": [
      {
        "title": "Supervised and Regulated Institutions",
        "start_index": 32,
        "end_index": 35
      },
      {
        "title": "Supervisory Developments",
        "start_index": 35,
        "end_index": 54
      },
      {
        "title": "Regulatory Developments",
        "start_index": 55,
        "end_index": 59
      }
    ]
  },
  {
    "title": "Payment System and Reserve Bank Oversight",
    "start_index": 59,
    "end_index": 59,
    "child_nodes": [
      {
        "title": "Payment Services to Depository and Other Institutions",
        "start_index": 60,
        "end_index": 65
      },
      {
        "title": "Currency and Coin",
        "start_index": 66,
        "end_index": 68
      },
      {
        "title": "Fiscal Agency and Government Depository Services",
        "start_index": 69,
        "end_index": 72
      },
      {
        "title": "Evolutions and Improvements to the System",
        "start_index": 72,
        "end_index": 75
      },
      {
        "title": "Oversight of Federal Reserve Banks",
        "start_index": 75,
        "end_index": 81
      },
      {
        "title": "Pro Forma Financial Statements for Federal Reserve Priced Services",
        "start_index": 82,
        "end_index": 88
      }
    ]
  },
  {
    "title": "Consumer and Community Affairs",
    "start_index": 89,
    "end_index": 89,
    "child_nodes": [
      {
        "title": "Consumer Compliance Supervision",
        "start_index": 89,
        "end_index": 101
      },
      {
        "title": "Consumer Laws and Regulations",
        "start_index": 101,
        "end_index": 102
      },
      {
        "title": "Consumer Research and Analysis of Emerging Issues and Policy",
        "start_index": 102,
        "end_index": 105
      },
      {
        "title": "Community Development",
        "start_index": 105,
        "end_index": 106
      }
    ]
  },
  {
    "title": "Appendixes",
    "start_index": 107,
    "end_index": 108
  },
  {
    "title": "Federal Reserve System Organization",
    "start_index": 109,
    "end_index": 109,
    "child_nodes": [
      {
        "title": "Board of Governors",
        "start_index": 109,
        "end_index": 116
      },
      {
        "title": "Federal Open Market Committee",
        "start_index": 117,
        "end_index": 118
      },
      {
        "title": "Board of Governors Advisory Councils",
        "start_index": 119,
        "end_index": 122
      },
      {
        "title": "Federal Reserve Banks and Branches",
        "start_index": 123,
        "end_index": 146
      }
    ]
  },
  {
    "title": "Minutes of Federal Open Market Committee Meetings",
    "start_index": 147,
    "end_index": 147,
    "child_nodes": [
      {
        "title": "Meeting Minutes",
        "start_index": 147,
        "end_index": 149
      }
    ]
  },
  {
    "title": "Federal Reserve System Audits",
    "start_index": 149,
    "end_index": 149,
    "child_nodes": [
      {
        "title": "Office of Inspector General Activities",
        "start_index": 149,
        "end_index": 151
      },
      {
        "title": "Government Accountability Office Reviews",
        "start_index": 151,
        "end_index": 153
      }
    ]
  },
  {
    "title": "Federal Reserve System Budgets",
    "start_index": 153,
    "end_index": 153,
    "child_nodes": [
      {
        "title": "System Budgets Overview",
        "start_index": 153,
        "end_index": 157
      },
      {
        "title": "Board of Governors Budgets",
        "start_index": 157,
        "end_index": 163
      },
      {
        "title": "Federal Reserve Banks Budgets",
        "start_index": 163,
        "end_index": 169
      },
      {
        "title": "Currency Budget",
        "start_index": 169,
        "end_index": 174
      }
    ]
  },
  {
    "title": "Record of Policy Actions of the Board of Governors",
    "start_index": 175,
    "end_index": 175,
    "child_nodes": [
      {
        "title": "Rules and Regulations",
        "start_index": 175,
        "end_index": 176
      },
      {
        "title": "Policy Statements and Other Actions",
        "start_index": 177,
        "end_index": 181
      },
      {
        "title": "Discount Rates for Depository Institutions in 2023",
        "start_index": 181,
        "end_index": 183
      },
      {
        "title": "The Board of Governors and the Government Performance and Results Act",
        "start_index": 184,
        "end_index": 184
      }
    ]
  },
  {
    "title": "Litigation",
    "start_index": 185,
    "end_index": 185,
    "child_nodes": [
      {
        "title": "Pending",
        "start_index": 185,
        "end_index": 186
      },
      {
        "title": "Resolved",
        "start_index": 186,
        "end_index": 186
      }
    ]
  },
  {
    "title": "Statistical Tables",
    "start_index": 187,
    "end_index": 187,
    "child_nodes": [
      {
        "title": "Federal Reserve open market transactions, 2023",
        "start_index": 187,
        "end_index": 187,
        "child_nodes": [
          {
            "title": "Federal Reserve open market transactions, 2023\u2014continued",
            "start_index": 187,
            "end_index": 188
          }
        ]
      },
      {
        "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323",
        "start_index": 189,
        "end_index": 188,
        "child_nodes": [
          {
            "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323\u2014continued",
            "start_index": 189,
            "end_index": 190
          }
        ]
      },
      {
        "title": "Reserve requirements of depository institutions, December 31, 2023",
        "start_index": 191,
        "end_index": 191
      },
      {
        "title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023",
        "start_index": 192,
        "end_index": 192
      },
      {
        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023",
        "start_index": 193,
        "end_index": 194,
        "child_nodes": [
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
            "start_index": 194,
            "end_index": 194
          },
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
            "start_index": 195,
            "end_index": 196
          },
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
            "start_index": 196,
            "end_index": 196
          }
        ]
      },
      {
        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983",
        "start_index": 197,
        "end_index": 198,
        "child_nodes": [
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
            "start_index": 199,
            "end_index": 198
          },
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
            "start_index": 199,
            "end_index": 198
          },
          {
            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
            "start_index": 199,
            "end_index": 200
          }
        ]
      },
      {
        "title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022",
        "start_index": 201,
        "end_index": 201
      },
      {
        "title": "Initial margin requirements under Regulations T, U, and X",
        "start_index": 202,
        "end_index": 203
      },
      {
        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022",
        "start_index": 203,
        "end_index": 206,
        "child_nodes": [
          {
            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
            "start_index": 206,
            "end_index": 206
          },
          {
            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
            "start_index": 206,
            "end_index": 206
          },
          {
            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
            "start_index": 206,
            "end_index": 206
          },
          {
            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
            "start_index": 206,
            "end_index": 206
          },
          {
            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
            "start_index": 206,
            "end_index": 209
          }
        ]
      },
      {
        "title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022",
        "start_index": 209,
        "end_index": 210
      },
      {
        "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023",
        "start_index": 210,
        "end_index": 211,
        "child_nodes": [
          {
            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
            "start_index": 211,
            "end_index": 212
          },
          {
            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
            "start_index": 212,
            "end_index": 212
          },
          {
            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
            "start_index": 212,
            "end_index": 214
          }
        ]
      },
      {
        "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023",
        "start_index": 214,
        "end_index": 214,
        "child_nodes": [
          {
            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
            "start_index": 214,
            "end_index": 214
          },
          {
            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
            "start_index": 214,
            "end_index": 217
          },
          {
            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
            "start_index": 217,
            "end_index": 217
          }
        ]
      },
      {
        "title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323",
        "start_index": 218,
        "end_index": 218
      },
      {
        "title": "Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023",
        "start_index": 219,
        "end_index": 220
      },
      {
        "title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023",
        "start_index": 220,
        "end_index": 222
      }
    ]
  }
]
1558
results/PRML_structure.json
Normal file
File diff suppressed because it is too large
@@ -0,0 +1,51 @@
[
  {
    "title": "Preface",
    "start_index": 1,
    "end_index": 2
  },
  {
    "title": "Introduction",
    "start_index": 2,
    "end_index": 6
  },
  {
    "title": "Interpretation and Application",
    "start_index": 6,
    "end_index": 8,
    "child_nodes": [
      {
        "title": "Historical Context and Legislative History",
        "start_index": 8,
        "end_index": 10
      },
      {
        "title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion",
        "start_index": 10,
        "end_index": 14
      },
      {
        "title": "Guidance on Applying the Interpretation of the Solely Incidental Prong",
        "start_index": 14,
        "end_index": 22
      }
    ]
  },
  {
    "title": "Economic Considerations",
    "start_index": 22,
    "end_index": 22,
    "child_nodes": [
      {
        "title": "Background",
        "start_index": 22,
        "end_index": 23
      },
      {
        "title": "Potential Economic Effects",
        "start_index": 23,
        "end_index": 28
      }
    ]
  }
]
466
results/Regulation Best Interest_proposed rule_structure.json
Normal file
@@ -0,0 +1,466 @@
[
  {
    "title": "Preface",
    "start_index": 1,
    "end_index": 6
  },
  {
    "title": "INTRODUCTION",
    "start_index": 6,
    "end_index": 12,
    "child_nodes": [
      {
        "title": "Background",
        "start_index": 12,
        "end_index": 22,
        "child_nodes": [
          {
            "title": "Evaluation of Standards of Conduct Applicable to Investment Advice",
            "start_index": 22,
            "end_index": 26
          },
          {
            "title": "DOL Rulemaking",
            "start_index": 26,
            "end_index": 32
          },
          {
            "title": "Statement by Chairman Clayton",
            "start_index": 32,
            "end_index": 36
          }
        ]
      },
      {
        "title": "General Objectives of Proposed Approach",
        "start_index": 36,
        "end_index": 44
      }
    ]
  },
  {
    "title": "DISCUSSION OF REGULATION BEST INTEREST",
    "start_index": 44,
    "end_index": 44,
    "child_nodes": [
      {
        "title": "Overview of Regulation Best Interest",
        "start_index": 44,
        "end_index": 50
      },
      {
        "title": "Best Interest, Generally",
        "start_index": 50,
        "end_index": 58,
        "child_nodes": [
          {
            "title": "Consistency with Other Approaches",
            "start_index": 58,
            "end_index": 66
          },
          {
            "title": "Request for Comment on the Best Interest Obligation",
            "start_index": 66,
            "end_index": 71
          }
        ]
      },
      {
        "title": "Key Terms and Scope of Best Interest Obligation",
        "start_index": 71,
        "end_index": 71,
        "child_nodes": [
          {
            "title": "Natural Person who is an Associated Person",
            "start_index": 71,
            "end_index": 72
          },
          {
            "title": "When Making a Recommendation, At Time Recommendation is Made",
            "start_index": 72,
            "end_index": 82
          },
          {
            "title": "Any Securities Transaction or Investment Strategy",
            "start_index": 82,
            "end_index": 83
          },
          {
            "title": "Retail Customer",
            "start_index": 83,
            "end_index": 90
          },
          {
            "title": "Request for Comment on Key Terms and Scope of Best Interest Obligation",
            "start_index": 90,
            "end_index": 96
          }
        ]
      },
      {
        "title": "Components of Regulation Best Interest",
        "start_index": 96,
        "end_index": 97,
        "child_nodes": [
          {
            "title": "Disclosure Obligation",
            "start_index": 97,
            "end_index": 133
          },
          {
            "title": "Care Obligation",
            "start_index": 133,
            "end_index": 166
          },
          {
            "title": "Conflict of Interest Obligations",
            "start_index": 166,
            "end_index": 196
          }
        ]
      },
      {
        "title": "Recordkeeping and Retention",
        "start_index": 196,
        "end_index": 199
      },
      {
        "title": "Whether the Exercise of Investment Discretion Should be Viewed as Solely Incidental to the Business of a Broker or Dealer",
        "start_index": 199,
        "end_index": 209
      }
    ]
  },
  {
    "title": "REQUEST FOR COMMENT",
    "start_index": 209,
    "end_index": 210,
    "child_nodes": [
      {
        "title": "Generally",
        "start_index": 210,
        "end_index": 212
      },
      {
        "title": "Interactions with Other Standards of Conduct",
        "start_index": 212,
        "end_index": 214
      }
    ]
  },
  {
    "title": "ECONOMIC ANALYSIS",
    "start_index": 214,
    "end_index": 214,
    "child_nodes": [
      {
        "title": "Introduction, Primary Goals of Proposed Regulations and Broad Economic Considerations",
        "start_index": 214,
        "end_index": 214,
        "child_nodes": [
          {
            "title": "Introduction and Primary Goals of Proposed Regulation",
            "start_index": 214,
            "end_index": 215
          },
          {
            "title": "Broad Economic Considerations",
            "start_index": 215,
            "end_index": 225
          }
        ]
      },
      {
        "title": "Economic Baseline",
        "start_index": 225,
        "end_index": 225,
        "child_nodes": [
          {
            "title": "Market for Advice Services",
            "start_index": 225,
            "end_index": 246
          },
          {
            "title": "Regulatory Baseline",
            "start_index": 246,
            "end_index": 255
          }
        ]
      },
      {
        "title": "Benefits, Costs, and Effects on Efficiency, Competition, and Capital Formation",
        "start_index": 255,
        "end_index": 258,
        "child_nodes": [
          {
            "title": "Benefits",
            "start_index": 258,
            "end_index": 272
          },
          {
            "title": "Costs",
            "start_index": 272,
            "end_index": 275,
            "child_nodes": [
              {
                "title": "Standard of Conduct Defined as Best Interest",
                "start_index": 275,
                "end_index": 275,
                "child_nodes": [
                  {
                    "title": "Operational Costs",
                    "start_index": 275,
                    "end_index": 277
                  },
                  {
                    "title": "Programmatic Costs",
                    "start_index": 278,
                    "end_index": 280
                  }
                ]
              },
              {
                "title": "Disclosure Obligation",
                "start_index": 280,
                "end_index": 286
              },
              {
                "title": "Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence in Making a Recommendation",
                "start_index": 286,
                "end_index": 290
              },
              {
                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and at a Minimum Disclose, or Eliminate, All Material Conflicts of Interest Associated with a Recommendation",
                "start_index": 290,
                "end_index": 295,
                "child_nodes": [
                  {
                    "title": "Eliminate Material Conflicts of Interest Associated with a Recommendation",
                    "start_index": 295,
                    "end_index": 297
                  },
                  {
                    "title": "At a Minimum Disclose Material Conflicts of Interest Associated with a Recommendation",
                    "start_index": 297,
                    "end_index": 299
                  }
                ]
              },
              {
                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and Disclose and Mitigate, or Eliminate, Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
                "start_index": 299,
                "end_index": 300,
                "child_nodes": [
                  {
                    "title": "Eliminate Material Conflicts Arising from Financial Incentives Associated with a Recommendation",
                    "start_index": 300,
                    "end_index": 304
                  },
                  {
                    "title": "Disclose and Mitigate Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
                    "start_index": 304,
                    "end_index": 316
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "title": "Effects on Efficiency, Competition, and Capital Formation",
        "start_index": 316,
        "end_index": 324
      },
      {
        "title": "Reasonable Alternatives",
        "start_index": 324,
        "end_index": 325,
        "child_nodes": [
          {
            "title": "Disclosure-Only Alternative",
            "start_index": 325,
            "end_index": 327
          },
          {
            "title": "Principles-Based Standard of Conduct Obligation",
            "start_index": 327,
            "end_index": 328
          },
          {
            "title": "A Fiduciary Standard for Broker-Dealers",
            "start_index": 328,
            "end_index": 332
          },
          {
            "title": "Enhanced Standards Akin to Conditions of the BIC Exemption",
            "start_index": 332,
            "end_index": 335
          }
        ]
      },
      {
        "title": "Request for Comment",
        "start_index": 335,
        "end_index": 338
      }
    ]
  },
  {
    "title": "PAPERWORK REDUCTION ACT ANALYSIS",
    "start_index": 338,
    "end_index": 340,
    "child_nodes": [
      {
        "title": "Respondents Subject to Proposed Regulation Best Interest and Proposed Amendments to Rule 17a-3(a)(25), Rule 17a-4(e)(5)",
        "start_index": 340,
        "end_index": 340,
        "child_nodes": [
          {
            "title": "Broker-Dealers",
            "start_index": 340,
            "end_index": 340
          },
          {
            "title": "Natural Persons Who Are Associated Persons of Broker-Dealers",
            "start_index": 340,
            "end_index": 341
          }
        ]
      },
      {
        "title": "Summary of Collections of Information",
        "start_index": 341,
        "end_index": 342,
        "child_nodes": [
          {
            "title": "Conflict of Interest Obligations",
            "start_index": 342,
            "end_index": 353
          },
          {
            "title": "Disclosure Obligation",
            "start_index": 353,
            "end_index": 370
          },
          {
            "title": "Care Obligation",
            "start_index": 370,
            "end_index": 370
          },
          {
            "title": "Record-Making and Recordkeeping Obligations",
            "start_index": 370,
            "end_index": 375
          }
        ]
      },
      {
        "title": "Collection of Information is Mandatory",
        "start_index": 375,
        "end_index": 375
      },
      {
        "title": "Confidentiality",
        "start_index": 375,
        "end_index": 376
      },
      {
        "title": "Request for Comment",
        "start_index": 376,
        "end_index": 377
      }
    ]
  },
  {
    "title": "SMALL BUSINESS REGULATORY ENFORCEMENT FAIRNESS ACT",
    "start_index": 377,
    "end_index": 378
  },
  {
    "title": "INITIAL REGULATORY FLEXIBILITY ACT ANALYSIS",
    "start_index": 378,
    "end_index": 379,
    "child_nodes": [
      {
        "title": "Reasons for and Objectives of the Proposed Action",
        "start_index": 379,
        "end_index": 381
      },
      {
        "title": "Legal Basis",
        "start_index": 381,
        "end_index": 381
      },
      {
        "title": "Small Entities Subject to the Proposed Rule",
        "start_index": 381,
        "end_index": 382
      },
      {
        "title": "Projected Compliance Requirements of the Proposed Rule for Small Entities",
        "start_index": 382,
        "end_index": 383,
        "child_nodes": [
          {
            "title": "Conflict of Interest Obligations",
            "start_index": 383,
            "end_index": 386
          },
          {
            "title": "Disclosure Obligations",
            "start_index": 387,
            "end_index": 394
          },
          {
            "title": "Obligation to Exercise Reasonable Diligence, Care, Skill and Prudence",
            "start_index": 394,
            "end_index": 394
          },
          {
            "title": "Record-Making and Recordkeeping Obligations",
            "start_index": 394,
            "end_index": 397
          }
        ]
      },
      {
        "title": "Duplicative, Overlapping, or Conflicting Federal Rules",
|
||||
"start_index": 397,
|
||||
"end_index": 398
|
||||
},
|
||||
{
|
||||
"title": "Significant Alternatives",
|
||||
"start_index": 398,
|
||||
"end_index": 401,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Disclosure-Only Alternative",
|
||||
"start_index": 401,
|
||||
"end_index": 401
|
||||
},
|
||||
{
|
||||
"title": "Principles-Based Alternative",
|
||||
"start_index": 401,
|
||||
"end_index": 402
|
||||
},
|
||||
{
|
||||
"title": "Enhanced Standards Akin to BIC Exemption",
|
||||
"start_index": 402,
|
||||
"end_index": 403
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "General Request for Comment",
|
||||
"start_index": 403,
|
||||
"end_index": 403
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "STATUTORY AUTHORITY AND TEXT OF PROPOSED RULE",
|
||||
"start_index": 403,
|
||||
"end_index": 408
|
||||
}
|
||||
]
|
||||
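The structure above is a tree of sections, each with a `title`, a page range (`start_index` to `end_index`), and optional `child_nodes`. As a minimal sketch of how such a tree might be queried, the hypothetical helper `sections_for_page` below (not part of the repository) returns the title paths of all sections whose range contains a given page. The sample data is a small excerpt of the entries above.

```python
def sections_for_page(nodes, page, path=()):
    """Return every title path whose [start_index, end_index] range contains `page`."""
    hits = []
    for node in nodes:
        current = path + (node["title"],)
        if node["start_index"] <= page <= node["end_index"]:
            hits.append(current)
        # Always descend: in this data a parent's range can be narrower than
        # its children's (e.g. parent 324-325, children 325-335).
        hits.extend(sections_for_page(node.get("child_nodes", []), page, current))
    return hits


# Excerpt of the structure shown above.
sample = [
    {
        "title": "Reasonable Alternatives",
        "start_index": 324,
        "end_index": 325,
        "child_nodes": [
            {"title": "Disclosure-Only Alternative", "start_index": 325, "end_index": 327},
            {"title": "Principles-Based Standard of Conduct Obligation", "start_index": 327, "end_index": 328},
        ],
    },
    {"title": "Request for Comment", "start_index": 335, "end_index": 338},
]

matches = sections_for_page(sample, 327)
for match in matches:
    print(" > ".join(match))
```

Because adjacent sections can share a boundary page, a single page may match more than one section; the sketch returns all of them rather than picking one.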
220
results/q1-fy25-earnings_structure.json
Normal file
|
|
@@ -0,0 +1,220 @@
|
|||
[
|
||||
{
|
||||
"title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025",
|
||||
"start_index": 1,
|
||||
"end_index": 1,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Financial Results for the Quarter",
|
||||
"start_index": 1,
|
||||
"end_index": 1,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Key Points",
|
||||
"start_index": 1,
|
||||
"end_index": 1
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Guidance and Outlook",
|
||||
"start_index": 2,
|
||||
"end_index": 2,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Star India deconsolidated in Q1",
|
||||
"start_index": 2,
|
||||
"end_index": 2
|
||||
},
|
||||
{
|
||||
"title": "Q2 Fiscal 2025",
|
||||
"start_index": 2,
|
||||
"end_index": 2
|
||||
},
|
||||
{
|
||||
"title": "Fiscal Year 2025",
|
||||
"start_index": 2,
|
||||
"end_index": 2
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Message From Our CEO",
|
||||
"start_index": 2,
|
||||
"end_index": 2
|
||||
},
|
||||
{
|
||||
"title": "SUMMARIZED FINANCIAL RESULTS",
|
||||
"start_index": 3,
|
||||
"end_index": 3,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "SUMMARIZED SEGMENT FINANCIAL RESULTS",
|
||||
"start_index": 3,
|
||||
"end_index": 3
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS",
|
||||
"start_index": 4,
|
||||
"end_index": 4,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Star India",
|
||||
"start_index": 4,
|
||||
"end_index": 4
|
||||
},
|
||||
{
|
||||
"title": "Entertainment",
|
||||
"start_index": 4,
|
||||
"end_index": 4,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Linear Networks",
|
||||
"start_index": 5,
|
||||
"end_index": 5
|
||||
},
|
||||
{
|
||||
"title": "Direct-to-Consumer",
|
||||
"start_index": 5,
|
||||
"end_index": 7
|
||||
},
|
||||
{
|
||||
"title": "Content Sales/Licensing and Other",
|
||||
"start_index": 7,
|
||||
"end_index": 7
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Sports",
|
||||
"start_index": 7,
|
||||
"end_index": 7,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Domestic ESPN",
|
||||
"start_index": 8,
|
||||
"end_index": 8
|
||||
},
|
||||
{
|
||||
"title": "International ESPN",
|
||||
"start_index": 8,
|
||||
"end_index": 8
|
||||
},
|
||||
{
|
||||
"title": "Star India",
|
||||
"start_index": 8,
|
||||
"end_index": 8
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Experiences",
|
||||
"start_index": 9,
|
||||
"end_index": 9,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Domestic Parks and Experiences",
|
||||
"start_index": 9,
|
||||
"end_index": 9
|
||||
},
|
||||
{
|
||||
"title": "International Parks and Experiences",
|
||||
"start_index": 9,
|
||||
"end_index": 9
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "OTHER FINANCIAL INFORMATION",
|
||||
"start_index": 9,
|
||||
"end_index": 9,
|
||||
"child_nodes": [
|
||||
{
|
||||
"title": "Corporate and Unallocated Shared Expenses",
|
||||
"start_index": 9,
|
||||
"end_index": 9
|
||||
},
|
||||
{
|
||||
"title": "Restructuring and Impairment Charges",
|
||||
"start_index": 9,
|
||||
"end_index": 9
|
||||
},
|
||||
{
|
||||
"title": "Interest Expense, net",
|
||||
"start_index": 10,
|
||||
"end_index": 10
|
||||
},
|
||||
{
|
||||
"title": "Equity in the Income of Investees",
|
||||
"start_index": 10,
|
||||
"end_index": 10
|
||||
},
|
||||
{
|
||||
"title": "Income Taxes",
|
||||
"start_index": 10,
|
||||
"end_index": 10
|
||||
},
|
||||
{
|
||||
"title": "Noncontrolling Interests",
|
||||
"start_index": 11,
|
||||
"end_index": 11
|
||||
},
|
||||
{
|
||||
"title": "Cash from Operations",
|
||||
"start_index": 11,
|
||||
"end_index": 11
|
||||
},
|
||||
{
|
||||
"title": "Capital Expenditures",
|
||||
"start_index": 12,
|
||||
"end_index": 12
|
||||
},
|
||||
{
|
||||
"title": "Depreciation Expense",
|
||||
"start_index": 12,
|
||||
"end_index": 12
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME",
|
||||
"start_index": 13,
|
||||
"end_index": 13
|
||||
},
|
||||
{
|
||||
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS",
|
||||
"start_index": 14,
|
||||
"end_index": 14
|
||||
},
|
||||
{
|
||||
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS",
|
||||
"start_index": 15,
|
||||
"end_index": 15
|
||||
},
|
||||
{
|
||||
"title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS",
|
||||
"start_index": 16,
|
||||
"end_index": 16
|
||||
},
|
||||
{
|
||||
"title": "NON-GAAP FINANCIAL MEASURES",
|
||||
"start_index": 17,
|
||||
"end_index": 20
|
||||
},
|
||||
{
|
||||
"title": "FORWARD-LOOKING STATEMENTS",
|
||||
"start_index": 21,
|
||||
"end_index": 21
|
||||
},
|
||||
{
|
||||
"title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION",
|
||||
"start_index": 22,
|
||||
"end_index": 22
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
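For a quick visual check of a structure file like the one above, a short sketch such as the following renders the tree as an indented outline. `format_outline` is a hypothetical helper, not something the repository provides, and the sample nodes are copied from the entries above.

```python
def format_outline(nodes, depth=0):
    """Return one indented line per node, in document order."""
    lines = []
    for node in nodes:
        start, end = node["start_index"], node["end_index"]
        pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
        lines.append(f"{'  ' * depth}{node['title']} ({pages})")
        lines.extend(format_outline(node.get("child_nodes", []), depth + 1))
    return lines


# Excerpt of the structure shown above.
sample = [
    {
        "title": "Entertainment",
        "start_index": 4,
        "end_index": 4,
        "child_nodes": [
            {"title": "Linear Networks", "start_index": 5, "end_index": 5},
            {"title": "Direct-to-Consumer", "start_index": 5, "end_index": 7},
        ],
    },
    {"title": "NON-GAAP FINANCIAL MEASURES", "start_index": 17, "end_index": 20},
]

outline = format_outline(sample)
print("\n".join(outline))
```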
524
utils.py
Normal file
|
|
@@ -0,0 +1,524 @@
|
|||
import tiktoken
|
||||
import openai
|
||||
import logging
|
||||
import os
|
||||
from datetime import datetime
|
||||
import time
|
||||
import json
|
||||
import PyPDF2
|
||||
import copy
|
||||
import asyncio
|
||||
import pymupdf
|
||||
from io import BytesIO
|
||||
import re
|
||||
|
||||
|
||||
def count_tokens(text, model):
|
||||
enc = tiktoken.encoding_for_model(model)
|
||||
tokens = enc.encode(text)
|
||||
return len(tokens)
|
||||
|
||||
def ChatGPT_API_with_finish_reason(model, prompt, api_key, chat_history=None):
|
||||
max_retries = 10
|
||||
client = openai.OpenAI(api_key=api_key)
|
||||
for i in range(max_retries):
|
||||
try:
|
||||
if chat_history:
|
||||
messages = list(chat_history)  # copy so a retry does not append the prompt again
|
||||
messages.append({"role": "user", "content": prompt})
|
||||
else:
|
||||
messages = [{"role": "user", "content": prompt}]
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=0,
|
||||
)
|
||||
if response.choices[0].finish_reason == "length":
|
||||
return response.choices[0].message.content, "max_output_reached"
|
||||
else:
|
||||
return response.choices[0].message.content, "finished"
|
||||
|
||||
except Exception as e:
|
||||
print('************* Retrying *************')
|
||||
logging.error(f"Error: {e}")
|
||||
if i < max_retries - 1:
|
||||
time.sleep(1)  # Wait 1 second before retrying
|
||||
else:
|
||||
logging.error('Max retries reached for prompt: ' + prompt)
|
||||
return "Error"
|
||||
|
||||
|
||||
|
||||
def ChatGPT_API(model, prompt, api_key, chat_history=None):
|
||||
max_retries = 10
|
||||
client = openai.OpenAI(api_key=api_key)
|
||||
for i in range(max_retries):
|
||||
try:
|
||||
if chat_history:
|
||||
messages = list(chat_history)  # copy so a retry does not append the prompt again
|
||||
messages.append({"role": "user", "content": prompt})
|
||||
else:
|
||||
messages = [{"role": "user", "content": prompt}]
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=0,
|
||||
)
|
||||
|
||||
return response.choices[0].message.content
|
||||
except Exception as e:
|
||||
print('************* Retrying *************')
|
||||
logging.error(f"Error: {e}")
|
||||
if i < max_retries - 1:
|
||||
time.sleep(1)  # Wait 1 second before retrying
|
||||
else:
|
||||
logging.error('Max retries reached for prompt: ' + prompt)
|
||||
return "Error"
|
||||
|
||||
|
||||
async def ChatGPT_API_async(model, prompt, api_key):
|
||||
max_retries = 10
|
||||
client = openai.AsyncOpenAI(api_key=api_key)
|
||||
for i in range(max_retries):
|
||||
try:
|
||||
messages = [{"role": "user", "content": prompt}]
|
||||
response = await client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=0,
|
||||
)
|
||||
return response.choices[0].message.content
|
||||
except Exception as e:
|
||||
print('************* Retrying *************')
|
||||
logging.error(f"Error: {e}")
|
||||
if i < max_retries - 1:
|
||||
await asyncio.sleep(1)  # Wait 1 second before retrying
|
||||
else:
|
||||
logging.error('Max retries reached for prompt: ' + prompt)
|
||||
return "Error"
|
||||
|
||||
def get_json_content(response):
|
||||
start_idx = response.find("```json")
|
||||
if start_idx != -1:
|
||||
start_idx += 7
|
||||
response = response[start_idx:]
|
||||
|
||||
end_idx = response.rfind("```")
|
||||
if end_idx != -1:
|
||||
response = response[:end_idx]
|
||||
|
||||
json_content = response.strip()
|
||||
return json_content
|
||||
|
||||
|
||||
def extract_json(content):
|
||||
try:
|
||||
# First, try to extract JSON enclosed within ```json and ```
|
||||
start_idx = content.find("```json")
|
||||
if start_idx != -1:
|
||||
start_idx += 7 # Adjust index to start after the delimiter
|
||||
end_idx = content.rfind("```")
|
||||
json_content = content[start_idx:end_idx].strip()
|
||||
else:
|
||||
# If no delimiters, assume entire content could be JSON
|
||||
json_content = content.strip()
|
||||
|
||||
# Clean up common issues that might cause parsing errors
|
||||
json_content = json_content.replace('None', 'null') # Replace Python None with JSON null
|
||||
json_content = json_content.replace('\n', ' ').replace('\r', ' ') # Remove newlines
|
||||
json_content = ' '.join(json_content.split()) # Normalize whitespace
|
||||
|
||||
# Attempt to parse and return the JSON object
|
||||
return json.loads(json_content)
|
||||
except json.JSONDecodeError as e:
|
||||
logging.error(f"Failed to extract JSON: {e}")
|
||||
# Try to clean up the content further if initial parsing fails
|
||||
try:
|
||||
# Remove any trailing commas before closing brackets/braces
|
||||
json_content = json_content.replace(',]', ']').replace(',}', '}')
|
||||
return json.loads(json_content)
|
||||
except json.JSONDecodeError:
|
||||
logging.error("Failed to parse JSON even after cleanup")
|
||||
return {}
|
||||
except Exception as e:
|
||||
logging.error(f"Unexpected error while extracting JSON: {e}")
|
||||
return {}
|
||||
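As a minimal, standalone sketch of the fenced-JSON extraction that `get_json_content` and `extract_json` perform, the snippet below pulls the payload out of a model reply wrapped in a `json` code fence. `parse_fenced_json` and the sample reply are hypothetical; unlike `extract_json`, this sketch does not rewrite `None` to `null` or strip trailing commas.

```python
import json

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence


def parse_fenced_json(reply):
    """Parse JSON from a reply that may be wrapped in a json code fence."""
    opener = FENCE + "json"
    start = reply.find(opener)
    if start != -1:
        body = reply[start + len(opener):]
        end = body.rfind(FENCE)
        if end != -1:
            body = body[:end]
    else:
        body = reply  # no fence: assume the whole reply is JSON
    try:
        return json.loads(body.strip())
    except json.JSONDecodeError:
        return {}  # same fallback value that extract_json uses


reply = f'Here is the result:\n{FENCE}json\n{{"title": "Preface", "physical_index": 1}}\n{FENCE}'
parsed = parse_fenced_json(reply)
print(parsed)
```

Returning an empty dict on failure mirrors `extract_json`, but it also means callers cannot tell an empty result from a parse error; raising instead would be a reasonable alternative.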
|
||||
def write_node_id(data, node_id=0):
|
||||
if isinstance(data, dict):
|
||||
data['node_id'] = str(node_id).zfill(4)
|
||||
node_id += 1
|
||||
for key in list(data.keys()):
|
||||
if 'child_nodes' in key:
|
||||
node_id = write_node_id(data[key], node_id)
|
||||
elif isinstance(data, list):
|
||||
for index in range(len(data)):
|
||||
node_id = write_node_id(data[index], node_id)
|
||||
return node_id
|
||||
|
||||
def get_nodes(structure):
|
||||
if isinstance(structure, dict):
|
||||
structure_node = copy.deepcopy(structure)
|
||||
structure_node.pop('child_nodes', None)
|
||||
nodes = [structure_node]
|
||||
for key in list(structure.keys()):
|
||||
if 'child_nodes' in key:
|
||||
nodes.extend(get_nodes(structure[key]))
|
||||
return nodes
|
||||
elif isinstance(structure, list):
|
||||
nodes = []
|
||||
for item in structure:
|
||||
nodes.extend(get_nodes(item))
|
||||
return nodes
|
||||
|
||||
def structure_to_list(structure):
|
||||
if isinstance(structure, dict):
|
||||
nodes = []
|
||||
nodes.append(structure)
|
||||
if 'child_nodes' in structure:
|
||||
nodes.extend(structure_to_list(structure['child_nodes']))
|
||||
return nodes
|
||||
elif isinstance(structure, list):
|
||||
nodes = []
|
||||
for item in structure:
|
||||
nodes.extend(structure_to_list(item))
|
||||
return nodes
|
||||
|
||||
|
||||
def get_leaf_nodes(structure):
|
||||
if isinstance(structure, dict):
|
||||
if not structure.get('child_nodes'):
|
||||
structure_node = copy.deepcopy(structure)
|
||||
structure_node.pop('child_nodes', None)
|
||||
return [structure_node]
|
||||
else:
|
||||
leaf_nodes = []
|
||||
for key in list(structure.keys()):
|
||||
if 'child_nodes' in key:
|
||||
leaf_nodes.extend(get_leaf_nodes(structure[key]))
|
||||
return leaf_nodes
|
||||
elif isinstance(structure, list):
|
||||
leaf_nodes = []
|
||||
for item in structure:
|
||||
leaf_nodes.extend(get_leaf_nodes(item))
|
||||
return leaf_nodes
|
||||
|
||||
def is_leaf_node(data, node_id):
|
||||
# Helper function to find the node by its node_id
|
||||
def find_node(data, node_id):
|
||||
if isinstance(data, dict):
|
||||
if data.get('node_id') == node_id:
|
||||
return data
|
||||
for key in data.keys():
|
||||
if 'child_nodes' in key:
|
||||
result = find_node(data[key], node_id)
|
||||
if result:
|
||||
return result
|
||||
elif isinstance(data, list):
|
||||
for item in data:
|
||||
result = find_node(item, node_id)
|
||||
if result:
|
||||
return result
|
||||
return None
|
||||
|
||||
# Find the node with the given node_id
|
||||
node = find_node(data, node_id)
|
||||
|
||||
# Check if the node is a leaf node
|
||||
if node and not node.get('child_nodes'):
|
||||
return True
|
||||
return False
|
||||
|
||||
def get_last_node(structure):
|
||||
return structure[-1]
|
||||
|
||||
|
||||
def extract_text_from_pdf(pdf_path):
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
# Return the text as a single string, not a list of pages
|
||||
text=""
|
||||
for page_num in range(len(pdf_reader.pages)):
|
||||
page = pdf_reader.pages[page_num]
|
||||
text+=page.extract_text()
|
||||
return text
|
||||
|
||||
def get_pdf_title(pdf_path):
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
meta = pdf_reader.metadata
|
||||
title = meta.title
|
||||
return title
|
||||
|
||||
def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
text = ""
|
||||
for page_num in range(start_page-1, end_page):
|
||||
page = pdf_reader.pages[page_num]
|
||||
page_text = page.extract_text()
|
||||
if tag:
|
||||
text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n"
|
||||
else:
|
||||
text += page_text
|
||||
return text
|
||||
|
||||
def get_first_start_page_from_text(text):
|
||||
start_page = -1
|
||||
start_page_match = re.search(r'<start_index_(\d+)>', text)
|
||||
if start_page_match:
|
||||
start_page = int(start_page_match.group(1))
|
||||
return start_page
|
||||
|
||||
def get_last_start_page_from_text(text):
|
||||
start_page = -1
|
||||
# Find all matches of start_index tags
|
||||
start_page_matches = re.finditer(r'<start_index_(\d+)>', text)
|
||||
# Convert iterator to list and get the last match if any exist
|
||||
matches_list = list(start_page_matches)
|
||||
if matches_list:
|
||||
start_page = int(matches_list[-1].group(1))
|
||||
return start_page
|
||||
|
||||
|
||||
|
||||
|
||||
def sanitize_filename(filename, replacement='-'):
|
||||
# In Linux, only '/' and '\0' (null) are invalid in filenames.
|
||||
# Null can't be represented in strings, so we only handle '/'.
|
||||
return filename.replace('/', replacement)
|
||||
|
||||
class JsonLogger:
|
||||
def __init__(self, file_path):
|
||||
# Extract PDF name without extension for logger name and filename
|
||||
# pdf_name = os.path.splitext(os.path.basename(file_path))[0]
|
||||
if isinstance(file_path, str):
|
||||
pdf_name = os.path.splitext(os.path.basename(file_path))[0]
|
||||
elif isinstance(file_path, BytesIO):
|
||||
pdf_reader = PyPDF2.PdfReader(file_path)
|
||||
meta = pdf_reader.metadata
|
||||
pdf_name = meta.title if meta.title else 'Untitled'
|
||||
pdf_name = sanitize_filename(pdf_name)
|
||||
|
||||
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
self.filename = f"{pdf_name}_{current_time}.json"
|
||||
os.makedirs("./logs", exist_ok=True)
|
||||
# Initialize empty list to store all messages
|
||||
self.log_data = []
|
||||
|
||||
def log(self, level, message, **kwargs):
|
||||
if isinstance(message, dict):
|
||||
self.log_data.append(message)
|
||||
else:
|
||||
self.log_data.append({'message': message})
|
||||
# Add new message to the log data
|
||||
|
||||
# Write entire log data to file
|
||||
with open(self._filepath(), "w") as f:
|
||||
json.dump(self.log_data, f, indent=2)
|
||||
|
||||
def info(self, message, **kwargs):
|
||||
self.log("INFO", message, **kwargs)
|
||||
|
||||
def error(self, message, **kwargs):
|
||||
self.log("ERROR", message, **kwargs)
|
||||
|
||||
def debug(self, message, **kwargs):
|
||||
self.log("DEBUG", message, **kwargs)
|
||||
|
||||
def exception(self, message, **kwargs):
|
||||
kwargs["exception"] = True
|
||||
self.log("ERROR", message, **kwargs)
|
||||
|
||||
def _filepath(self):
|
||||
return os.path.join("logs", self.filename)
|
||||
|
||||
|
||||
|
||||
|
||||
def list_to_tree(data):
|
||||
def get_parent_structure(structure):
|
||||
"""Helper function to get the parent structure code"""
|
||||
if not structure:
|
||||
return None
|
||||
parts = str(structure).split('.')
|
||||
return '.'.join(parts[:-1]) if len(parts) > 1 else None
|
||||
|
||||
# First pass: Create nodes and track parent-child relationships
|
||||
nodes = {}
|
||||
root_nodes = []
|
||||
|
||||
for item in data:
|
||||
structure = item.get('structure')
|
||||
node = {
|
||||
'title': item.get('title'),
|
||||
'start_index': item.get('start_index'),
|
||||
'end_index': item.get('end_index'),
|
||||
'child_nodes': []
|
||||
}
|
||||
|
||||
nodes[structure] = node
|
||||
|
||||
# Find parent
|
||||
parent_structure = get_parent_structure(structure)
|
||||
|
||||
if parent_structure:
|
||||
# Add as child to parent if parent exists
|
||||
if parent_structure in nodes:
|
||||
nodes[parent_structure]['child_nodes'].append(node)
|
||||
else:
|
||||
root_nodes.append(node)
|
||||
else:
|
||||
# No parent, this is a root node
|
||||
root_nodes.append(node)
|
||||
|
||||
# Helper function to clean empty children arrays
|
||||
def clean_node(node):
|
||||
if not node['child_nodes']:
|
||||
del node['child_nodes']
|
||||
else:
|
||||
for child in node['child_nodes']:
|
||||
clean_node(child)
|
||||
return node
|
||||
|
||||
# Clean and return the tree
|
||||
return [clean_node(node) for node in root_nodes]
|
||||
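To illustrate what `list_to_tree` does with dotted `structure` codes, here is a compact, self-contained sketch of the same idea. `build_tree` is a hypothetical stand-in, not the function above; the titles are borrowed from the earnings structure earlier on this page, and the `structure` codes are assumed for illustration.

```python
def build_tree(items):
    """Nest a flat, document-ordered list by dotted codes such as '1', '1.1', '1.2'."""
    by_code = {}
    roots = []
    for item in items:
        node = {"title": item["title"], "child_nodes": []}
        code = str(item["structure"])
        by_code[code] = node
        parent_code = code.rpartition(".")[0]  # '' for a top-level code
        parent = by_code.get(parent_code)
        # An item whose parent was never seen is kept as a root, as in list_to_tree.
        if parent is not None:
            parent["child_nodes"].append(node)
        else:
            roots.append(node)
    return roots


flat = [
    {"structure": "1", "title": "Sports"},
    {"structure": "1.1", "title": "Domestic ESPN"},
    {"structure": "1.2", "title": "International ESPN"},
    {"structure": "2", "title": "Experiences"},
]

tree = build_tree(flat)
print([child["title"] for child in tree[0]["child_nodes"]])
```

The single pass relies on parents appearing before their children in the input, which holds for a table of contents listed in reading order.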
|
||||
def add_preface_if_needed(data):
|
||||
if not isinstance(data, list) or not data:
|
||||
return data
|
||||
|
||||
if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
|
||||
preface_node = {
|
||||
"structure": "0",
|
||||
"title": "Preface",
|
||||
"physical_index": 1,
|
||||
}
|
||||
data.insert(0, preface_node)
|
||||
return data
|
||||
|
||||
|
||||
|
||||
def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
|
||||
if pdf_parser == "PyPDF2":
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
elif pdf_parser == "PyMuPDF":
|
||||
pdf_reader = pymupdf.open(pdf_path)
|
||||
else:
|
||||
raise ValueError(f"Unsupported PDF parser: {pdf_parser}")
|
||||
|
||||
enc = tiktoken.encoding_for_model(model)
|
||||
|
||||
page_list = []
|
||||
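# PyPDF2 exposes a `pages` sequence and extract_text(); a PyMuPDF document is iterated directly and its pages use get_text().
|
||||
pages = pdf_reader.pages if pdf_parser == "PyPDF2" else pdf_reader
|
||||
for page in pages:
|
||||
page_text = page.extract_text() if pdf_parser == "PyPDF2" else page.get_text()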
|
||||
token_length = len(enc.encode(page_text))
|
||||
page_list.append((page_text, token_length))
|
||||
|
||||
return page_list
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
|
||||
text = ""
|
||||
for page_num in range(start_page-1, end_page):
|
||||
text += pdf_pages[page_num]
|
||||
return text
|
||||
|
||||
def get_number_of_pages(pdf_path):
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
num = len(pdf_reader.pages)
|
||||
return num
|
||||
|
||||
|
||||
|
||||
def post_processing(structure, end_physical_index):
|
||||
# First derive start_index and end_index from physical_index in the flat list
|
||||
for i, item in enumerate(structure):
|
||||
item['start_index'] = item.get('physical_index')
|
||||
if i < len(structure) - 1:
|
||||
if structure[i + 1].get('appear_start') == 'yes':
|
||||
item['end_index'] = structure[i + 1]['physical_index']-1
|
||||
else:
|
||||
item['end_index'] = structure[i + 1]['physical_index']
|
||||
else:
|
||||
item['end_index'] = end_physical_index
|
||||
tree = list_to_tree(structure)
|
||||
if len(tree)!=0:
|
||||
return tree
|
||||
else:
|
||||
### remove appear_start
|
||||
for node in structure:
|
||||
node.pop('appear_start', None)
|
||||
node.pop('physical_index', None)
|
||||
return structure
|
||||
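The end-page rule in `post_processing` can be shown in isolation: a section ends on the page before the next section when that next section opens its page (`appear_start == 'yes'`), and otherwise shares the next section's first page. `assign_page_ranges` is a hypothetical helper, and the sample entries are made up for illustration.

```python
def assign_page_ranges(items, last_page):
    """Derive start_index/end_index for each entry of a flat, ordered table of contents."""
    ranged = []
    for i, item in enumerate(items):
        start = item["physical_index"]
        if i + 1 < len(items):
            nxt = items[i + 1]
            if nxt.get("appear_start") == "yes":
                end = nxt["physical_index"] - 1  # next section starts a fresh page
            else:
                end = nxt["physical_index"]  # next section begins mid-page
        else:
            end = last_page  # final section runs to the end of the document
        ranged.append({"title": item["title"], "start_index": start, "end_index": end})
    return ranged


# Made-up entries for illustration only.
toc = [
    {"title": "Introduction", "physical_index": 1, "appear_start": "yes"},
    {"title": "Methods", "physical_index": 4, "appear_start": "yes"},
    {"title": "Results", "physical_index": 6, "appear_start": "no"},
]

ranges = assign_page_ranges(toc, last_page=9)
for entry in ranges:
    print(entry)
```

This is why adjacent sections in the structure files above often share a boundary page: the later section begins partway down the page on which the earlier one ends.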
|
||||
def clean_structure_post(data):
|
||||
if isinstance(data, dict):
|
||||
data.pop('page_number', None)
|
||||
data.pop('start_index', None)
|
||||
data.pop('end_index', None)
|
||||
if 'child_nodes' in data:
|
||||
clean_structure_post(data['child_nodes'])
|
||||
elif isinstance(data, list):
|
||||
for section in data:
|
||||
clean_structure_post(section)
|
||||
return data
|
||||
|
||||
|
||||
def remove_structure_text(data):
|
||||
if isinstance(data, dict):
|
||||
data.pop('text', None)
|
||||
if 'child_nodes' in data:
|
||||
remove_structure_text(data['child_nodes'])
|
||||
elif isinstance(data, list):
|
||||
for item in data:
|
||||
remove_structure_text(item)
|
||||
return data
|
||||
|
||||
|
||||
def check_token_limit(structure, limit=110000):
|
||||
nodes = structure_to_list(structure)
|
||||
for node in nodes:
|
||||
num_tokens = count_tokens(node['text'], model='gpt-4o')
|
||||
if num_tokens > limit:
|
||||
print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
|
||||
print("Start Index:", node['start_index'])
|
||||
print("End Index:", node['end_index'])
|
||||
print("Title:", node['title'])
|
||||
# print(node['text'])
|
||||
print("\n")
|
||||
|
||||
|
||||
def convert_physical_index_to_int(data):
|
||||
if isinstance(data, list):
|
||||
for i in range(len(data)):
|
||||
if isinstance(data[i]['physical_index'], str):
|
||||
if data[i]['physical_index'].startswith('<physical_index_'):
|
||||
data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
|
||||
elif data[i]['physical_index'].startswith('physical_index_'):
|
||||
data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
|
||||
elif isinstance(data, str):
|
||||
if data.startswith('<physical_index_'):
|
||||
data = int(data.split('_')[-1].rstrip('>').strip())
|
||||
elif data.startswith('physical_index_'):
|
||||
data = int(data.split('_')[-1].strip())
|
||||
###check data is int
|
||||
if isinstance(data, int):
|
||||
return data
|
||||
else:
|
||||
return None
|
||||
return data
|
||||
|
||||
|
||||
def convert_page_to_int(data):
|
||||
for item in data:
|
||||
if 'page' in item and isinstance(item['page'], str):
|
||||
try:
|
||||
item['page'] = int(item['page'])
|
||||
except ValueError:
|
||||
# Keep original value if conversion fails
|
||||
pass
|
||||
return data
|
||||