Mirror of https://github.com/VectifyAI/PageIndex.git
Synced 2026-04-27 17:16:22 +02:00
first commit
commit 6f43b477d3
17 changed files with 4529 additions and 0 deletions
15
.gitignore
vendored
Normal file
@ -0,0 +1,15 @@
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*
21
LICENSE
Normal file
@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Vectify AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
136
README.md
Normal file
@ -0,0 +1,136 @@
# PageIndex

### **Document Index System for Reasoning-Based RAG**

Traditional vector-based retrieval relies heavily on semantic similarity. But when working with professional documents that require domain expertise and multi-step reasoning, similarity search often falls short.

**Reasoning-Based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by **AlphaGo**, we leverage **tree search** to perform structured document retrieval.

**PageIndex** is an indexing system that builds search trees from long documents, making them ready for reasoning-based RAG.

Built by [Vectify AI](https://vectify.ai/pageindex)

---

## 🔍 What is PageIndex?

**PageIndex** transforms lengthy PDF documents into a semantic **tree structure**, similar to a "table of contents" but optimized for use with Large Language Models (LLMs).
It’s ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, or any document that exceeds LLM context limits.

### ✅ Key Features

- **Scales to Massive Documents**
  Designed to handle hundreds or even thousands of pages with ease.

- **Hierarchical Tree Structure**
  Enables LLMs to traverse documents logically, like an intelligent, LLM-optimized table of contents.

- **Precise Page Referencing**
  Every node contains its own summary and its start and end physical page indices, allowing pinpoint retrieval.

- **Chunk-Free Segmentation**
  No arbitrary chunking. Nodes follow the natural structure of the document.

---

## 📦 PageIndex Format

Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).

```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "child_nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
```
Notice: the node_id and summary generation function will be added soon.

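The example tree above can be consumed roughly as follows. This is a minimal sketch, not part of PageIndex: the helper names `iter_nodes` and `page_span` are hypothetical, the field names are taken from the example, and the `summary` values are dropped for brevity.

```python
from typing import Iterator, Tuple

# Trimmed copy of the example tree above (summaries omitted).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("child_nodes", []):
        yield from iter_nodes(child)


def page_span(root: dict, node_id: str) -> Tuple[int, int]:
    """Return (start_index, end_index) for the node with the given id."""
    for node in iter_nodes(root):
        if node.get("node_id") == node_id:
            return node["start_index"], node["end_index"]
    raise KeyError(node_id)


print(page_span(tree, "0007"))  # (22, 28)
```

A flat walk like this is enough for small trees; for repeated lookups it may be simpler to build a dictionary keyed by `node_id` once and reuse it.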
## 🧠 Reasoning-Based RAG with PageIndex

Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. Great for domain-specific tasks where nuance matters.

### 🛠️ Example Prompt

```python
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
    "thinking": <reasoning about where to look>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```

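To show how the prompt above could fit into a retrieval step, here is a minimal sketch that fills in the template and parses a reply. `build_prompt` and `parse_node_list` are hypothetical helpers, the reply string is hand-written sample data rather than real model output, and the call to an LLM is deliberately left out.

```python
import json


def build_prompt(question: str, structure: list) -> str:
    """Fill the example prompt with a question and a serialized tree."""
    return f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {json.dumps(structure, indent=2)}

Reply in the following JSON format:
{{
    "thinking": <reasoning about where to look>,
    "node_list": [node_id1, node_id2, ...]
}}
"""


def parse_node_list(reply: str) -> list:
    """Extract node ids from a reply; assumes the reply is valid JSON."""
    data = json.loads(reply)
    return [str(node_id) for node_id in data.get("node_list", [])]


structure = [{"title": "Financial Stability", "node_id": "0006",
              "start_index": 21, "end_index": 22}]
prompt = build_prompt("How does the Fed monitor financial vulnerabilities?", structure)

# Hand-written stand-in for a model reply, used only to exercise the parser.
sample_reply = '{"thinking": "The question is about monitoring.", "node_list": ["0006"]}'
print(parse_node_list(sample_reply))  # ['0006']
```

In practice a model may wrap its JSON in extra text, so a real pipeline would likely need more defensive parsing than `json.loads` alone.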
## 🚀 Usage

Follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
python3 page_index.py --pdf_path /path/to/your/document.pdf
```

The results will be saved in the `./results/` directory.

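Once a tree has been written, it can be loaded like any other JSON file. The sketch below assumes the output follows the `<pdf name>_structure.json` naming seen in the `results/` files of this commit and that each file holds a list of top-level nodes; `load_structure` and `inverted_ranges` are hypothetical helper names, and the sample data is made up. The second helper flags nodes whose `end_index` is smaller than their `start_index`, which may be worth checking before relying on the page ranges.

```python
import json
from pathlib import Path


def load_structure(pdf_path: str, results_dir: str = "./results") -> list:
    """Load the tree saved for a PDF, assuming <stem>_structure.json naming."""
    path = Path(results_dir) / f"{Path(pdf_path).stem}_structure.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def inverted_ranges(nodes: list) -> list:
    """Return titles of nodes (at any depth) whose end_index < start_index."""
    bad = []
    for node in nodes:
        if node["end_index"] < node["start_index"]:
            bad.append(node["title"])
        bad.extend(inverted_ranges(node.get("child_nodes", [])))
    return bad


# Made-up sample in the same shape as the generated trees.
sample = [
    {"title": "A", "start_index": 1, "end_index": 4},
    {"title": "B", "start_index": 5, "end_index": 5,
     "child_nodes": [{"title": "B.1", "start_index": 7, "end_index": 6}]},
]
print(inverted_ranges(sample))  # ['B.1']

# After a real run, something like this should work (path is illustrative):
# tree = load_structure("/path/to/your/document.pdf")
```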
## 🛤 Roadmap

- [ ] Add node summary and document selection
- [ ] Technical report on PageIndex design
- [ ] Efficient tree search algorithms for large documents
- [ ] Integration with vector-based semantic retrieval

## 📈 Case Study: Mafin 2.5

[Mafin 2.5](https://vectify.ai/blog/Mafin2.5) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of **PageIndex**, it achieved **98.7% accuracy** on the [FinanceBench](https://github.com/VectifyAI/Mafin2.5-FinanceBench) benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex’s hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 See full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) for detailed comparisons and performance metrics.

## 📬 Contact Us

Need customized support for your documents or reasoning-based RAG system?

👉 [Contact us here](https://ii2abc2jejf.typeform.com/to/meB40zV0)
0
__init__.py
Normal file
BIN
docs/2023-annual-report.pdf
Normal file
Binary file not shown.
BIN
docs/PRML.pdf
Normal file
Binary file not shown.
BIN
docs/Regulation Best Interest_Interpretive release.pdf
Normal file
Binary file not shown.
BIN
docs/Regulation Best Interest_proposed rule.pdf
Normal file
Binary file not shown.
BIN
docs/q1-fy25-earnings.pdf
Normal file
Binary file not shown.
1073
page_index.py
Normal file
File diff suppressed because it is too large
5
requirements.txt
Normal file
@ -0,0 +1,5 @@
openai==1.70.0
pymupdf==1.25.5
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.7.0
460
results/2023-annual-report_structure.json
Normal file
@ -0,0 +1,460 @@
[
    {
        "title": "Preface",
        "start_index": 1,
        "end_index": 4
    },
    {
        "title": "About the Federal Reserve",
        "start_index": 5,
        "end_index": 7
    },
    {
        "title": "Overview",
        "start_index": 7,
        "end_index": 8
    },
    {
        "title": "Monetary Policy and Economic Developments",
        "start_index": 9,
        "end_index": 9,
        "child_nodes": [
            {
                "title": "March 2024 Summary",
                "start_index": 9,
                "end_index": 14
            },
            {
                "title": "June 2023 Summary",
                "start_index": 15,
                "end_index": 20
            }
        ]
    },
    {
        "title": "Financial Stability",
        "start_index": 21,
        "end_index": 21,
        "child_nodes": [
            {
                "title": "Monitoring Financial Vulnerabilities",
                "start_index": 22,
                "end_index": 28
            },
            {
                "title": "Domestic and International Cooperation and Coordination",
                "start_index": 28,
                "end_index": 31
            }
        ]
    },
    {
        "title": "Supervision and Regulation",
        "start_index": 31,
        "end_index": 31,
        "child_nodes": [
            {
                "title": "Supervised and Regulated Institutions",
                "start_index": 32,
                "end_index": 35
            },
            {
                "title": "Supervisory Developments",
                "start_index": 35,
                "end_index": 54
            },
            {
                "title": "Regulatory Developments",
                "start_index": 55,
                "end_index": 59
            }
        ]
    },
    {
        "title": "Payment System and Reserve Bank Oversight",
        "start_index": 59,
        "end_index": 59,
        "child_nodes": [
            {
                "title": "Payment Services to Depository and Other Institutions",
                "start_index": 60,
                "end_index": 65
            },
            {
                "title": "Currency and Coin",
                "start_index": 66,
                "end_index": 68
            },
            {
                "title": "Fiscal Agency and Government Depository Services",
                "start_index": 69,
                "end_index": 72
            },
            {
                "title": "Evolutions and Improvements to the System",
                "start_index": 72,
                "end_index": 75
            },
            {
                "title": "Oversight of Federal Reserve Banks",
                "start_index": 75,
                "end_index": 81
            },
            {
                "title": "Pro Forma Financial Statements for Federal Reserve Priced Services",
                "start_index": 82,
                "end_index": 88
            }
        ]
    },
    {
        "title": "Consumer and Community Affairs",
        "start_index": 89,
        "end_index": 89,
        "child_nodes": [
            {
                "title": "Consumer Compliance Supervision",
                "start_index": 89,
                "end_index": 101
            },
            {
                "title": "Consumer Laws and Regulations",
                "start_index": 101,
                "end_index": 102
            },
            {
                "title": "Consumer Research and Analysis of Emerging Issues and Policy",
                "start_index": 102,
                "end_index": 105
            },
            {
                "title": "Community Development",
                "start_index": 105,
                "end_index": 106
            }
        ]
    },
    {
        "title": "Appendixes",
        "start_index": 107,
        "end_index": 108
    },
    {
        "title": "Federal Reserve System Organization",
        "start_index": 109,
        "end_index": 109,
        "child_nodes": [
            {
                "title": "Board of Governors",
                "start_index": 109,
                "end_index": 116
            },
            {
                "title": "Federal Open Market Committee",
                "start_index": 117,
                "end_index": 118
            },
            {
                "title": "Board of Governors Advisory Councils",
                "start_index": 119,
                "end_index": 122
            },
            {
                "title": "Federal Reserve Banks and Branches",
                "start_index": 123,
                "end_index": 146
            }
        ]
    },
    {
        "title": "Minutes of Federal Open Market Committee Meetings",
        "start_index": 147,
        "end_index": 147,
        "child_nodes": [
            {
                "title": "Meeting Minutes",
                "start_index": 147,
                "end_index": 149
            }
        ]
    },
    {
        "title": "Federal Reserve System Audits",
        "start_index": 149,
        "end_index": 149,
        "child_nodes": [
            {
                "title": "Office of Inspector General Activities",
                "start_index": 149,
                "end_index": 151
            },
            {
                "title": "Government Accountability Office Reviews",
                "start_index": 151,
                "end_index": 153
            }
        ]
    },
    {
        "title": "Federal Reserve System Budgets",
        "start_index": 153,
        "end_index": 153,
        "child_nodes": [
            {
                "title": "System Budgets Overview",
                "start_index": 153,
                "end_index": 157
            },
            {
                "title": "Board of Governors Budgets",
                "start_index": 157,
                "end_index": 163
            },
            {
                "title": "Federal Reserve Banks Budgets",
                "start_index": 163,
                "end_index": 169
            },
            {
                "title": "Currency Budget",
                "start_index": 169,
                "end_index": 174
            }
        ]
    },
    {
        "title": "Record of Policy Actions of the Board of Governors",
        "start_index": 175,
        "end_index": 175,
        "child_nodes": [
            {
                "title": "Rules and Regulations",
                "start_index": 175,
                "end_index": 176
            },
            {
                "title": "Policy Statements and Other Actions",
                "start_index": 177,
                "end_index": 181
            },
            {
                "title": "Discount Rates for Depository Institutions in 2023",
                "start_index": 181,
                "end_index": 183
            },
            {
                "title": "The Board of Governors and the Government Performance and Results Act",
                "start_index": 184,
                "end_index": 184
            }
        ]
    },
    {
        "title": "Litigation",
        "start_index": 185,
        "end_index": 185,
        "child_nodes": [
            {
                "title": "Pending",
                "start_index": 185,
                "end_index": 186
            },
            {
                "title": "Resolved",
                "start_index": 186,
                "end_index": 186
            }
        ]
    },
    {
        "title": "Statistical Tables",
        "start_index": 187,
        "end_index": 187,
        "child_nodes": [
            {
                "title": "Federal Reserve open market transactions, 2023",
                "start_index": 187,
                "end_index": 187,
                "child_nodes": [
                    {
                        "title": "Federal Reserve open market transactions, 2023\u2014continued",
                        "start_index": 187,
                        "end_index": 188
                    }
                ]
            },
            {
                "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323",
                "start_index": 189,
                "end_index": 188,
                "child_nodes": [
                    {
                        "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323\u2014continued",
                        "start_index": 189,
                        "end_index": 190
                    }
                ]
            },
            {
                "title": "Reserve requirements of depository institutions, December 31, 2023",
                "start_index": 191,
                "end_index": 191
            },
            {
                "title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023",
                "start_index": 192,
                "end_index": 192
            },
            {
                "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023",
                "start_index": 193,
                "end_index": 194,
                "child_nodes": [
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
                        "start_index": 194,
                        "end_index": 194
                    },
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
                        "start_index": 195,
                        "end_index": 196
                    },
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
                        "start_index": 196,
                        "end_index": 196
                    }
                ]
            },
            {
                "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983",
                "start_index": 197,
                "end_index": 198,
                "child_nodes": [
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
                        "start_index": 199,
                        "end_index": 198
                    },
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
                        "start_index": 199,
                        "end_index": 198
                    },
                    {
                        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
                        "start_index": 199,
                        "end_index": 200
                    }
                ]
            },
            {
                "title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022",
                "start_index": 201,
                "end_index": 201
            },
            {
                "title": "Initial margin requirements under Regulations T, U, and X",
                "start_index": 202,
                "end_index": 203
            },
            {
                "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022",
                "start_index": 203,
                "end_index": 206,
                "child_nodes": [
                    {
                        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
                        "start_index": 206,
                        "end_index": 206
                    },
                    {
                        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
                        "start_index": 206,
                        "end_index": 206
                    },
                    {
                        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
                        "start_index": 206,
                        "end_index": 206
                    },
                    {
                        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
                        "start_index": 206,
                        "end_index": 206
                    },
                    {
                        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
                        "start_index": 206,
                        "end_index": 209
                    }
                ]
            },
            {
                "title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022",
                "start_index": 209,
                "end_index": 210
            },
            {
                "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023",
                "start_index": 210,
                "end_index": 211,
                "child_nodes": [
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
                        "start_index": 211,
                        "end_index": 212
                    },
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
                        "start_index": 212,
                        "end_index": 212
                    },
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
                        "start_index": 212,
                        "end_index": 214
                    }
                ]
            },
            {
                "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023",
                "start_index": 214,
                "end_index": 214,
                "child_nodes": [
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
                        "start_index": 214,
                        "end_index": 214
                    },
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
                        "start_index": 214,
                        "end_index": 217
                    },
                    {
                        "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
                        "start_index": 217,
                        "end_index": 217
                    }
                ]
            },
            {
                "title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323",
                "start_index": 218,
                "end_index": 218
            },
            {
                "title": "Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023",
                "start_index": 219,
                "end_index": 220
            },
            {
                "title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023",
                "start_index": 220,
                "end_index": 222
            }
        ]
    }
]
1558
results/PRML_structure.json
Normal file
File diff suppressed because it is too large
@ -0,0 +1,51 @@
[
    {
        "title": "Preface",
        "start_index": 1,
        "end_index": 2
    },
    {
        "title": "Introduction",
        "start_index": 2,
        "end_index": 6
    },
    {
        "title": "Interpretation and Application",
        "start_index": 6,
        "end_index": 8,
        "child_nodes": [
            {
                "title": "Historical Context and Legislative History",
                "start_index": 8,
                "end_index": 10
            },
            {
                "title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion",
                "start_index": 10,
                "end_index": 14
            },
            {
                "title": "Guidance on Applying the Interpretation of the Solely Incidental Prong",
                "start_index": 14,
                "end_index": 22
            }
        ]
    },
    {
        "title": "Economic Considerations",
        "start_index": 22,
        "end_index": 22,
        "child_nodes": [
            {
                "title": "Background",
                "start_index": 22,
                "end_index": 23
            },
            {
                "title": "Potential Economic Effects",
                "start_index": 23,
                "end_index": 28
            }
        ]
    }
]
466
results/Regulation Best Interest_proposed rule_structure.json
Normal file
@ -0,0 +1,466 @@
[
    {
        "title": "Preface",
        "start_index": 1,
        "end_index": 6
    },
    {
        "title": "INTRODUCTION",
        "start_index": 6,
        "end_index": 12,
        "child_nodes": [
            {
                "title": "Background",
                "start_index": 12,
                "end_index": 22,
                "child_nodes": [
                    {
                        "title": "Evaluation of Standards of Conduct Applicable to Investment Advice",
                        "start_index": 22,
                        "end_index": 26
                    },
                    {
                        "title": "DOL Rulemaking",
                        "start_index": 26,
                        "end_index": 32
                    },
                    {
                        "title": "Statement by Chairman Clayton",
                        "start_index": 32,
                        "end_index": 36
                    }
                ]
            },
            {
                "title": "General Objectives of Proposed Approach",
                "start_index": 36,
                "end_index": 44
            }
        ]
    },
    {
        "title": "DISCUSSION OF REGULATION BEST INTEREST",
        "start_index": 44,
        "end_index": 44,
        "child_nodes": [
            {
                "title": "Overview of Regulation Best Interest",
                "start_index": 44,
                "end_index": 50
            },
            {
                "title": "Best Interest, Generally",
                "start_index": 50,
                "end_index": 58,
                "child_nodes": [
                    {
                        "title": "Consistency with Other Approaches",
                        "start_index": 58,
                        "end_index": 66
                    },
                    {
                        "title": "Request for Comment on the Best Interest Obligation",
                        "start_index": 66,
                        "end_index": 71
                    }
                ]
            },
            {
                "title": "Key Terms and Scope of Best Interest Obligation",
                "start_index": 71,
                "end_index": 71,
                "child_nodes": [
                    {
                        "title": "Natural Person who is an Associated Person",
                        "start_index": 71,
                        "end_index": 72
                    },
                    {
                        "title": "When Making a Recommendation, At Time Recommendation is Made",
                        "start_index": 72,
                        "end_index": 82
                    },
                    {
                        "title": "Any Securities Transaction or Investment Strategy",
                        "start_index": 82,
                        "end_index": 83
                    },
                    {
                        "title": "Retail Customer",
                        "start_index": 83,
                        "end_index": 90
                    },
                    {
                        "title": "Request for Comment on Key Terms and Scope of Best Interest Obligation",
                        "start_index": 90,
                        "end_index": 96
                    }
                ]
            },
            {
                "title": "Components of Regulation Best Interest",
                "start_index": 96,
                "end_index": 97,
                "child_nodes": [
                    {
                        "title": "Disclosure Obligation",
                        "start_index": 97,
                        "end_index": 133
                    },
                    {
                        "title": "Care Obligation",
                        "start_index": 133,
                        "end_index": 166
                    },
                    {
                        "title": "Conflict of Interest Obligations",
                        "start_index": 166,
                        "end_index": 196
                    }
                ]
            },
            {
                "title": "Recordkeeping and Retention",
                "start_index": 196,
                "end_index": 199
            },
            {
                "title": "Whether the Exercise of Investment Discretion Should be Viewed as Solely Incidental to the Business of a Broker or Dealer",
                "start_index": 199,
                "end_index": 209
            }
        ]
    },
    {
        "title": "REQUEST FOR COMMENT",
        "start_index": 209,
        "end_index": 210,
        "child_nodes": [
            {
                "title": "Generally",
                "start_index": 210,
                "end_index": 212
            },
            {
                "title": "Interactions with Other Standards of Conduct",
                "start_index": 212,
                "end_index": 214
            }
        ]
    },
    {
        "title": "ECONOMIC ANALYSIS",
        "start_index": 214,
        "end_index": 214,
        "child_nodes": [
            {
                "title": "Introduction, Primary Goals of Proposed Regulations and Broad Economic Considerations",
                "start_index": 214,
                "end_index": 214,
                "child_nodes": [
                    {
                        "title": "Introduction and Primary Goals of Proposed Regulation",
                        "start_index": 214,
                        "end_index": 215
                    },
                    {
                        "title": "Broad Economic Considerations",
                        "start_index": 215,
                        "end_index": 225
                    }
                ]
            },
            {
                "title": "Economic Baseline",
                "start_index": 225,
                "end_index": 225,
                "child_nodes": [
                    {
                        "title": "Market for Advice Services",
                        "start_index": 225,
                        "end_index": 246
                    },
                    {
                        "title": "Regulatory Baseline",
                        "start_index": 246,
                        "end_index": 255
                    }
                ]
            },
            {
                "title": "Benefits, Costs, and Effects on Efficiency, Competition, and Capital Formation",
                "start_index": 255,
                "end_index": 258,
                "child_nodes": [
                    {
                        "title": "Benefits",
                        "start_index": 258,
                        "end_index": 272
                    },
                    {
                        "title": "Costs",
                        "start_index": 272,
                        "end_index": 275,
                        "child_nodes": [
                            {
                                "title": "Standard of Conduct Defined as Best Interest",
                                "start_index": 275,
                                "end_index": 275,
                                "child_nodes": [
                                    {
                                        "title": "Operational Costs",
                                        "start_index": 275,
                                        "end_index": 277
                                    },
                                    {
                                        "title": "Programmatic Costs",
                                        "start_index": 278,
                                        "end_index": 280
                                    }
                                ]
                            },
                            {
                                "title": "Disclosure Obligation",
                                "start_index": 280,
                                "end_index": 286
                            },
                            {
                                "title": "Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence in Making a Recommendation",
                                "start_index": 286,
                                "end_index": 290
                            },
                            {
                                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and at a Minimum Disclose, or Eliminate, All Material Conflicts of Interest Associated with a Recommendation",
                                "start_index": 290,
                                "end_index": 295,
                                "child_nodes": [
                                    {
                                        "title": "Eliminate Material Conflicts of Interest Associated with a Recommendation",
                                        "start_index": 295,
                                        "end_index": 297
                                    },
                                    {
                                        "title": "At a Minimum Disclose Material Conflicts of Interest Associated with a Recommendation",
                                        "start_index": 297,
                                        "end_index": 299
                                    }
                                ]
                            },
                            {
                                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and Disclose and Mitigate, or Eliminate, Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
                                "start_index": 299,
                                "end_index": 300,
                                "child_nodes": [
                                    {
                                        "title": "Eliminate Material Conflicts Arising from Financial Incentives Associated with a Recommendation",
                                        "start_index": 300,
                                        "end_index": 304
                                    },
                                    {
                                        "title": "Disclose and Mitigate Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
                                        "start_index": 304,
                                        "end_index": 316
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "title": "Effects on Efficiency, Competition, and Capital Formation",
                "start_index": 316,
                "end_index": 324
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Reasonable Alternatives",
|
||||||
|
"start_index": 324,
|
||||||
|
"end_index": 325,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Disclosure-Only Alternative",
|
||||||
|
"start_index": 325,
|
||||||
|
"end_index": 327
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Principles-Based Standard of Conduct Obligation",
|
||||||
|
"start_index": 327,
|
||||||
|
"end_index": 328
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "A Fiduciary Standard for Broker-Dealers",
|
||||||
|
"start_index": 328,
|
||||||
|
"end_index": 332
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Enhanced Standards Akin to Conditions of the BIC Exemption",
|
||||||
|
"start_index": 332,
|
||||||
|
"end_index": 335
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Request for Comment",
|
||||||
|
"start_index": 335,
|
||||||
|
"end_index": 338
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "PAPERWORK REDUCTION ACT ANALYSIS",
|
||||||
|
"start_index": 338,
|
||||||
|
"end_index": 340,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Respondents Subject to Proposed Regulation Best Interest and Proposed Amendments to Rule 17a-3(a)(25), Rule 17a-4(e)(5)",
|
||||||
|
"start_index": 340,
|
||||||
|
"end_index": 340,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Broker-Dealers",
|
||||||
|
"start_index": 340,
|
||||||
|
"end_index": 340
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Natural Persons Who Are Associated Persons of Broker-Dealers",
|
||||||
|
"start_index": 340,
|
||||||
|
"end_index": 341
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Summary of Collections of Information",
|
||||||
|
"start_index": 341,
|
||||||
|
"end_index": 342,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Conflict of Interest Obligations",
|
||||||
|
"start_index": 342,
|
||||||
|
"end_index": 353
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Disclosure Obligation",
|
||||||
|
"start_index": 353,
|
||||||
|
"end_index": 370
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Care Obligation",
|
||||||
|
"start_index": 370,
|
||||||
|
"end_index": 370
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Record-Making and Recordkeeping Obligations",
|
||||||
|
"start_index": 370,
|
||||||
|
"end_index": 375
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Collection of Information is Mandatory",
|
||||||
|
"start_index": 375,
|
||||||
|
"end_index": 375
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Confidentiality",
|
||||||
|
"start_index": 375,
|
||||||
|
"end_index": 376
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Request for Comment",
|
||||||
|
"start_index": 376,
|
||||||
|
"end_index": 377
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "SMALL BUSINESS REGULATORY ENFORCEMENT FAIRNESS ACT",
|
||||||
|
"start_index": 377,
|
||||||
|
"end_index": 378
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "INITIAL REGULATORY FLEXIBILITY ACT ANALYSIS",
|
||||||
|
"start_index": 378,
|
||||||
|
"end_index": 379,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Reasons for and Objectives of the Proposed Action",
|
||||||
|
"start_index": 379,
|
||||||
|
"end_index": 381
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Legal Basis",
|
||||||
|
"start_index": 381,
|
||||||
|
"end_index": 381
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Small Entities Subject to the Proposed Rule",
|
||||||
|
"start_index": 381,
|
||||||
|
"end_index": 382
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Projected Compliance Requirements of the Proposed Rule for Small Entities",
|
||||||
|
"start_index": 382,
|
||||||
|
"end_index": 383,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Conflict of Interest Obligations",
|
||||||
|
"start_index": 383,
|
||||||
|
"end_index": 386
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Disclosure Obligations",
|
||||||
|
"start_index": 387,
|
||||||
|
"end_index": 394
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Obligation to Exercise Reasonable Diligence, Care, Skill and Prudence",
|
||||||
|
"start_index": 394,
|
||||||
|
"end_index": 394
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Record-Making and Recordkeeping Obligations",
|
||||||
|
"start_index": 394,
|
||||||
|
"end_index": 397
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Duplicative, Overlapping, or Conflicting Federal Rules",
|
||||||
|
"start_index": 397,
|
||||||
|
"end_index": 398
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Significant Alternatives",
|
||||||
|
"start_index": 398,
|
||||||
|
"end_index": 401,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Disclosure-Only Alternative",
|
||||||
|
"start_index": 401,
|
||||||
|
"end_index": 401
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Principles-Based Alternative",
|
||||||
|
"start_index": 401,
|
||||||
|
"end_index": 402
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Enhanced Standards Akin to BIC Exemption",
|
||||||
|
"start_index": 402,
|
||||||
|
"end_index": 403
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "General Request for Comment",
|
||||||
|
"start_index": 403,
|
||||||
|
"end_index": 403
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "STATUTORY AUTHORITY AND TEXT OF PROPOSED RULE",
|
||||||
|
"start_index": 403,
|
||||||
|
"end_index": 408
|
||||||
|
}
|
||||||
|
]
|
||||||
220
results/q1-fy25-earnings_structure.json
Normal file
|
|
@@ -0,0 +1,220 @@
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025",
|
||||||
|
"start_index": 1,
|
||||||
|
"end_index": 1,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Financial Results for the Quarter",
|
||||||
|
"start_index": 1,
|
||||||
|
"end_index": 1,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Key Points",
|
||||||
|
"start_index": 1,
|
||||||
|
"end_index": 1
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Guidance and Outlook",
|
||||||
|
"start_index": 2,
|
||||||
|
"end_index": 2,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Star India deconsolidated in Q1",
|
||||||
|
"start_index": 2,
|
||||||
|
"end_index": 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Q2 Fiscal 2025",
|
||||||
|
"start_index": 2,
|
||||||
|
"end_index": 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Fiscal Year 2025",
|
||||||
|
"start_index": 2,
|
||||||
|
"end_index": 2
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Message From Our CEO",
|
||||||
|
"start_index": 2,
|
||||||
|
"end_index": 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "SUMMARIZED FINANCIAL RESULTS",
|
||||||
|
"start_index": 3,
|
||||||
|
"end_index": 3,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "SUMMARIZED SEGMENT FINANCIAL RESULTS",
|
||||||
|
"start_index": 3,
|
||||||
|
"end_index": 3
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS",
|
||||||
|
"start_index": 4,
|
||||||
|
"end_index": 4,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Star India",
|
||||||
|
"start_index": 4,
|
||||||
|
"end_index": 4
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Entertainment",
|
||||||
|
"start_index": 4,
|
||||||
|
"end_index": 4,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Linear Networks",
|
||||||
|
"start_index": 5,
|
||||||
|
"end_index": 5
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Direct-to-Consumer",
|
||||||
|
"start_index": 5,
|
||||||
|
"end_index": 7
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Content Sales/Licensing and Other",
|
||||||
|
"start_index": 7,
|
||||||
|
"end_index": 7
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Sports",
|
||||||
|
"start_index": 7,
|
||||||
|
"end_index": 7,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Domestic ESPN",
|
||||||
|
"start_index": 8,
|
||||||
|
"end_index": 8
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "International ESPN",
|
||||||
|
"start_index": 8,
|
||||||
|
"end_index": 8
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Star India",
|
||||||
|
"start_index": 8,
|
||||||
|
"end_index": 8
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Experiences",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Domestic Parks and Experiences",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "International Parks and Experiences",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "OTHER FINANCIAL INFORMATION",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9,
|
||||||
|
"child_nodes": [
|
||||||
|
{
|
||||||
|
"title": "Corporate and Unallocated Shared Expenses",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Restructuring and Impairment Charges",
|
||||||
|
"start_index": 9,
|
||||||
|
"end_index": 9
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Interest Expense, net",
|
||||||
|
"start_index": 10,
|
||||||
|
"end_index": 10
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Equity in the Income of Investees",
|
||||||
|
"start_index": 10,
|
||||||
|
"end_index": 10
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Income Taxes",
|
||||||
|
"start_index": 10,
|
||||||
|
"end_index": 10
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Noncontrolling Interests",
|
||||||
|
"start_index": 11,
|
||||||
|
"end_index": 11
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Cash from Operations",
|
||||||
|
"start_index": 11,
|
||||||
|
"end_index": 11
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Capital Expenditures",
|
||||||
|
"start_index": 12,
|
||||||
|
"end_index": 12
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Depreciation Expense",
|
||||||
|
"start_index": 12,
|
||||||
|
"end_index": 12
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME",
|
||||||
|
"start_index": 13,
|
||||||
|
"end_index": 13
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS",
|
||||||
|
"start_index": 14,
|
||||||
|
"end_index": 14
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS",
|
||||||
|
"start_index": 15,
|
||||||
|
"end_index": 15
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS",
|
||||||
|
"start_index": 16,
|
||||||
|
"end_index": 16
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "NON-GAAP FINANCIAL MEASURES",
|
||||||
|
"start_index": 17,
|
||||||
|
"end_index": 20
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "FORWARD-LOOKING STATEMENTS",
|
||||||
|
"start_index": 21,
|
||||||
|
"end_index": 21
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION",
|
||||||
|
"start_index": 22,
|
||||||
|
"end_index": 22
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
524
utils.py
Normal file
|
|
@@ -0,0 +1,524 @@
|
||||||
|
import tiktoken
import openai
import logging
import os
import re  # needed by get_first_start_page_from_text and get_last_start_page_from_text
from datetime import datetime
import time
import json
import PyPDF2
import copy
import asyncio
import pymupdf
from io import BytesIO
|
||||||
|
|
||||||
|
|
||||||
|
def count_tokens(text, model):
|
||||||
|
enc = tiktoken.encoding_for_model(model)
|
||||||
|
tokens = enc.encode(text)
|
||||||
|
return len(tokens)
|
||||||
|
|
||||||
|
def ChatGPT_API_with_finish_reason(model, prompt, api_key, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)

    # Build the message list once, so a retry does not append the prompt again.
    if chat_history:
        messages = chat_history
        messages.append({"role": "user", "content": prompt})
    else:
        messages = [{"role": "user", "content": prompt}]

    for i in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            if response.choices[0].finish_reason == "length":
                return response.choices[0].message.content, "max_output_reached"
            else:
                return response.choices[0].message.content, "finished"

        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                # Return a 2-tuple so callers can unpack (content, finish_reason)
                # on the failure path as well.
                return "Error", "error"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def ChatGPT_API(model, prompt, api_key, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)

    # Build the message list once, so a retry does not append the prompt again.
    if chat_history:
        messages = chat_history
        messages.append({"role": "user", "content": prompt})
    else:
        messages = [{"role": "user", "content": prompt}]

    for i in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )

            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"
|
||||||
|
|
||||||
|
|
||||||
|
async def ChatGPT_API_async(model, prompt, api_key):
    max_retries = 10
    client = openai.AsyncOpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            messages = [{"role": "user", "content": prompt}]
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                await asyncio.sleep(1)  # Wait 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"
|
||||||
|
|
||||||
|
def get_json_content(response):
|
||||||
|
start_idx = response.find("```json")
|
||||||
|
if start_idx != -1:
|
||||||
|
start_idx += 7
|
||||||
|
response = response[start_idx:]
|
||||||
|
|
||||||
|
end_idx = response.rfind("```")
|
||||||
|
if end_idx != -1:
|
||||||
|
response = response[:end_idx]
|
||||||
|
|
||||||
|
json_content = response.strip()
|
||||||
|
return json_content
|
||||||
|
|
||||||
|
|
||||||
|
def extract_json(content):
    try:
        # First, try to extract JSON enclosed within ```json and ```
        start_idx = content.find("```json")
        if start_idx != -1:
            start_idx += 7  # Adjust index to start after the delimiter
            end_idx = content.rfind("```")
            json_content = content[start_idx:end_idx].strip()
        else:
            # If no delimiters, assume entire content could be JSON
            json_content = content.strip()

        # Clean up common issues that might cause parsing errors
        json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null
        json_content = json_content.replace('\n', ' ').replace('\r', ' ')  # Replace newlines with spaces
        json_content = ' '.join(json_content.split())  # Normalize whitespace

        # Attempt to parse and return the JSON object
        return json.loads(json_content)
    except json.JSONDecodeError as e:
        logging.error(f"Failed to extract JSON: {e}")
        # Try to clean up the content further if initial parsing fails
        try:
            # Remove any trailing commas before closing brackets/braces
            json_content = json_content.replace(',]', ']').replace(',}', '}')
            return json.loads(json_content)
        except json.JSONDecodeError:
            # Catch only parse failures; a bare except would also swallow KeyboardInterrupt
            logging.error("Failed to parse JSON even after cleanup")
            return {}
    except Exception as e:
        logging.error(f"Unexpected error while extracting JSON: {e}")
        return {}
|
||||||
|
|
||||||
|
def write_node_id(data, node_id=0):
|
||||||
|
if isinstance(data, dict):
|
||||||
|
data['node_id'] = str(node_id).zfill(4)
|
||||||
|
node_id += 1
|
||||||
|
for key in list(data.keys()):
|
||||||
|
if 'child_nodes' in key:
|
||||||
|
node_id = write_node_id(data[key], node_id)
|
||||||
|
elif isinstance(data, list):
|
||||||
|
for index in range(len(data)):
|
||||||
|
node_id = write_node_id(data[index], node_id)
|
||||||
|
return node_id
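`write_node_id` numbers nodes depth-first: a parent receives its ID before its children, and the function returns the next unused counter value. The sketch below re-implements the same idea in a self-contained form to show the resulting numbering. The name `assign_node_ids` and the sample tree are illustrative assumptions, and unlike the original the sketch only follows the exact key `child_nodes`.

```python
# Hypothetical stand-in for write_node_id, kept self-contained for illustration.
def assign_node_ids(data, node_id=0):
    """Assign zero-padded, pre-order IDs and return the next free counter."""
    if isinstance(data, dict):
        data["node_id"] = str(node_id).zfill(4)
        node_id += 1
        # Descend into children (if any) before moving on to siblings.
        node_id = assign_node_ids(data.get("child_nodes", []), node_id)
    elif isinstance(data, list):
        for item in data:
            node_id = assign_node_ids(item, node_id)
    return node_id


tree = [
    {
        "title": "Costs",
        "child_nodes": [
            {"title": "Operational Costs"},
            {"title": "Programmatic Costs"},
        ],
    },
    {"title": "Benefits"},
]

total = assign_node_ids(tree)
print(total, [node["node_id"] for node in tree])
```

Because the counter is threaded through the return value, siblings that follow a deep subtree continue from wherever that subtree stopped.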
|
||||||
|
|
||||||
|
def get_nodes(structure):
|
||||||
|
if isinstance(structure, dict):
|
||||||
|
structure_node = copy.deepcopy(structure)
|
||||||
|
structure_node.pop('child_nodes', None)
|
||||||
|
nodes = [structure_node]
|
||||||
|
for key in list(structure.keys()):
|
||||||
|
if 'child_nodes' in key:
|
||||||
|
nodes.extend(get_nodes(structure[key]))
|
||||||
|
return nodes
|
||||||
|
elif isinstance(structure, list):
|
||||||
|
nodes = []
|
||||||
|
for item in structure:
|
||||||
|
nodes.extend(get_nodes(item))
|
||||||
|
return nodes
|
||||||
|
|
||||||
|
def structure_to_list(structure):
|
||||||
|
if isinstance(structure, dict):
|
||||||
|
nodes = []
|
||||||
|
nodes.append(structure)
|
||||||
|
if 'child_nodes' in structure:
|
||||||
|
nodes.extend(structure_to_list(structure['child_nodes']))
|
||||||
|
return nodes
|
||||||
|
elif isinstance(structure, list):
|
||||||
|
nodes = []
|
||||||
|
for item in structure:
|
||||||
|
nodes.extend(structure_to_list(item))
|
||||||
|
return nodes
|
||||||
|
|
||||||
|
|
||||||
|
def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        # Use .get(): list_to_tree deletes empty 'child_nodes' keys, so leaves may lack the key
        if not structure.get('child_nodes'):
            structure_node = copy.deepcopy(structure)
            structure_node.pop('child_nodes', None)
            return [structure_node]
        else:
            leaf_nodes = []
            for key in list(structure.keys()):
                if 'child_nodes' in key:
                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
            return leaf_nodes
    elif isinstance(structure, list):
        leaf_nodes = []
        for item in structure:
            leaf_nodes.extend(get_leaf_nodes(item))
        return leaf_nodes
|
||||||
|
|
||||||
|
def is_leaf_node(data, node_id):
|
||||||
|
# Helper function to find the node by its node_id
|
||||||
|
def find_node(data, node_id):
|
||||||
|
if isinstance(data, dict):
|
||||||
|
if data.get('node_id') == node_id:
|
||||||
|
return data
|
||||||
|
for key in data.keys():
|
||||||
|
if 'child_nodes' in key:
|
||||||
|
result = find_node(data[key], node_id)
|
||||||
|
if result:
|
||||||
|
return result
|
||||||
|
elif isinstance(data, list):
|
||||||
|
for item in data:
|
||||||
|
result = find_node(item, node_id)
|
||||||
|
if result:
|
||||||
|
return result
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Find the node with the given node_id
|
||||||
|
node = find_node(data, node_id)
|
||||||
|
|
||||||
|
# Check if the node is a leaf node
|
||||||
|
if node and not node.get('child_nodes'):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
def get_last_node(structure):
|
||||||
|
return structure[-1]
|
||||||
|
|
||||||
|
|
||||||
|
def extract_text_from_pdf(pdf_path):
|
||||||
|
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||||
|
###return text not list
|
||||||
|
text=""
|
||||||
|
for page_num in range(len(pdf_reader.pages)):
|
||||||
|
page = pdf_reader.pages[page_num]
|
||||||
|
text+=page.extract_text()
|
||||||
|
return text
|
||||||
|
|
||||||
|
def get_pdf_title(pdf_path):
|
||||||
|
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||||
|
meta = pdf_reader.metadata
|
||||||
|
title = meta.title
|
||||||
|
return title
|
||||||
|
|
||||||
|
def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
|
||||||
|
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||||
|
text = ""
|
||||||
|
for page_num in range(start_page-1, end_page):
|
||||||
|
page = pdf_reader.pages[page_num]
|
||||||
|
page_text = page.extract_text()
|
||||||
|
if tag:
|
||||||
|
text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n"
|
||||||
|
else:
|
||||||
|
text += page_text
|
||||||
|
return text
|
||||||
|
|
||||||
|
def get_first_start_page_from_text(text):
|
||||||
|
start_page = -1
|
||||||
|
start_page_match = re.search(r'<start_index_(\d+)>', text)
|
||||||
|
if start_page_match:
|
||||||
|
start_page = int(start_page_match.group(1))
|
||||||
|
return start_page
|
||||||
|
|
||||||
|
def get_last_start_page_from_text(text):
|
||||||
|
start_page = -1
|
||||||
|
# Find all matches of start_index tags
|
||||||
|
start_page_matches = re.finditer(r'<start_index_(\d+)>', text)
|
||||||
|
# Convert iterator to list and get the last match if any exist
|
||||||
|
matches_list = list(start_page_matches)
|
||||||
|
if matches_list:
|
||||||
|
start_page = int(matches_list[-1].group(1))
|
||||||
|
return start_page
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def sanitize_filename(filename, replacement='-'):
|
||||||
|
# In Linux, only '/' and '\0' (null) are invalid in filenames.
|
||||||
|
# Null can't be represented in strings, so we only handle '/'.
|
||||||
|
return filename.replace('/', replacement)
|
||||||
|
|
||||||
|
class JsonLogger:
|
||||||
|
def __init__(self, file_path):
|
||||||
|
# Extract PDF name without extension for logger name and filename
|
||||||
|
# pdf_name = os.path.splitext(os.path.basename(file_path))[0]
|
||||||
|
if isinstance(file_path, str):
|
||||||
|
pdf_name = os.path.splitext(os.path.basename(file_path))[0]
|
||||||
|
elif isinstance(file_path, BytesIO):
|
||||||
|
pdf_reader = PyPDF2.PdfReader(file_path)
|
||||||
|
meta = pdf_reader.metadata
|
||||||
|
pdf_name = meta.title if meta.title else 'Untitled'
|
||||||
|
pdf_name = sanitize_filename(pdf_name)
|
||||||
|
|
||||||
|
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
|
self.filename = f"{pdf_name}_{current_time}.json"
|
||||||
|
os.makedirs("./logs", exist_ok=True)
|
||||||
|
# Initialize empty list to store all messages
|
||||||
|
self.log_data = []
|
||||||
|
|
||||||
|
def log(self, level, message, **kwargs):
|
||||||
|
if isinstance(message, dict):
|
||||||
|
self.log_data.append(message)
|
||||||
|
else:
|
||||||
|
self.log_data.append({'message': message})
|
||||||
|
# Add new message to the log data
|
||||||
|
|
||||||
|
# Write entire log data to file
|
||||||
|
with open(self._filepath(), "w") as f:
|
||||||
|
json.dump(self.log_data, f, indent=2)
|
||||||
|
|
||||||
|
def info(self, message, **kwargs):
|
||||||
|
self.log("INFO", message, **kwargs)
|
||||||
|
|
||||||
|
def error(self, message, **kwargs):
|
||||||
|
self.log("ERROR", message, **kwargs)
|
||||||
|
|
||||||
|
def debug(self, message, **kwargs):
|
||||||
|
self.log("DEBUG", message, **kwargs)
|
||||||
|
|
||||||
|
def exception(self, message, **kwargs):
|
||||||
|
kwargs["exception"] = True
|
||||||
|
self.log("ERROR", message, **kwargs)
|
||||||
|
|
||||||
|
def _filepath(self):
|
||||||
|
return os.path.join("logs", self.filename)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def list_to_tree(data):
|
||||||
|
def get_parent_structure(structure):
|
||||||
|
"""Helper function to get the parent structure code"""
|
||||||
|
if not structure:
|
||||||
|
return None
|
||||||
|
parts = str(structure).split('.')
|
||||||
|
return '.'.join(parts[:-1]) if len(parts) > 1 else None
|
||||||
|
|
||||||
|
# First pass: Create nodes and track parent-child relationships
|
||||||
|
nodes = {}
|
||||||
|
root_nodes = []
|
||||||
|
|
||||||
|
for item in data:
|
||||||
|
structure = item.get('structure')
|
||||||
|
node = {
|
||||||
|
'title': item.get('title'),
|
||||||
|
'start_index': item.get('start_index'),
|
||||||
|
'end_index': item.get('end_index'),
|
||||||
|
'child_nodes': []
|
||||||
|
}
|
||||||
|
|
||||||
|
nodes[structure] = node
|
||||||
|
|
||||||
|
# Find parent
|
||||||
|
parent_structure = get_parent_structure(structure)
|
||||||
|
|
||||||
|
if parent_structure:
|
||||||
|
# Add as child to parent if parent exists
|
||||||
|
if parent_structure in nodes:
|
||||||
|
nodes[parent_structure]['child_nodes'].append(node)
|
||||||
|
else:
|
||||||
|
root_nodes.append(node)
|
||||||
|
else:
|
||||||
|
# No parent, this is a root node
|
||||||
|
root_nodes.append(node)
|
||||||
|
|
||||||
|
# Helper function to clean empty children arrays
|
||||||
|
def clean_node(node):
|
||||||
|
if not node['child_nodes']:
|
||||||
|
del node['child_nodes']
|
||||||
|
else:
|
||||||
|
for child in node['child_nodes']:
|
||||||
|
clean_node(child)
|
||||||
|
return node
|
||||||
|
|
||||||
|
# Clean and return the tree
|
||||||
|
return [clean_node(node) for node in root_nodes]
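`list_to_tree` relies on dotted `structure` codes: the parent of `"1.2"` is `"1"`, and an item whose parent code has not been seen is treated as a root. The following sketch shows that parent lookup on a small flat list. The name `build_tree` and the sample data are assumptions made for illustration, and the sketch leaves empty `child_nodes` lists in place rather than removing them as the original does.

```python
# Hypothetical, simplified variant of list_to_tree for illustration only.
def build_tree(items):
    nodes, roots = {}, []
    for item in items:
        code = str(item["structure"])
        node = {"title": item["title"], "child_nodes": []}
        nodes[code] = node
        # "1.2.3" -> "1.2"; a top-level code such as "1" yields "".
        parent_code = code.rpartition(".")[0]
        parent = nodes.get(parent_code)
        if parent is not None:
            parent["child_nodes"].append(node)
        else:
            # No parent seen so far: treat the item as a root node.
            roots.append(node)
    return roots


flat = [
    {"structure": "1", "title": "Economic Analysis"},
    {"structure": "1.1", "title": "Benefits"},
    {"structure": "1.2", "title": "Costs"},
    {"structure": "2", "title": "Request for Comment"},
]

tree = build_tree(flat)
print([node["title"] for node in tree])
```

The approach assumes parents appear before their children in the flat list, which is also what the single-pass loop in `list_to_tree` expects.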
|
||||||
|
|
||||||
|
def add_preface_if_needed(data):
|
||||||
|
if not isinstance(data, list) or not data:
|
||||||
|
return data
|
||||||
|
|
||||||
|
if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
|
||||||
|
preface_node = {
|
||||||
|
"structure": "0",
|
||||||
|
"title": "Preface",
|
||||||
|
"physical_index": 1,
|
||||||
|
}
|
||||||
|
data.insert(0, preface_node)
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
    if pdf_parser == "PyPDF2":
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        page_texts = [page.extract_text() for page in pdf_reader.pages]
    elif pdf_parser == "PyMuPDF":
        # PyMuPDF documents are iterated directly, and pages expose get_text()
        # rather than PyPDF2's extract_text().
        with pymupdf.open(pdf_path) as doc:
            page_texts = [page.get_text() for page in doc]
    else:
        raise ValueError(f"Unsupported PDF parser: {pdf_parser}")

    enc = tiktoken.encoding_for_model(model)

    page_list = []
    for page_text in page_texts:
        token_length = len(enc.encode(page_text))
        page_list.append((page_text, token_length))

    return page_list
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
    text = ""
    for page_num in range(start_page-1, end_page):
        page = pdf_pages[page_num]
        # get_page_tokens returns (page_text, token_length) tuples; accept plain strings too
        text += page[0] if isinstance(page, tuple) else page
    return text
|
||||||
|
|
||||||
|
def get_number_of_pages(pdf_path):
|
||||||
|
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||||
|
num = len(pdf_reader.pages)
|
||||||
|
return num
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def post_processing(structure, end_physical_index):
|
||||||
|
# First convert page_number to start_index in flat list
|
||||||
|
for i, item in enumerate(structure):
|
||||||
|
item['start_index'] = item.get('physical_index')
|
||||||
|
if i < len(structure) - 1:
|
||||||
|
if structure[i + 1].get('appear_start') == 'yes':
|
||||||
|
item['end_index'] = structure[i + 1]['physical_index']-1
|
||||||
|
else:
|
||||||
|
item['end_index'] = structure[i + 1]['physical_index']
|
||||||
|
else:
|
||||||
|
item['end_index'] = end_physical_index
|
||||||
|
tree = list_to_tree(structure)
|
||||||
|
if len(tree)!=0:
|
||||||
|
return tree
|
||||||
|
else:
|
||||||
|
### remove appear_start
|
||||||
|
for node in structure:
|
||||||
|
node.pop('appear_start', None)
|
||||||
|
node.pop('physical_index', None)
|
||||||
|
return structure
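The first loop in `post_processing` derives each section's page range from the section that follows it. The sketch below isolates that rule. It assumes that `appear_start == 'yes'` means the next section begins at the top of its page, which is an interpretation of the field name rather than something the source states; the function name `assign_page_ranges` and the sample data are likewise hypothetical.

```python
# Hypothetical extract of the range rule used in post_processing.
def assign_page_ranges(items, last_page):
    for i, item in enumerate(items):
        item["start_index"] = item["physical_index"]
        if i + 1 < len(items):
            nxt = items[i + 1]
            # Assumption: if the next section opens its page, this section ends
            # on the previous page; otherwise both sections share that page.
            shares_page = nxt.get("appear_start") != "yes"
            item["end_index"] = nxt["physical_index"] - (0 if shares_page else 1)
        else:
            # The final section runs to the end of the document.
            item["end_index"] = last_page
    return items


sections = [
    {"title": "A", "physical_index": 1},
    {"title": "B", "physical_index": 3, "appear_start": "yes"},
    {"title": "C", "physical_index": 5, "appear_start": "no"},
]

ranges = [(s["start_index"], s["end_index"]) for s in assign_page_ranges(sections, last_page=8)]
print(ranges)
```

This is why adjacent nodes in the result files often share a boundary page, such as one node ending on page 225 and the next starting there.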
|
||||||
|
|
||||||
|
def clean_structure_post(data):
|
||||||
|
if isinstance(data, dict):
|
||||||
|
data.pop('page_number', None)
|
||||||
|
data.pop('start_index', None)
|
||||||
|
data.pop('end_index', None)
|
||||||
|
if 'child_nodes' in data:
|
||||||
|
clean_structure_post(data['child_nodes'])
|
||||||
|
elif isinstance(data, list):
|
||||||
|
for section in data:
|
||||||
|
clean_structure_post(section)
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
|
def remove_structure_text(data):
|
||||||
|
if isinstance(data, dict):
|
||||||
|
data.pop('text', None)
|
||||||
|
if 'child_nodes' in data:
|
||||||
|
remove_structure_text(data['child_nodes'])
|
||||||
|
elif isinstance(data, list):
|
||||||
|
for item in data:
|
||||||
|
remove_structure_text(item)
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
|
def check_token_limit(structure, limit=110000):
    nodes = structure_to_list(structure)  # renamed from `list` to avoid shadowing the built-in
    for node in nodes:
        num_tokens = count_tokens(node['text'], model='gpt-4o')
        if num_tokens > limit:
            print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
            print("Start Index:", node['start_index'])
            print("End Index:", node['end_index'])
            print("Title:", node['title'])
            print("\n")
|
||||||
|
|
||||||
|
|
||||||
|
def convert_physical_index_to_int(data):
|
||||||
|
if isinstance(data, list):
|
||||||
|
for i in range(len(data)):
|
||||||
|
if isinstance(data[i]['physical_index'], str):
|
||||||
|
if data[i]['physical_index'].startswith('<physical_index_'):
|
||||||
|
data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
|
||||||
|
elif data[i]['physical_index'].startswith('physical_index_'):
|
||||||
|
data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
|
||||||
|
elif isinstance(data, str):
|
||||||
|
if data.startswith('<physical_index_'):
|
||||||
|
data = int(data.split('_')[-1].rstrip('>').strip())
|
||||||
|
elif data.startswith('physical_index_'):
|
||||||
|
data = int(data.split('_')[-1].strip())
|
||||||
|
###check data is int
|
||||||
|
if isinstance(data, int):
|
||||||
|
return data
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
return data
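`convert_physical_index_to_int` accepts tags of the form `<physical_index_N>` or `physical_index_N` and extracts `N`. A regular expression can express the same parsing in one step, as sketched here. The function name `parse_physical_index` is an assumption, and the sketch handles only a single value, not the list case covered by the original.

```python
import re

# Hypothetical single-value parser for tags like "<physical_index_12>".
_TAG = re.compile(r"<?physical_index_(\d+)>?")


def parse_physical_index(value):
    """Return the page number as an int, or None if the value is not a valid tag."""
    if isinstance(value, int):
        return value
    match = _TAG.fullmatch(str(value).strip())
    return int(match.group(1)) if match else None


print(parse_physical_index("<physical_index_12>"))
print(parse_physical_index("physical_index_7"))
print(parse_physical_index("page 3"))
```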
|
||||||
|
|
||||||
|
|
||||||
|
def convert_page_to_int(data):
|
||||||
|
for item in data:
|
||||||
|
if 'page' in item and isinstance(item['page'], str):
|
||||||
|
try:
|
||||||
|
item['page'] = int(item['page'])
|
||||||
|
except ValueError:
|
||||||
|
# Keep original value if conversion fails
|
||||||
|
pass
|
||||||
|
return data
|
||||||