first commit

2026-04-24 23:56:21 +02:00 · 2025-04-01 18:54:08 +08:00 · 2025-04-01 18:54:08 +08:00 · 6f43b477d3
commit 6f43b477d3
17 changed files with 4529 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,15 @@
+.ipynb_checkpoints
+__pycache__
+files
+index
+temp/*
+chroma-collections.parquet
+chroma-embeddings.parquet
+.DS_Store
+.env*
+notebook
+SDK/*
+log/*
+logs/
+parts/*
+json_results/*
--- a/21
+++ b/21
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Vectify AI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,136 @@
+# PageIndex
+
+### **Document Index System for Reasoning-Based RAG**
+
+Traditional vector-based retrieval relies heavily on semantic similarity. But when working with professional documents that require domain expertise and multi-step reasoning, similarity search often falls short.
+
+**Reasoning-Based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by **AlphaGo**, we leverage **tree search** to perform structured document retrieval.
+
+**PageIndex** is an indexing system that builds search trees from long documents, making them ready for reasoning-based RAG.
+
+Built by [Vectify AI](https://vectify.ai/pageindex)
+
+---
+
+## 🔍 What is PageIndex?
+
+**PageIndex** transforms lengthy PDF documents into a semantic **tree structure**, similar to a "table of contents" but optimized for use with Large Language Models (LLMs).
+It’s ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any other document that exceeds LLM context limits.
+
+### ✅ Key Features
+
+- **Scales to Massive Documents**  
+  Designed to handle hundreds or even thousands of pages with ease.
+    
+- **Hierarchical Tree Structure**  
+  Enables LLMs to traverse documents logically—like an intelligent, LLM-optimized table of contents.
+
+- **Precise Page Referencing**  
+  Every node carries its own summary and the physical page indices where it starts and ends, allowing pinpoint retrieval.
+
+- **Chunk-Free Segmentation**  
+  No arbitrary chunking. Nodes follow the natural structure of the document.
+
+---
+
+## 📦 PageIndex Format
+
+Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).
+
+```json
+{
+  "title": "Financial Stability",
+  "node_id": "0006",
+  "start_index": 21,
+  "end_index": 22,
+  "summary": "The Federal Reserve ...",
+  "child_nodes": [
+    {
+      "title": "Monitoring Financial Vulnerabilities",
+      "node_id": "0007",
+      "start_index": 22,
+      "end_index": 28,
+      "summary": "The Federal Reserve's monitoring ..."
+    },
+    {
+      "title": "Domestic and International Cooperation and Coordination",
+      "node_id": "0008",
+      "start_index": 28,
+      "end_index": 31,
+      "summary": "In 2023, the Federal Reserve collaborated ..."
+    }
+  ]
+}
+
+```
+Note: generation of the `node_id` and `summary` fields will be added soon.
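
To make the format concrete, here is a minimal sketch of walking such a tree in Python. The helper names `iter_nodes` and `find_node` are illustrative assumptions and are not part of PageIndex; the sample data is a trimmed copy of the example above, with summaries omitted.

```python
from typing import Iterator, Optional, Tuple

# Trimmed copy of the example node above (summaries omitted).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}


def iter_nodes(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    # Leaf nodes have no "child_nodes" key, so default to an empty list.
    for child in node.get("child_nodes", []):
        yield from iter_nodes(child, depth + 1)


def find_node(node: dict, node_id: str) -> Optional[dict]:
    """Return the first node whose node_id matches, or None if absent."""
    for _, candidate in iter_nodes(node):
        if candidate.get("node_id") == node_id:
            return candidate
    return None


# Print an indented outline with each node's page range.
for depth, n in iter_nodes(tree):
    print("  " * depth + f'{n["title"]} (pages {n["start_index"]}-{n["end_index"]})')
```

Once a node is found, its `start_index` and `end_index` tell you which physical pages to pull from the PDF.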
+
+## 🧠 Reasoning-Based RAG with PageIndex
+
+Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. Great for domain-specific tasks where nuance matters.
+
+### 🛠️ Example Prompt
+
+```python
+prompt = f"""
+You are given a question and a tree structure of a document.
+You need to find all nodes that are likely to contain the answer.
+
+Question: {question}
+
+Document tree structure: {structure}
+
+Reply in the following JSON format:
+{{
+  "thinking": <reasoning about where to look>,
+  "node_list": [node_id1, node_id2, ...]
+}}
+"""
+```
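
The prompt above can be wrapped in a small helper that fills in the question and tree, plus a parser for the JSON reply. This is a sketch under assumptions: `build_prompt` and `parse_node_list` are hypothetical names, the sample reply is hand-written rather than real model output, and the call to the LLM itself is deliberately left out.

```python
import json

# Same wording as the example prompt above; doubled braces survive str.format().
PROMPT_TEMPLATE = """
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
  "thinking": <reasoning about where to look>,
  "node_list": [node_id1, node_id2, ...]
}}
"""


def build_prompt(question: str, structure) -> str:
    """Fill the template; the tree is serialized as indented JSON."""
    return PROMPT_TEMPLATE.format(
        question=question,
        structure=json.dumps(structure, indent=2),
    )


def parse_node_list(reply: str) -> list:
    """Extract node_list from a reply that may wrap the JSON in extra text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    return [str(node_id) for node_id in data.get("node_list", [])]


structure = [{"title": "Financial Stability", "node_id": "0006"}]
prompt = build_prompt("How does the Fed monitor financial risk?", structure)

# Hand-written stand-in for a model reply (assumption, not real output).
sample_reply = 'Here you go: {"thinking": "Look under 0006.", "node_list": ["0006"]}'
print(parse_node_list(sample_reply))
```

Slicing from the first `{` to the last `}` is a simple way to tolerate extra text around the JSON; a stricter parser may be preferable in production.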
+
+## 🚀 Usage
+
+Follow these steps to generate a PageIndex tree from a PDF document.
+
+### 1. Install dependencies
+
+```bash
+pip3 install -r requirements.txt
+```
+
+### 2. Set your OpenAI API key
+
+Create a `.env` file in the root directory and add your API key:
+
+```bash
+CHATGPT_API_KEY=your_openai_key_here
+```
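
If you want to check that the key is visible before running anything, the sketch below reads it using only the standard library. How `page_index.py` itself loads the key is not shown here; `read_env_value` is a hypothetical helper, and it handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def read_env_value(name: str, env_path: str = ".env") -> str:
    """Return `name` from the process environment, else from a KEY=VALUE file."""
    if name in os.environ:
        return os.environ[name]
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() == name:
                return value.strip().strip('"').strip("'")
    raise KeyError(f"{name} not found in environment or {env_path}")


if __name__ == "__main__":
    try:
        read_env_value("CHATGPT_API_KEY")
        print("CHATGPT_API_KEY is set")
    except KeyError as exc:
        print(f"Missing key: {exc}")
```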
+
+### 3. Run PageIndex on your PDF
+
+```bash
+python3 page_index.py --pdf_path /path/to/your/document.pdf
+```
+
+The results will be saved in the `./results/` directory.
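
Once a tree has been generated, it can be loaded and printed as an outline. The following is a rough sketch: `load_structure` and `format_outline` are illustrative names, and the file name assumes the `<document>_structure.json` pattern used by the example results in this repository.

```python
import json
from pathlib import Path


def load_structure(path):
    """Load a generated tree (a JSON list of top-level nodes)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def format_outline(nodes, depth=0):
    """Return one indented line per node: title plus its page range."""
    lines = []
    for node in nodes:
        lines.append(
            f'{"  " * depth}{node["title"]} [{node["start_index"]}-{node["end_index"]}]'
        )
        lines.extend(format_outline(node.get("child_nodes", []), depth + 1))
    return lines


# Small inline sample so the sketch runs without a results file.
sample = [
    {"title": "Preface", "start_index": 1, "end_index": 4},
    {
        "title": "Financial Stability",
        "start_index": 21,
        "end_index": 21,
        "child_nodes": [
            {"title": "Monitoring Financial Vulnerabilities",
             "start_index": 22, "end_index": 28},
        ],
    },
]
print("\n".join(format_outline(sample)))

# With a real run, read the generated file instead (path is an assumption).
result_file = Path("results/2023-annual-report_structure.json")
if result_file.exists():
    print("\n".join(format_outline(load_structure(result_file))))
```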
+
+## 🛤 Roadmap
+
+- [ ]  Add node summary and document selection
+- [ ]  Technical report on PageIndex design
+- [ ]  Efficient tree search algorithms for large documents
+- [ ]  Integration with vector-based semantic retrieval
+
+## 📈 Case Study: Mafin 2.5
+
+[Mafin 2.5](https://vectify.ai/blog/Mafin2.5) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of **PageIndex**, it achieved an impressive **98.7% accuracy** on the [FinanceBench](https://github.com/VectifyAI/Mafin2.5-FinanceBench) benchmark—significantly outperforming traditional vector-based RAG systems.
+
+PageIndex’s hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.
+
+👉 See full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) for detailed comparisons and performance metrics.
+
+## 📬 Contact Us
+
+Need customized support for your documents or reasoning-based RAG system?
+
+👉 [Contact us here](https://ii2abc2jejf.typeform.com/to/meB40zV0)
--- a/init.py
+++ b/init.py
--- a/docs/2023-annual-report.pdf
+++ b/docs/2023-annual-report.pdf
--- a/docs/PRML.pdf
+++ b/docs/PRML.pdf
--- a/docs/Regulation
+++ b/docs/Regulation
--- a/docs/Regulation
+++ b/docs/Regulation
--- a/docs/q1-fy25-earnings.pdf
+++ b/docs/q1-fy25-earnings.pdf
--- a/page_index.py
+++ b/page_index.py
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+openai==1.70.0
+pymupdf==1.25.5
+PyPDF2==3.0.1
+python-dotenv==1.1.0
+tiktoken==0.7.0
--- a/results/2023-annual-report_structure.json
+++ b/results/2023-annual-report_structure.json
@@ -0,0 +1,460 @@
+[
+  {
+    "title": "Preface",
+    "start_index": 1,
+    "end_index": 4
+  },
+  {
+    "title": "About the Federal Reserve",
+    "start_index": 5,
+    "end_index": 7
+  },
+  {
+    "title": "Overview",
+    "start_index": 7,
+    "end_index": 8
+  },
+  {
+    "title": "Monetary Policy and Economic Developments",
+    "start_index": 9,
+    "end_index": 9,
+    "child_nodes": [
+      {
+        "title": "March 2024 Summary",
+        "start_index": 9,
+        "end_index": 14
+      },
+      {
+        "title": "June 2023 Summary",
+        "start_index": 15,
+        "end_index": 20
+      }
+    ]
+  },
+  {
+    "title": "Financial Stability",
+    "start_index": 21,
+    "end_index": 21,
+    "child_nodes": [
+      {
+        "title": "Monitoring Financial Vulnerabilities",
+        "start_index": 22,
+        "end_index": 28
+      },
+      {
+        "title": "Domestic and International Cooperation and Coordination",
+        "start_index": 28,
+        "end_index": 31
+      }
+    ]
+  },
+  {
+    "title": "Supervision and Regulation",
+    "start_index": 31,
+    "end_index": 31,
+    "child_nodes": [
+      {
+        "title": "Supervised and Regulated Institutions",
+        "start_index": 32,
+        "end_index": 35
+      },
+      {
+        "title": "Supervisory Developments",
+        "start_index": 35,
+        "end_index": 54
+      },
+      {
+        "title": "Regulatory Developments",
+        "start_index": 55,
+        "end_index": 59
+      }
+    ]
+  },
+  {
+    "title": "Payment System and Reserve Bank Oversight",
+    "start_index": 59,
+    "end_index": 59,
+    "child_nodes": [
+      {
+        "title": "Payment Services to Depository and Other Institutions",
+        "start_index": 60,
+        "end_index": 65
+      },
+      {
+        "title": "Currency and Coin",
+        "start_index": 66,
+        "end_index": 68
+      },
+      {
+        "title": "Fiscal Agency and Government Depository Services",
+        "start_index": 69,
+        "end_index": 72
+      },
+      {
+        "title": "Evolutions and Improvements to the System",
+        "start_index": 72,
+        "end_index": 75
+      },
+      {
+        "title": "Oversight of Federal Reserve Banks",
+        "start_index": 75,
+        "end_index": 81
+      },
+      {
+        "title": "Pro Forma Financial Statements for Federal Reserve Priced Services",
+        "start_index": 82,
+        "end_index": 88
+      }
+    ]
+  },
+  {
+    "title": "Consumer and Community Affairs",
+    "start_index": 89,
+    "end_index": 89,
+    "child_nodes": [
+      {
+        "title": "Consumer Compliance Supervision",
+        "start_index": 89,
+        "end_index": 101
+      },
+      {
+        "title": "Consumer Laws and Regulations",
+        "start_index": 101,
+        "end_index": 102
+      },
+      {
+        "title": "Consumer Research and Analysis of Emerging Issues and Policy",
+        "start_index": 102,
+        "end_index": 105
+      },
+      {
+        "title": "Community Development",
+        "start_index": 105,
+        "end_index": 106
+      }
+    ]
+  },
+  {
+    "title": "Appendixes",
+    "start_index": 107,
+    "end_index": 108
+  },
+  {
+    "title": "Federal Reserve System Organization",
+    "start_index": 109,
+    "end_index": 109,
+    "child_nodes": [
+      {
+        "title": "Board of Governors",
+        "start_index": 109,
+        "end_index": 116
+      },
+      {
+        "title": "Federal Open Market Committee",
+        "start_index": 117,
+        "end_index": 118
+      },
+      {
+        "title": "Board of Governors Advisory Councils",
+        "start_index": 119,
+        "end_index": 122
+      },
+      {
+        "title": "Federal Reserve Banks and Branches",
+        "start_index": 123,
+        "end_index": 146
+      }
+    ]
+  },
+  {
+    "title": "Minutes of Federal Open Market Committee Meetings",
+    "start_index": 147,
+    "end_index": 147,
+    "child_nodes": [
+      {
+        "title": "Meeting Minutes",
+        "start_index": 147,
+        "end_index": 149
+      }
+    ]
+  },
+  {
+    "title": "Federal Reserve System Audits",
+    "start_index": 149,
+    "end_index": 149,
+    "child_nodes": [
+      {
+        "title": "Office of Inspector General Activities",
+        "start_index": 149,
+        "end_index": 151
+      },
+      {
+        "title": "Government Accountability Office Reviews",
+        "start_index": 151,
+        "end_index": 153
+      }
+    ]
+  },
+  {
+    "title": "Federal Reserve System Budgets",
+    "start_index": 153,
+    "end_index": 153,
+    "child_nodes": [
+      {
+        "title": "System Budgets Overview",
+        "start_index": 153,
+        "end_index": 157
+      },
+      {
+        "title": "Board of Governors Budgets",
+        "start_index": 157,
+        "end_index": 163
+      },
+      {
+        "title": "Federal Reserve Banks Budgets",
+        "start_index": 163,
+        "end_index": 169
+      },
+      {
+        "title": "Currency Budget",
+        "start_index": 169,
+        "end_index": 174
+      }
+    ]
+  },
+  {
+    "title": "Record of Policy Actions of the Board of Governors",
+    "start_index": 175,
+    "end_index": 175,
+    "child_nodes": [
+      {
+        "title": "Rules and Regulations",
+        "start_index": 175,
+        "end_index": 176
+      },
+      {
+        "title": "Policy Statements and Other Actions",
+        "start_index": 177,
+        "end_index": 181
+      },
+      {
+        "title": "Discount Rates for Depository Institutions in 2023",
+        "start_index": 181,
+        "end_index": 183
+      },
+      {
+        "title": "The Board of Governors and the Government Performance and Results Act",
+        "start_index": 184,
+        "end_index": 184
+      }
+    ]
+  },
+  {
+    "title": "Litigation",
+    "start_index": 185,
+    "end_index": 185,
+    "child_nodes": [
+      {
+        "title": "Pending",
+        "start_index": 185,
+        "end_index": 186
+      },
+      {
+        "title": "Resolved",
+        "start_index": 186,
+        "end_index": 186
+      }
+    ]
+  },
+  {
+    "title": "Statistical Tables",
+    "start_index": 187,
+    "end_index": 187,
+    "child_nodes": [
+      {
+        "title": "Federal Reserve open market transactions, 2023",
+        "start_index": 187,
+        "end_index": 187,
+        "child_nodes": [
+          {
+            "title": "Federal Reserve open market transactions, 2023\u2014continued",
+            "start_index": 187,
+            "end_index": 188
+          }
+        ]
+      },
+      {
+        "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323",
+        "start_index": 189,
+        "end_index": 188,
+        "child_nodes": [
+          {
+            "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323\u2014continued",
+            "start_index": 189,
+            "end_index": 190
+          }
+        ]
+      },
+      {
+        "title": "Reserve requirements of depository institutions, December 31, 2023",
+        "start_index": 191,
+        "end_index": 191
+      },
+      {
+        "title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023",
+        "start_index": 192,
+        "end_index": 192
+      },
+      {
+        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023",
+        "start_index": 193,
+        "end_index": 194,
+        "child_nodes": [
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
+            "start_index": 194,
+            "end_index": 194
+          },
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
+            "start_index": 195,
+            "end_index": 196
+          },
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023\u2014continued",
+            "start_index": 196,
+            "end_index": 196
+          }
+        ]
+      },
+      {
+        "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983",
+        "start_index": 197,
+        "end_index": 198,
+        "child_nodes": [
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
+            "start_index": 199,
+            "end_index": 198
+          },
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
+            "start_index": 199,
+            "end_index": 198
+          },
+          {
+            "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983\u2014continued",
+            "start_index": 199,
+            "end_index": 200
+          }
+        ]
+      },
+      {
+        "title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022",
+        "start_index": 201,
+        "end_index": 201
+      },
+      {
+        "title": "Initial margin requirements under Regulations T, U, and X",
+        "start_index": 202,
+        "end_index": 203
+      },
+      {
+        "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022",
+        "start_index": 203,
+        "end_index": 206,
+        "child_nodes": [
+          {
+            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
+            "start_index": 206,
+            "end_index": 206
+          },
+          {
+            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
+            "start_index": 206,
+            "end_index": 206
+          },
+          {
+            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
+            "start_index": 206,
+            "end_index": 206
+          },
+          {
+            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
+            "start_index": 206,
+            "end_index": 206
+          },
+          {
+            "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\u2014continued",
+            "start_index": 206,
+            "end_index": 209
+          }
+        ]
+      },
+      {
+        "title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022",
+        "start_index": 209,
+        "end_index": 210
+      },
+      {
+        "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023",
+        "start_index": 210,
+        "end_index": 211,
+        "child_nodes": [
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
+            "start_index": 211,
+            "end_index": 212
+          },
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
+            "start_index": 212,
+            "end_index": 212
+          },
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
+            "start_index": 212,
+            "end_index": 214
+          }
+        ]
+      },
+      {
+        "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023",
+        "start_index": 214,
+        "end_index": 214,
+        "child_nodes": [
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
+            "start_index": 214,
+            "end_index": 214
+          },
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
+            "start_index": 214,
+            "end_index": 217
+          },
+          {
+            "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
+            "start_index": 217,
+            "end_index": 217
+          }
+        ]
+      },
+      {
+        "title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323",
+        "start_index": 218,
+        "end_index": 218
+      },
+      {
+        "title": "Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023",
+        "start_index": 219,
+        "end_index": 220
+      },
+      {
+        "title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023",
+        "start_index": 220,
+        "end_index": 222
+      }
+    ]
+  }
+]
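
A few nodes in the tree above have an `end_index` smaller than their `start_index` (for example, 189 and 188 under "Statistical Tables"). Whether that is intended is not stated in this commit. A minimal sketch of a check that flags such nodes before they are used for page retrieval follows; `find_range_issues` is a hypothetical helper, and the sample titles are shortened.

```python
def find_range_issues(nodes, path=()):
    """Return (path, start, end) for every node whose end precedes its start."""
    issues = []
    for node in nodes:
        here = path + (node["title"],)
        if node["end_index"] < node["start_index"]:
            issues.append((" > ".join(here), node["start_index"], node["end_index"]))
        # Recurse into children, carrying the title path for readable reports.
        issues.extend(find_range_issues(node.get("child_nodes", []), here))
    return issues


# Values copied from the tree above; titles shortened for readability.
sample = [
    {
        "title": "Statistical Tables",
        "start_index": 187,
        "end_index": 187,
        "child_nodes": [
            {
                "title": "Bank holdings of securities",
                "start_index": 189,
                "end_index": 188,
                "child_nodes": [
                    {"title": "Bank holdings of securities (continued)",
                     "start_index": 189, "end_index": 190},
                ],
            },
        ],
    },
]

for where, start, end in find_range_issues(sample):
    print(f"{where}: start_index {start} > end_index {end}")
```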
--- a/results/PRML_structure.json
+++ b/results/PRML_structure.json
--- a/release_structure.json
+++ b/release_structure.json
@@ -0,0 +1,51 @@
+[
+  {
+    "title": "Preface",
+    "start_index": 1,
+    "end_index": 2
+  },
+  {
+    "title": "Introduction",
+    "start_index": 2,
+    "end_index": 6
+  },
+  {
+    "title": "Interpretation and Application",
+    "start_index": 6,
+    "end_index": 8,
+    "child_nodes": [
+      {
+        "title": "Historical Context and Legislative History",
+        "start_index": 8,
+        "end_index": 10
+      },
+      {
+        "title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion",
+        "start_index": 10,
+        "end_index": 14
+      },
+      {
+        "title": "Guidance on Applying the Interpretation of the Solely Incidental Prong",
+        "start_index": 14,
+        "end_index": 22
+      }
+    ]
+  },
+  {
+    "title": "Economic Considerations",
+    "start_index": 22,
+    "end_index": 22,
+    "child_nodes": [
+      {
+        "title": "Background",
+        "start_index": 22,
+        "end_index": 23
+      },
+      {
+        "title": "Potential Economic Effects",
+        "start_index": 23,
+        "end_index": 28
+      }
+    ]
+  }
+]
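
In the tree above, a parent's page range does not always enclose its children's: "Economic Considerations" spans 22 to 22, while its children run from 22 to 28. A page lookup therefore has to visit every node instead of pruning on the parent's range. The sketch below illustrates this; `nodes_covering` is an illustrative name, not part of the repository.

```python
def nodes_covering(nodes, page, path=()):
    """Return title paths of all nodes whose [start_index, end_index] holds page."""
    hits = []
    for node in nodes:
        here = path + (node["title"],)
        if node["start_index"] <= page <= node["end_index"]:
            hits.append(" > ".join(here))
        # Always descend: children may cover pages outside the parent's range.
        hits.extend(nodes_covering(node.get("child_nodes", []), page, here))
    return hits


# Copied from the "Economic Considerations" node above.
sample = [
    {
        "title": "Economic Considerations",
        "start_index": 22,
        "end_index": 22,
        "child_nodes": [
            {"title": "Background", "start_index": 22, "end_index": 23},
            {"title": "Potential Economic Effects", "start_index": 23, "end_index": 28},
        ],
    },
]

print(nodes_covering(sample, 23))
```

Adjacent nodes share boundary pages (here page 23), so a single page can map to more than one section.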
--- a/rule_structure.json
+++ b/rule_structure.json
@@ -0,0 +1,466 @@
+[
+  {
+    "title": "Preface",
+    "start_index": 1,
+    "end_index": 6
+  },
+  {
+    "title": "INTRODUCTION",
+    "start_index": 6,
+    "end_index": 12,
+    "child_nodes": [
+      {
+        "title": "Background",
+        "start_index": 12,
+        "end_index": 22,
+        "child_nodes": [
+          {
+            "title": "Evaluation of Standards of Conduct Applicable to Investment Advice",
+            "start_index": 22,
+            "end_index": 26
+          },
+          {
+            "title": "DOL Rulemaking",
+            "start_index": 26,
+            "end_index": 32
+          },
+          {
+            "title": "Statement by Chairman Clayton",
+            "start_index": 32,
+            "end_index": 36
+          }
+        ]
+      },
+      {
+        "title": "General Objectives of Proposed Approach",
+        "start_index": 36,
+        "end_index": 44
+      }
+    ]
+  },
+  {
+    "title": "DISCUSSION OF REGULATION BEST INTEREST",
+    "start_index": 44,
+    "end_index": 44,
+    "child_nodes": [
+      {
+        "title": "Overview of Regulation Best Interest",
+        "start_index": 44,
+        "end_index": 50
+      },
+      {
+        "title": "Best Interest, Generally",
+        "start_index": 50,
+        "end_index": 58,
+        "child_nodes": [
+          {
+            "title": "Consistency with Other Approaches",
+            "start_index": 58,
+            "end_index": 66
+          },
+          {
+            "title": "Request for Comment on the Best Interest Obligation",
+            "start_index": 66,
+            "end_index": 71
+          }
+        ]
+      },
+      {
+        "title": "Key Terms and Scope of Best Interest Obligation",
+        "start_index": 71,
+        "end_index": 71,
+        "child_nodes": [
+          {
+            "title": "Natural Person who is an Associated Person",
+            "start_index": 71,
+            "end_index": 72
+          },
+          {
+            "title": "When Making a Recommendation, At Time Recommendation is Made",
+            "start_index": 72,
+            "end_index": 82
+          },
+          {
+            "title": "Any Securities Transaction or Investment Strategy",
+            "start_index": 82,
+            "end_index": 83
+          },
+          {
+            "title": "Retail Customer",
+            "start_index": 83,
+            "end_index": 90
+          },
+          {
+            "title": "Request for Comment on Key Terms and Scope of Best Interest Obligation",
+            "start_index": 90,
+            "end_index": 96
+          }
+        ]
+      },
+      {
+        "title": "Components of Regulation Best Interest",
+        "start_index": 96,
+        "end_index": 97,
+        "child_nodes": [
+          {
+            "title": "Disclosure Obligation",
+            "start_index": 97,
+            "end_index": 133
+          },
+          {
+            "title": "Care Obligation",
+            "start_index": 133,
+            "end_index": 166
+          },
+          {
+            "title": "Conflict of Interest Obligations",
+            "start_index": 166,
+            "end_index": 196
+          }
+        ]
+      },
+      {
+        "title": "Recordkeeping and Retention",
+        "start_index": 196,
+        "end_index": 199
+      },
+      {
+        "title": "Whether the Exercise of Investment Discretion Should be Viewed as Solely Incidental to the Business of a Broker or Dealer",
+        "start_index": 199,
+        "end_index": 209
+      }
+    ]
+  },
+  {
+    "title": "REQUEST FOR COMMENT",
+    "start_index": 209,
+    "end_index": 210,
+    "child_nodes": [
+      {
+        "title": "Generally",
+        "start_index": 210,
+        "end_index": 212
+      },
+      {
+        "title": "Interactions with Other Standards of Conduct",
+        "start_index": 212,
+        "end_index": 214
+      }
+    ]
+  },
+  {
+    "title": "ECONOMIC ANALYSIS",
+    "start_index": 214,
+    "end_index": 214,
+    "child_nodes": [
+      {
+        "title": "Introduction, Primary Goals of Proposed Regulations and Broad Economic Considerations",
+        "start_index": 214,
+        "end_index": 214,
+        "child_nodes": [
+          {
+            "title": "Introduction and Primary Goals of Proposed Regulation",
+            "start_index": 214,
+            "end_index": 215
+          },
+          {
+            "title": "Broad Economic Considerations",
+            "start_index": 215,
+            "end_index": 225
+          }
+        ]
+      },
+      {
+        "title": "Economic Baseline",
+        "start_index": 225,
+        "end_index": 225,
+        "child_nodes": [
+          {
+            "title": "Market for Advice Services",
+            "start_index": 225,
+            "end_index": 246
+          },
+          {
+            "title": "Regulatory Baseline",
+            "start_index": 246,
+            "end_index": 255
+          }
+        ]
+      },
+      {
+        "title": "Benefits, Costs, and Effects on Efficiency, Competition, and Capital Formation",
+        "start_index": 255,
+        "end_index": 258,
+        "child_nodes": [
+          {
+            "title": "Benefits",
+            "start_index": 258,
+            "end_index": 272
+          },
+          {
+            "title": "Costs",
+            "start_index": 272,
+            "end_index": 275,
+            "child_nodes": [
+              {
+                "title": "Standard of Conduct Defined as Best Interest",
+                "start_index": 275,
+                "end_index": 275,
+                "child_nodes": [
+                  {
+                    "title": "Operational Costs",
+                    "start_index": 275,
+                    "end_index": 277
+                  },
+                  {
+                    "title": "Programmatic Costs",
+                    "start_index": 278,
+                    "end_index": 280
+                  }
+                ]
+              },
+              {
+                "title": "Disclosure Obligation",
+                "start_index": 280,
+                "end_index": 286
+              },
+              {
+                "title": "Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence in Making a Recommendation",
+                "start_index": 286,
+                "end_index": 290
+              },
+              {
+                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and at a Minimum Disclose, or Eliminate, All Material Conflicts of Interest Associated with a Recommendation",
+                "start_index": 290,
+                "end_index": 295,
+                "child_nodes": [
+                  {
+                    "title": "Eliminate Material Conflicts of Interest Associated with a Recommendation",
+                    "start_index": 295,
+                    "end_index": 297
+                  },
+                  {
+                    "title": "At a Minimum Disclose Material Conflicts of Interest Associated with a Recommendation",
+                    "start_index": 297,
+                    "end_index": 299
+                  }
+                ]
+              },
+              {
+                "title": "Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and Disclose and Mitigate, or Eliminate, Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
+                "start_index": 299,
+                "end_index": 300,
+                "child_nodes": [
+                  {
+                    "title": "Eliminate Material Conflicts Arising from Financial Incentives Associated with a Recommendation",
+                    "start_index": 300,
+                    "end_index": 304
+                  },
+                  {
+                    "title": "Disclose and Mitigate Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation",
+                    "start_index": 304,
+                    "end_index": 316
+                  }
+                ]
+              }
+            ]
+          }
+        ]
+      },
+      {
+        "title": "Effects on Efficiency, Competition, and Capital Formation",
+        "start_index": 316,
+        "end_index": 324
+      },
+      {
+        "title": "Reasonable Alternatives",
+        "start_index": 324,
+        "end_index": 325,
+        "child_nodes": [
+          {
+            "title": "Disclosure-Only Alternative",
+            "start_index": 325,
+            "end_index": 327
+          },
+          {
+            "title": "Principles-Based Standard of Conduct Obligation",
+            "start_index": 327,
+            "end_index": 328
+          },
+          {
+            "title": "A Fiduciary Standard for Broker-Dealers",
+            "start_index": 328,
+            "end_index": 332
+          },
+          {
+            "title": "Enhanced Standards Akin to Conditions of the BIC Exemption",
+            "start_index": 332,
+            "end_index": 335
+          }
+        ]
+      },
+      {
+        "title": "Request for Comment",
+        "start_index": 335,
+        "end_index": 338
+      }
+    ]
+  },
+  {
+    "title": "PAPERWORK REDUCTION ACT ANALYSIS",
+    "start_index": 338,
+    "end_index": 340,
+    "child_nodes": [
+      {
+        "title": "Respondents Subject to Proposed Regulation Best Interest and Proposed Amendments to Rule 17a-3(a)(25), Rule 17a-4(e)(5)",
+        "start_index": 340,
+        "end_index": 340,
+        "child_nodes": [
+          {
+            "title": "Broker-Dealers",
+            "start_index": 340,
+            "end_index": 340
+          },
+          {
+            "title": "Natural Persons Who Are Associated Persons of Broker-Dealers",
+            "start_index": 340,
+            "end_index": 341
+          }
+        ]
+      },
+      {
+        "title": "Summary of Collections of Information",
+        "start_index": 341,
+        "end_index": 342,
+        "child_nodes": [
+          {
+            "title": "Conflict of Interest Obligations",
+            "start_index": 342,
+            "end_index": 353
+          },
+          {
+            "title": "Disclosure Obligation",
+            "start_index": 353,
+            "end_index": 370
+          },
+          {
+            "title": "Care Obligation",
+            "start_index": 370,
+            "end_index": 370
+          },
+          {
+            "title": "Record-Making and Recordkeeping Obligations",
+            "start_index": 370,
+            "end_index": 375
+          }
+        ]
+      },
+      {
+        "title": "Collection of Information is Mandatory",
+        "start_index": 375,
+        "end_index": 375
+      },
+      {
+        "title": "Confidentiality",
+        "start_index": 375,
+        "end_index": 376
+      },
+      {
+        "title": "Request for Comment",
+        "start_index": 376,
+        "end_index": 377
+      }
+    ]
+  },
+  {
+    "title": "SMALL BUSINESS REGULATORY ENFORCEMENT FAIRNESS ACT",
+    "start_index": 377,
+    "end_index": 378
+  },
+  {
+    "title": "INITIAL REGULATORY FLEXIBILITY ACT ANALYSIS",
+    "start_index": 378,
+    "end_index": 379,
+    "child_nodes": [
+      {
+        "title": "Reasons for and Objectives of the Proposed Action",
+        "start_index": 379,
+        "end_index": 381
+      },
+      {
+        "title": "Legal Basis",
+        "start_index": 381,
+        "end_index": 381
+      },
+      {
+        "title": "Small Entities Subject to the Proposed Rule",
+        "start_index": 381,
+        "end_index": 382
+      },
+      {
+        "title": "Projected Compliance Requirements of the Proposed Rule for Small Entities",
+        "start_index": 382,
+        "end_index": 383,
+        "child_nodes": [
+          {
+            "title": "Conflict of Interest Obligations",
+            "start_index": 383,
+            "end_index": 386
+          },
+          {
+            "title": "Disclosure Obligations",
+            "start_index": 387,
+            "end_index": 394
+          },
+          {
+            "title": "Obligation to Exercise Reasonable Diligence, Care, Skill and Prudence",
+            "start_index": 394,
+            "end_index": 394
+          },
+          {
+            "title": "Record-Making and Recordkeeping Obligations",
+            "start_index": 394,
+            "end_index": 397
+          }
+        ]
+      },
+      {
+        "title": "Duplicative, Overlapping, or Conflicting Federal Rules",
+        "start_index": 397,
+        "end_index": 398
+      },
+      {
+        "title": "Significant Alternatives",
+        "start_index": 398,
+        "end_index": 401,
+        "child_nodes": [
+          {
+            "title": "Disclosure-Only Alternative",
+            "start_index": 401,
+            "end_index": 401
+          },
+          {
+            "title": "Principles-Based Alternative",
+            "start_index": 401,
+            "end_index": 402
+          },
+          {
+            "title": "Enhanced Standards Akin to BIC Exemption",
+            "start_index": 402,
+            "end_index": 403
+          }
+        ]
+      },
+      {
+        "title": "General Request for Comment",
+        "start_index": 403,
+        "end_index": 403
+      }
+    ]
+  },
+  {
+    "title": "STATUTORY AUTHORITY AND TEXT OF PROPOSED RULE",
+    "start_index": 403,
+    "end_index": 408
+  }
+]
--- a/results/q1-fy25-earnings_structure.json
+++ b/results/q1-fy25-earnings_structure.json
@ -0,0 +1,220 @@
+[
+  {
+    "title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025",
+    "start_index": 1,
+    "end_index": 1,
+    "child_nodes": [
+      {
+        "title": "Financial Results for the Quarter",
+        "start_index": 1,
+        "end_index": 1,
+        "child_nodes": [
+          {
+            "title": "Key Points",
+            "start_index": 1,
+            "end_index": 1
+          }
+        ]
+      },
+      {
+        "title": "Guidance and Outlook",
+        "start_index": 2,
+        "end_index": 2,
+        "child_nodes": [
+          {
+            "title": "Star India deconsolidated in Q1",
+            "start_index": 2,
+            "end_index": 2
+          },
+          {
+            "title": "Q2 Fiscal 2025",
+            "start_index": 2,
+            "end_index": 2
+          },
+          {
+            "title": "Fiscal Year 2025",
+            "start_index": 2,
+            "end_index": 2
+          }
+        ]
+      },
+      {
+        "title": "Message From Our CEO",
+        "start_index": 2,
+        "end_index": 2
+      },
+      {
+        "title": "SUMMARIZED FINANCIAL RESULTS",
+        "start_index": 3,
+        "end_index": 3,
+        "child_nodes": [
+          {
+            "title": "SUMMARIZED SEGMENT FINANCIAL RESULTS",
+            "start_index": 3,
+            "end_index": 3
+          }
+        ]
+      },
+      {
+        "title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS",
+        "start_index": 4,
+        "end_index": 4,
+        "child_nodes": [
+          {
+            "title": "Star India",
+            "start_index": 4,
+            "end_index": 4
+          },
+          {
+            "title": "Entertainment",
+            "start_index": 4,
+            "end_index": 4,
+            "child_nodes": [
+              {
+                "title": "Linear Networks",
+                "start_index": 5,
+                "end_index": 5
+              },
+              {
+                "title": "Direct-to-Consumer",
+                "start_index": 5,
+                "end_index": 7
+              },
+              {
+                "title": "Content Sales/Licensing and Other",
+                "start_index": 7,
+                "end_index": 7
+              }
+            ]
+          },
+          {
+            "title": "Sports",
+            "start_index": 7,
+            "end_index": 7,
+            "child_nodes": [
+              {
+                "title": "Domestic ESPN",
+                "start_index": 8,
+                "end_index": 8
+              },
+              {
+                "title": "International ESPN",
+                "start_index": 8,
+                "end_index": 8
+              },
+              {
+                "title": "Star India",
+                "start_index": 8,
+                "end_index": 8
+              }
+            ]
+          },
+          {
+            "title": "Experiences",
+            "start_index": 9,
+            "end_index": 9,
+            "child_nodes": [
+              {
+                "title": "Domestic Parks and Experiences",
+                "start_index": 9,
+                "end_index": 9
+              },
+              {
+                "title": "International Parks and Experiences",
+                "start_index": 9,
+                "end_index": 9
+              }
+            ]
+          }
+        ]
+      },
+      {
+        "title": "OTHER FINANCIAL INFORMATION",
+        "start_index": 9,
+        "end_index": 9,
+        "child_nodes": [
+          {
+            "title": "Corporate and Unallocated Shared Expenses",
+            "start_index": 9,
+            "end_index": 9
+          },
+          {
+            "title": "Restructuring and Impairment Charges",
+            "start_index": 9,
+            "end_index": 9
+          },
+          {
+            "title": "Interest Expense, net",
+            "start_index": 10,
+            "end_index": 10
+          },
+          {
+            "title": "Equity in the Income of Investees",
+            "start_index": 10,
+            "end_index": 10
+          },
+          {
+            "title": "Income Taxes",
+            "start_index": 10,
+            "end_index": 10
+          },
+          {
+            "title": "Noncontrolling Interests",
+            "start_index": 11,
+            "end_index": 11
+          },
+          {
+            "title": "Cash from Operations",
+            "start_index": 11,
+            "end_index": 11
+          },
+          {
+            "title": "Capital Expenditures",
+            "start_index": 12,
+            "end_index": 12
+          },
+          {
+            "title": "Depreciation Expense",
+            "start_index": 12,
+            "end_index": 12
+          }
+        ]
+      },
+      {
+        "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME",
+        "start_index": 13,
+        "end_index": 13
+      },
+      {
+        "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS",
+        "start_index": 14,
+        "end_index": 14
+      },
+      {
+        "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS",
+        "start_index": 15,
+        "end_index": 15
+      },
+      {
+        "title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS",
+        "start_index": 16,
+        "end_index": 16
+      },
+      {
+        "title": "NON-GAAP FINANCIAL MEASURES",
+        "start_index": 17,
+        "end_index": 20
+      },
+      {
+        "title": "FORWARD-LOOKING STATEMENTS",
+        "start_index": 21,
+        "end_index": 21
+      },
+      {
+        "title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION",
+        "start_index": 22,
+        "end_index": 22
+      }
+    ]
+  }
+]
--- a/utils.py
+++ b/utils.py
@ -0,0 +1,524 @@
+import tiktoken
+import openai
+import logging
+import os
+from datetime import datetime
+import time
+import json
+import PyPDF2
+import copy
+import asyncio
+import pymupdf
+from io import BytesIO
+import re
+
+
+def count_tokens(text, model):
+    enc = tiktoken.encoding_for_model(model)
+    tokens = enc.encode(text)
+    return len(tokens)
+
+def ChatGPT_API_with_finish_reason(model, prompt, api_key, chat_history=None):
+    max_retries = 10
+    client = openai.OpenAI(api_key=api_key)
+    for i in range(max_retries):
+        try:
+            if chat_history:
+                messages = list(chat_history)  # copy so retries don't re-append the prompt to the caller's list
+                messages.append({"role": "user", "content": prompt})
+            else:
+                messages = [{"role": "user", "content": prompt}]
+            
+            response = client.chat.completions.create(
+                model=model,
+                messages=messages,
+                temperature=0,
+            )
+            if response.choices[0].finish_reason == "length":
+                return response.choices[0].message.content, "max_output_reached"
+            else:
+                return response.choices[0].message.content, "finished"
+
+        except Exception as e:
+            print('************* Retrying *************')
+            logging.error(f"Error: {e}")
+            if i < max_retries - 1:
+                time.sleep(1)  # Wait for 1 second before retrying
+            else:
+                logging.error('Max retries reached for prompt: ' + prompt)
+                return "Error", "error"
+
+
+
+def ChatGPT_API(model, prompt, api_key, chat_history=None):
+    max_retries = 10
+    client = openai.OpenAI(api_key=api_key)
+    for i in range(max_retries):
+        try:
+            if chat_history:
+                messages = list(chat_history)  # copy so retries don't re-append the prompt to the caller's list
+                messages.append({"role": "user", "content": prompt})
+            else:
+                messages = [{"role": "user", "content": prompt}]
+            
+            response = client.chat.completions.create(
+                model=model,
+                messages=messages,
+                temperature=0,
+            )
+   
+            return response.choices[0].message.content
+        except Exception as e:
+            print('************* Retrying *************')
+            logging.error(f"Error: {e}")
+            if i < max_retries - 1:
+                time.sleep(1)  # Wait for 1 second before retrying
+            else:
+                logging.error('Max retries reached for prompt: ' + prompt)
+                return "Error"
+            
+
+async def ChatGPT_API_async(model, prompt, api_key):
+    max_retries = 10
+    client = openai.AsyncOpenAI(api_key=api_key)
+    for i in range(max_retries):
+        try:
+            messages = [{"role": "user", "content": prompt}]
+            response = await client.chat.completions.create(
+                model=model,
+                messages=messages,
+                temperature=0,
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            print('************* Retrying *************')
+            logging.error(f"Error: {e}")
+            if i < max_retries - 1:
+                await asyncio.sleep(1)  # Wait for 1 second before retrying
+            else:
+                logging.error('Max retries reached for prompt: ' + prompt)
+                return "Error"  
+            
+def get_json_content(response):
+    start_idx = response.find("```json")
+    if start_idx != -1:
+        start_idx += 7
+        response = response[start_idx:]
+        
+    end_idx = response.rfind("```")
+    if end_idx != -1:
+        response = response[:end_idx]
+    
+    json_content = response.strip()
+    return json_content
+         
+
+def extract_json(content):
+    try:
+        # First, try to extract JSON enclosed within ```json and ```
+        start_idx = content.find("```json")
+        if start_idx != -1:
+            start_idx += 7  # Adjust index to start after the delimiter
+            end_idx = content.rfind("```")
+            json_content = content[start_idx:end_idx].strip()
+        else:
+            # If no delimiters, assume entire content could be JSON
+            json_content = content.strip()
+
+        # Clean up common issues that might cause parsing errors
+        json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null
+        json_content = json_content.replace('\n', ' ').replace('\r', ' ')  # Remove newlines
+        json_content = ' '.join(json_content.split())  # Normalize whitespace
+
+        # Attempt to parse and return the JSON object
+        return json.loads(json_content)
+    except json.JSONDecodeError as e:
+        logging.error(f"Failed to extract JSON: {e}")
+        # Try to clean up the content further if initial parsing fails
+        try:
+            # Remove any trailing commas before closing brackets/braces
+            json_content = json_content.replace(',]', ']').replace(',}', '}')
+            return json.loads(json_content)
+        except json.JSONDecodeError:
+            logging.error("Failed to parse JSON even after cleanup")
+            return {}
+    except Exception as e:
+        logging.error(f"Unexpected error while extracting JSON: {e}")
+        return {}
+
+def write_node_id(data, node_id=0):
+    if isinstance(data, dict):
+        data['node_id'] = str(node_id).zfill(4)
+        node_id += 1
+        for key in list(data.keys()):
+            if 'child_nodes' in key:
+                node_id = write_node_id(data[key], node_id)
+    elif isinstance(data, list):
+        for index in range(len(data)):
+            node_id = write_node_id(data[index], node_id)
+    return node_id
+
+def get_nodes(structure):
+    if isinstance(structure, dict):
+        structure_node = copy.deepcopy(structure)
+        structure_node.pop('child_nodes', None)
+        nodes = [structure_node]
+        for key in list(structure.keys()):
+            if 'child_nodes' in key:
+                nodes.extend(get_nodes(structure[key]))
+        return nodes
+    elif isinstance(structure, list):
+        nodes = []
+        for item in structure:
+            nodes.extend(get_nodes(item))
+        return nodes
+    
+def structure_to_list(structure):
+    if isinstance(structure, dict):
+        nodes = []
+        nodes.append(structure)
+        if 'child_nodes' in structure:
+            nodes.extend(structure_to_list(structure['child_nodes']))
+        return nodes
+    elif isinstance(structure, list):
+        nodes = []
+        for item in structure:
+            nodes.extend(structure_to_list(item))
+        return nodes
+
+    
+def get_leaf_nodes(structure):
+    if isinstance(structure, dict):
+        if not structure.get('child_nodes'):  # list_to_tree removes empty 'child_nodes' keys
+            structure_node = copy.deepcopy(structure)
+            structure_node.pop('child_nodes', None)
+            return [structure_node]
+        else:
+            leaf_nodes = []
+            for key in list(structure.keys()):
+                if 'child_nodes' in key:
+                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
+            return leaf_nodes
+    elif isinstance(structure, list):
+        leaf_nodes = []
+        for item in structure:
+            leaf_nodes.extend(get_leaf_nodes(item))
+        return leaf_nodes
+
+def is_leaf_node(data, node_id):
+    # Helper function to find the node by its node_id
+    def find_node(data, node_id):
+        if isinstance(data, dict):
+            if data.get('node_id') == node_id:
+                return data
+            for key in data.keys():
+                if 'child_nodes' in key:
+                    result = find_node(data[key], node_id)
+                    if result:
+                        return result
+        elif isinstance(data, list):
+            for item in data:
+                result = find_node(item, node_id)
+                if result:
+                    return result
+        return None
+
+    # Find the node with the given node_id
+    node = find_node(data, node_id)
+
+    # Check if the node is a leaf node
+    if node and not node.get('child_nodes'):
+        return True
+    return False
+
+def get_last_node(structure):
+    return structure[-1]
+
+
+def extract_text_from_pdf(pdf_path):
+    pdf_reader = PyPDF2.PdfReader(pdf_path)
+    # Return the text as a single string, not a list of pages
+    text=""
+    for page_num in range(len(pdf_reader.pages)):
+        page = pdf_reader.pages[page_num]
+        text+=page.extract_text()
+    return text
+
+def get_pdf_title(pdf_path):
+    pdf_reader = PyPDF2.PdfReader(pdf_path)
+    meta = pdf_reader.metadata
+    title = meta.title
+    return title
+
+def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
+    pdf_reader = PyPDF2.PdfReader(pdf_path)
+    text = ""
+    for page_num in range(start_page-1, end_page):
+        page = pdf_reader.pages[page_num]
+        page_text = page.extract_text()
+        if tag:
+            text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n"
+        else:
+            text += page_text
+    return text
+
+def get_first_start_page_from_text(text):
+    start_page = -1
+    start_page_match = re.search(r'<start_index_(\d+)>', text)
+    if start_page_match:
+        start_page = int(start_page_match.group(1))
+    return start_page
+
+def get_last_start_page_from_text(text):
+    start_page = -1
+    # Find all matches of start_index tags
+    start_page_matches = re.finditer(r'<start_index_(\d+)>', text)
+    # Convert iterator to list and get the last match if any exist
+    matches_list = list(start_page_matches)
+    if matches_list:
+        start_page = int(matches_list[-1].group(1))
+    return start_page
+
+
+
+
+def sanitize_filename(filename, replacement='-'):
+    # On Linux, only '/' and '\0' (null) are invalid in filenames.
+    # A null character is unlikely in a title, so only '/' is replaced.
+    return filename.replace('/', replacement)
+
+class JsonLogger:
+    def __init__(self, file_path):
+        # Extract PDF name without extension for logger name and filename
+        # pdf_name = os.path.splitext(os.path.basename(file_path))[0]
+        if isinstance(file_path, str):
+            pdf_name = os.path.splitext(os.path.basename(file_path))[0]
+        elif isinstance(file_path, BytesIO):
+            pdf_reader = PyPDF2.PdfReader(file_path)
+            meta = pdf_reader.metadata
+            pdf_name = meta.title if meta.title else 'Untitled'
+            pdf_name = sanitize_filename(pdf_name)
+            
+        current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
+        self.filename = f"{pdf_name}_{current_time}.json"
+        os.makedirs("./logs", exist_ok=True)
+        # Initialize empty list to store all messages
+        self.log_data = []
+
+    def log(self, level, message, **kwargs):
+        # Add the new message to the in-memory log (level and kwargs are currently not recorded)
+        if isinstance(message, dict):
+            self.log_data.append(message)
+        else:
+            self.log_data.append({'message': message})
+
+        # Rewrite the whole log file with the accumulated messages
+        with open(self._filepath(), "w") as f:
+            json.dump(self.log_data, f, indent=2)
+
+    def info(self, message, **kwargs):
+        self.log("INFO", message, **kwargs)
+
+    def error(self, message, **kwargs):
+        self.log("ERROR", message, **kwargs)
+
+    def debug(self, message, **kwargs):
+        self.log("DEBUG", message, **kwargs)
+
+    def exception(self, message, **kwargs):
+        kwargs["exception"] = True
+        self.log("ERROR", message, **kwargs)
+
+    def _filepath(self):
+        return os.path.join("logs", self.filename)
+    
+
+
+
+def list_to_tree(data):
+    def get_parent_structure(structure):
+        """Helper function to get the parent structure code"""
+        if not structure:
+            return None
+        parts = str(structure).split('.')
+        return '.'.join(parts[:-1]) if len(parts) > 1 else None
+    
+    # First pass: Create nodes and track parent-child relationships
+    nodes = {}
+    root_nodes = []
+    
+    for item in data:
+        structure = item.get('structure')
+        node = {
+            'title': item.get('title'),
+            'start_index': item.get('start_index'),
+            'end_index': item.get('end_index'),
+            'child_nodes': []
+        }
+        
+        nodes[structure] = node
+        
+        # Find parent
+        parent_structure = get_parent_structure(structure)
+        
+        if parent_structure:
+            # Add as child to parent if parent exists
+            if parent_structure in nodes:
+                nodes[parent_structure]['child_nodes'].append(node)
+            else:
+                root_nodes.append(node)
+        else:
+            # No parent, this is a root node
+            root_nodes.append(node)
+    
+    # Helper function to clean empty children arrays
+    def clean_node(node):
+        if not node['child_nodes']:
+            del node['child_nodes']
+        else:
+            for child in node['child_nodes']:
+                clean_node(child)
+        return node
+    
+    # Clean and return the tree
+    return [clean_node(node) for node in root_nodes]
+
+def add_preface_if_needed(data):
+    if not isinstance(data, list) or not data:
+        return data
+
+    if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
+        preface_node = {
+            "structure": "0",
+            "title": "Preface",
+            "physical_index": 1,
+        }
+        data.insert(0, preface_node)
+    return data
+
+
+
+def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
+    if pdf_parser == "PyPDF2":
+        pdf_reader = PyPDF2.PdfReader(pdf_path)
+        page_texts = [page.extract_text() for page in pdf_reader.pages]
+    elif pdf_parser == "PyMuPDF":
+        # PyMuPDF documents are iterated directly; pages expose get_text(), not extract_text()
+        page_texts = [page.get_text() for page in pymupdf.open(pdf_path)]
+    else:
+        raise ValueError(f"Unsupported PDF parser: {pdf_parser}")
+
+    enc = tiktoken.encoding_for_model(model)
+
+    page_list = []
+    for page_text in page_texts:
+        token_length = len(enc.encode(page_text))
+        page_list.append((page_text, token_length))
+    
+    return page_list
+
+
+
+        
+
+def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
+    text = ""
+    for page_num in range(start_page-1, end_page):
+        text += pdf_pages[page_num]
+    return text
+
+def get_number_of_pages(pdf_path):
+    pdf_reader = PyPDF2.PdfReader(pdf_path)
+    num = len(pdf_reader.pages)
+    return num
+
+
+
+def post_processing(structure, end_physical_index):
+    # First derive start_index and end_index from physical_index in the flat list
+    for i, item in enumerate(structure):
+        item['start_index'] = item.get('physical_index')
+        if i < len(structure) - 1:
+            if structure[i + 1].get('appear_start') == 'yes':
+                item['end_index'] = structure[i + 1]['physical_index']-1
+            else:
+                item['end_index'] = structure[i + 1]['physical_index']
+        else:
+            item['end_index'] = end_physical_index
+    tree = list_to_tree(structure)
+    if len(tree)!=0:
+        return tree
+    else:
+        # Remove the intermediate appear_start and physical_index fields
+        for node in structure:
+            node.pop('appear_start', None)
+            node.pop('physical_index', None)
+        return structure
+
+def clean_structure_post(data):
+    if isinstance(data, dict):
+        data.pop('page_number', None)
+        data.pop('start_index', None)
+        data.pop('end_index', None)
+        if 'child_nodes' in data:
+            clean_structure_post(data['child_nodes'])
+    elif isinstance(data, list):
+        for section in data:
+            clean_structure_post(section)
+    return data
+
+
+def remove_structure_text(data):
+    if isinstance(data, dict):
+        data.pop('text', None)
+        if 'child_nodes' in data:
+            remove_structure_text(data['child_nodes'])
+    elif isinstance(data, list):
+        for item in data:
+            remove_structure_text(item)
+    return data
+
+
+def check_token_limit(structure, limit=110000):
+    nodes = structure_to_list(structure)
+    for node in nodes:
+        num_tokens = count_tokens(node['text'], model='gpt-4o')
+        if num_tokens > limit:
+            print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
+            print("Start Index:", node['start_index'])
+            print("End Index:", node['end_index'])
+            print("Title:", node['title'])
+            # print(node['text'])
+            print("\n")
+
+
+def convert_physical_index_to_int(data):
+    if isinstance(data, list):
+        for i in range(len(data)):
+            if isinstance(data[i]['physical_index'], str):
+                if data[i]['physical_index'].startswith('<physical_index_'):
+                    data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
+                elif data[i]['physical_index'].startswith('physical_index_'):
+                    data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
+    elif isinstance(data, str):
+        if data.startswith('<physical_index_'):
+            data = int(data.split('_')[-1].rstrip('>').strip())
+        elif data.startswith('physical_index_'):
+            data = int(data.split('_')[-1].strip())
+        # Return None if the string could not be converted to an int
+        if isinstance(data, int):
+            return data
+        else:
+            return None
+    return data
+
+
+def convert_page_to_int(data):
+    for item in data:
+        if 'page' in item and isinstance(item['page'], str):
+            try:
+                item['page'] = int(item['page'])
+            except ValueError:
+                # Keep original value if conversion fails
+                pass
+    return data