PageIndex/pageindex/node_list.json

[
{
"title": "Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis",
"line_num": 1,
"level": 1,
"text": "# Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis\n\nJiawei Wang ${ }^{\\mathrm{a}, \\mathrm{b}, 1, *}$, Kai $\\mathrm{Hu}^{\\mathrm{a}, \\mathrm{b}, 1, *}$, Zhuoyao Zhong ${ }^{\\mathrm{b}, 1, *}$, Lei Sun ${ }^{\\mathrm{b}, 1}$, Qiang Huo ${ }^{\\mathrm{b}}$<br>${ }^{a}$ Department of EEIS, University of Science and Technology of China, Hefei, 230026, China<br>${ }^{\\mathrm{b}}$ Microsoft Research Asia, Beijing, 100080, China",
"text_token_count": 25658
},
{
"title": "Abstract",
"line_num": 6,
"level": 2,
"text": "## Abstract\n\nDocument structure analysis (aka document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, knowledge extraction, etc. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA) to explore hierarchical relationships within structured documents created using authoring software employing hierarchical schemas, such as LaTeX, Microsoft Word, and HTML. To comprehensively analyze hierarchical document structures, we propose a tree construction based approach that addresses multiple subtasks concurrently, including page object detection (Detect), reading order prediction of identified objects (Order), and the construction of intended hierarchical structure (Construct). We present an effective end-to-end solution based on this framework to demonstrate its performance. To assess our approach, we develop a comprehensive benchmark called Comp-HRDoc, which evaluates the above subtasks simultaneously. Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark. The Comp-HRDoc benchmark is publicly available at https://github.com/microsoft/CompHRDoc.\n\n\nKeywords: Document Layout Analysis, Table of Contents, Reading Order Prediction, Page Object Detection",
"text_token_count": 260
},
{
"title": "1. Introduction",
"line_num": 13,
"level": 2,
"text": "## 1. Introduction\n\nDocument Structure Analysis (DSA) is a comprehensive process that identifies the fundamental components within a document, encompassing headings, paragraphs, lists, tables, and figures, and subsequently establishes the logical relationships and structures of these components. This process results in a structured\n\n[^0]\n[^0]: *Corresponding author.\n Email addresses: wangjiawei@mail.ustc.edu.cn (Jiawei Wang), hk970213@mail.ustc.edu.cn (Kai Hu), zhuoyao.zhong@gmail.com (Zhuoyao Zhong), kuangtongustc@gmail.com (Lei Sun), qianghuo@microsoft.com (Qiang Huo)\n ${ }^{1}$ Work done when Jiawei Wang and Kai Hu were interns, Zhuoyao Zhong and Lei Sun were FTEs at Multi-Modal Interaction Group, Microsoft Research Asia, Beijing, China.\n\nrepresentation of the document's physical layout that accurately mirrors its logical structure, thereby enhancing the effectiveness and accessibility of information retrieval and processing. In a contemporary digital landscape, the majority of mainstream documents are structured creations, crafted using hierarchical-schema authoring software such as LaTeX, Microsoft Word, and HTML. Consequently, Hierarchical Document Structure Analysis (HDSA), which focuses on extracting and reconstructing the inherent hierarchical structures within these document layouts, has gained significant attention. However, despite its burgeoning popularity, HDSA poses a substantial challenge due to the diversity of document content and the intricate complexity of their layouts.\n\nOver the past three decades, document structure analysis has garnered significant interest in the research community. Early research efforts primarily focused on physical layout analysis and logical structure analysis, employing various approaches such as knowledge-based [1], rule-based [2], model-based [3], and grammar-based [4] methods. 
However, these traditional methods face limitations in terms of effectiveness and scalability due to their susceptibility to noise, ambiguity, and difficulties in handling complex document collections. Furthermore, the absence of quantitative performance evaluations hinders the proper evaluation of these techniques. In the era of deep learning, a growing number of deep learning based approaches have been applied to the field of document structure analysis, leading to notable improvements in performance and robustness. However, these methods primarily focus on specific sub-tasks of DSA, such as Page Object Detection, Reading Order Prediction, and Table of Contents (TOC) Extraction, among others. Despite the substantial progress achieved in these individual sub-tasks, there remains a gap in the research community for a comprehensive end-to-end system or benchmark that addresses all aspects of document structure analysis concurrently. Filling this gap would significantly advance the field and encourage further research in this area.\n\nRecently, hierarchical document structure analysis has gained traction with representative explorations like DocParser and HRDoc. DocParser [5] is an end-to-end system for parsing document renderings into hierarchical document structures, encompassing all text elements, nested figures, tables, and table cell structures. Initially, the system employs Mask R-CNN [6] to detect all document entities within a document image. Subsequently, it devises a set of rules to predict two predefined relationships (i.e., \"parent_of\" and \"followed_by\") between document entities to parse the complete physical structure of the document. However, the system does not take into account the logical structure of documents, such as the table of contents, and its reliance on a rule-based approach considerably limits its overall effectiveness and adaptability. 
On the other hand, HRDoc [7] proposed an encoder-decoder based hierarchical document structure parsing system (DSPS) to reconstruct the hierarchical structure of documents. This system employs a multi-modal bidirectional encoder and a structure-aware GRU decoder to predict the logical roles of the te
"text_token_count": 1886
},
{
"title": "2. Related Work",
"line_num": 51,
"level": 2,
"text": "## 2. Related Work\n\nSince the 1980s, numerous studies have been conducted on document structure analysis, which can be categorized into physical structure analysis (or physical layout analysis) and logical structure analysis [11]. Physical layout analysis focuses on identifying homogeneous regions of interest, also known as page objects, while logical structure analysis aims to assign logical roles to these regions and determine their relationships. Early approaches to document structure analysis, mainly based on heuristic rules or grammar analysis, can be found in surveys [11, 12]. In the past decade, a growing body of research $[8,9,13,14]$ has focused on document layout analysis, specifically physical layout analysis and logical role classification, which is also known as page object detection [15]. To maintain clarity, we will consistently use the term \"page object detection\" throughout this article to refer to the document layout analysis task that incorporates both physical layout analysis and logical role classification. In addition to detecting page objects, numerous research studies have delved into the logical relationships between components within documents. These investigations have focused on aspects such as the reading order relationships and the organization of tables of contents. In this section, we primarily review and analyze recent developments in page object detection, reading order prediction, and hierarchical document structure reconstruction, providing an overview of the latest advancements and methodologies in these areas.",
"text_token_count": 2882
},
{
"title": "2.1. Page Object Detection",
"line_num": 55,
"level": 3,
"text": "### 2.1. Page Object Detection\n\nPage Object Detection, also known as POD [15], is a task that involves locating logical objects (e.g., paragraphs, tables, mathematical equations, graphics, and figures) within document pages. Deep learning-based POD approaches can be broadly classified into three categories: object detection-based methods, semantic segmentation-based methods, and graph-based methods.\n\nObject detection-based methods. These methods leverage the latest top-down object detection or instance segmentation frameworks to address the page object detection problem. Pioneering efforts by Yi et al. [16] and Oliveira et al. [17] adapted R-CNN [18] to identify and recognize page objects from document images. However, their performance was hindered by the limitations of traditional region proposal generation strategies. Subsequent research explored more sophisticated object detectors, such as Fast R-CNN [19], Faster R-CNN [20], Mask R-CNN [6], Cascade R-CNN [21], SOLOv2 [22], CondInst [23], YOLOv5 [24], and Deformable DETR [25], as investigated by Vo et al. [26], Zhong et al. [8], Saha et al. [27], Li et al. [28], Biswas et al. [29], Hu et al. [30], Pfitzmann et al. [9], and Yang et al. [31], respectively. In addition, researchers have proposed effective techniques to enhance the performance of these detectors. For example, Zhang et al. [32] introduced a multi-modal Faster/Mask R-CNN model for page object detection that fused visual feature maps extracted by CNN with two 2D text embedding maps containing sentence and character embeddings. They also incorporated a graph neural network (GNN) based relation module to model the interactions between page object candidates. Shi et al. [33] proposed a novel lateral feature enhancement backbone network, while Yang et al. [31] employed Swin Transformer [34] as a more robust backbone network to boost the performance of Mask R-CNN and Deformable DETR for page object detection. Recently, Gu et al. 
[35], Li et al. [28], and Huang et al. [36] further improved the performance of Faster R-CNN, Mask R-CNN, and Cascade R-CNN-based page object detectors by pre-training the vision backbone networks on large-scale document images using self-supervised learning algorithms. Despite achieving state-of-the-art results on several benchmark datasets, these methods continue to face challenges in detecting small-scale text regions.\n\nSemantic segmentation based methods. These methods, such as those proposed by Yang et al. [13], He et al. [37], Li et al. [38, 39], and Sang et al. [40], typically employ existing semantic segmentation frameworks, such as FCN [41], to initially generate a pixel-level segmentation mask. Subsequently, the pixels are merged to form distinct types of page objects. Yang et al. [13] introduced a multi-modal FCN for page object segmentation, which combined visual feature maps and 2D text embedding maps with sentence embeddings to enhance pixel-wise classification accuracy. He et al. [37] developed a multi-scale, multi-task FCN designed to concurrently predict a region segmentation mask and a contour segmentation mask. After refinement using a conditional random field (CRF) model, these two segmentation masks are processed by a post-processing module to obtain the final prediction results. Li et al. [39] integrated label pyramids and deep watershed transformation into the vanilla FCN structure to prevent the merging of adjacent page objects. Despite their advancements, the performance of existing semantic segmentation-based methods remains inferior to that of the other two categories of approaches when evaluated on recent document layout analysis benchmarks.\n\nGraph-based methods. These approaches (e.g., [42-45]) represent each document page as a graph, where the nodes correspond to primitive page objects (e.g., words, text-lines, connected components), and the edges denote relationships between neighboring primitive page objects. 
The detection of page objects is then formulated as a graph labeling problem. Li et al. [42] employed image processing techniq
"text_token_count": 1071
},
{
"title": "2.2. Reading Order Prediction",
"line_num": 67,
"level": 3,
"text": "### 2.2. Reading Order Prediction\n\nThe objective of reading order prediction is to determine the appropriate reading sequence for documents. Generally, humans tend to read documents in a left-to-right and top-to-bottom manner. However, such simplistic sorting rules may prove inadequate when applied to complex documents with tokens extracted by OCR tools. Previous research has attempted to tackle the reading order issue using a variety of approaches. As categorized by Wang et al. [49], these methods can be broadly classified into rule-based sorting and machine learning-based sequence prediction, among others.\n\nRule-based sorting. Topological sorting, first introduced by Breuel [50], has been utilized for document layout analysis. In this method, partial orders are determined based on the x/y interval overlaps between text lines, enabling the generation of reading order patterns for multi-column text layouts. A bidimensional relation rule, proposed in [51], offers similar topological rules while also incorporating a row-wise rule by inverting the x/y axes from column-wise. In the same vein, an argumentation-based approach in [52] utilizes rules derived from relationships between text blocks. For text layouts with hierarchies and larger sizes, XYCut $[53,54]$ can serve as an efficient method to order all text blocks from top to bottom and left to right for specific layout types. Despite their effectiveness in certain scenarios, these rule-based methods can be prone to failure when confronted with out-of-domain cases.\n\nMachine learning-based sequence prediction. Designed to learn from training examples across various domains, machine learning-based approaches aim to provide a general solution for reading order prediction. Ceci et al. [55] introduced a probabilistic classifier within the Bayesian framework, which is\n\ncapable of reconstructing single or multiple chains of layout components based on learned partial orders. 
Differently, an inductive logic programming (ILP) learning algorithm was applied in [56] to learn two kinds of predicates, first_to_read/1 and succ_in_reading/2, thereby establishing an ordering relationship. In recent years, deep learning models have emerged as the leading solution for numerous machine learning challenges. Li et al. [57] proposed an end-to-end OCR text reorganizing model, using a graph convolutional encoder and a pointer network decoder to reorder text blocks. LayoutReader [58] introduced a benchmark dataset called ReadingBank, which contains reading order, text, and layout information, and employed a transformer-based architecture on spatial-text features to predict the reading order sequence of words. However, the decoding speed of these auto-regressive-based methods is limited when applied to rich text documents. Recently, Quirós et al. [59] followed the idea of assuming a pairwise partial order at the element level from [50] and proposed two new reading-order decoding algorithms for reading order prediction on handwritten documents. They also provided a theoretical background for these algorithms. A significant limitation of this approach is that the partial order between two entities is determined solely by pair-wise spatial features, without considering visual or textual information.",
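As a concrete illustration of the rule-based sorting idea above, the following minimal Python sketch derives pairwise before-relations from box geometry and topologically sorts them. The two rules used here (a line precedes another if they overlap horizontally and it sits above, or if it lies entirely to the left) are simplified assumptions in the spirit of Breuel's topological sorting [50], not the exact published algorithm.

```python
from collections import defaultdict, deque

def reading_order(boxes):
    """Topologically sort text-line boxes (x1, y1, x2, y2) into reading order.

    Simplified rule set: box a precedes b if they overlap in x and a is
    above b, or if a lies entirely to the left of b (handles multi-column
    layouts by reading the left column first).
    """
    n = len(boxes)
    succ = defaultdict(list)
    indeg = [0] * n
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            x_overlap = min(a[2], b[2]) > max(a[0], b[0])
            before = (x_overlap and a[1] < b[1]) or (not x_overlap and a[2] <= b[0])
            if before:
                succ[i].append(j)
                indeg[j] += 1
    # Standard Kahn topological sort over the induced partial order.
    order = []
    queue = deque(i for i in range(n) if indeg[i] == 0)
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    return order
```

For a simple two-column layout, this reads down the left column and then down the right, which is exactly where naive top-to-bottom y-sorting fails.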
"text_token_count": 612
},
{
"title": "2.3. Hierarchical Document Structure Reconstruction",
"line_num": 77,
"level": 3,
"text": "### 2.3. Hierarchical Document Structure Reconstruction\n\nThe process of reconstructing a document's hierarchical structure aims to recover its logical structure, which conveys semantic information beyond the character strings that comprise its contents. Table of Contents is a crucial component in reconstructing the hierarchical structure. Consequently, existing research studies on hierarchical structure reconstruction can be broadly categorized into two groups. The first group primarily focuses on extracting the table of contents within documents. The second group places emphasis on overall structure reconstruction of a document.\n\nTable of Contents. Table of contents extraction is the task of restoring the structure of a document and recognizing the hierarchy of its sections. It is a challenging task due to the diversity of TOC styles and layouts. Early methods relied on heuristic rules derived from small data sets for specific domains, which were not effective in large-scale heterogeneous documents. Wu et al. [60] identified three basic TOC styles: \"flat\", \"ordered\", and \"divided\". Based on these styles, they proposed an approach for TOC recognition that adaptively selects appropriate rules according to the basic TOC style features. However, this method assumes the existence of a Table of Contents page within the documents. Nguyen et al. [61] proposed a system that combines a TOC page detection method with a link-based TOC reconstruction method to address the TOC extraction problem. Cao et al. [62] developed a framework called Hierarchy Extraction from Long Document (HELD) to tackle the problem of TOC extraction in long documents. This approach sequentially inserts each section heading into the TOC tree at the correct position, considering sibling and parent information using LSTM [63]. Recently, Hu et al. [64] proposed an end-to-end model by using a multimodal tree decoder (MTD) for table of contents extraction. 
The MTD model fuses multimodal features for each entity of the document and parses the hierarchical relationship by a tree-structured decoder.\n\nOverall Structure Reconstruction. To reconstruct the overall structure of a document, it is critical to represent the structure and layout of the document. Intuitively, graph representation for document structure is most general and can encapsulate the relationship between regions and their properties. The graph representation, however, fails to capture the hierarchical nature of a document structure and layout. Also, it is hard to define a complete graph representation for a document. To capture this hierarchical nature, one could use a rooted tree for representing document layout and logical structure [65]. One of the most powerful ways to express hierarchical structures is to use formal grammars [66]. The classes of regular and context-free grammars are extremely useful in describing the structure of most documents. However, there could be multiple derivations corresponding to a particular sequence of terminals. This would mean multiple interpretations of the structure or layout. Tateisi et al. [67] proposed a stochastic grammar to integrate multiple sources of evidence and estimate the most probable parse or interpretation of a given document. Despite its usefulness, stochastic grammars may lack the flexibility to model complex patterns and structures, particularly when handling highly diverse data. In recent years, several deep learning based methods have been proposed for tree-based document structure reconstruction. Wang et al. [68] concentrated on the form understanding task, treating the form structure as a tree-like hierarchy composed of text fragments. To predict the relationship between each pair of text fragments, they employed an asymmetric parameter matrix. However, this approach resulted in high computational complexity when dealing with documents containing a large number of text fragments. DocParser, as proposed by Rausch et al. 
[5], presented an end-to-end system designed to parse the complete physical structure of documents including all text elements, nested fig
"text_token_count": 927
},
{
"title": "3. Problem Definition",
"line_num": 87,
"level": 2,
"text": "## 3. Problem Definition\n\nThe majority of document types, such as scientific papers, books, reports, and legal documents, typically exhibit a hierarchical document structure in a tree-like format. In this structure, the nodes within the\n\n![img-1.jpeg](img-1.jpeg)\n\nFigure 2: Hierarchical structure reconstruction of a document by integrating the Reading Order and Table of Contents. Blue arrows demonstrate the Text Region Reading Order Relationship, green arrows show the Graphical Region Relationship, and red arrows signify the TOC Relationship. The nodes \"P\", \"S\", \"C\", \"T\" and \"F\" represent Paragraph, Section heading, Caption, Table and Footnote, respectively.\ntree represent various page objects (e.g., section, paragraph, figure, caption) of the document, while the edges signify the hierarchical relationships and connections between these page objects. Given a multi-page document $D$ comprised of $D_{1}, D_{2}, \\ldots, D_{n}$, where $D_{i}$ represents an individual page within document $D$, the primary objective of hierarchical document structure analysis is to reconstruct its hierarchical structure tree $H$, consisting of both page objects and hierarchical relationships as follows:\n\nPage Objects $\\left(O_{i}, i=1, \\ldots, m\\right)$ refer to the various page objects within document $D$. Each page object is described by three attributes: 1) its logical role category $\\mathbf{c}_{i} \\in C$ (e.g., title, section heading, table, figure, etc.); 2) its bounding box coordinates $\\mathbf{b}_{i}$; 3) its basic semantic units (not useful for graphical page objects and we use OCR'd text-lines as basic semantic units).\n\nHierarchical Relationships $\\left(R_{i j}, i, j=1, \\ldots, m\\right)$ describe the relationships between page object pairs and are represented by triplets $\\left(O_{i}, \\boldsymbol{r}_{i j}, O_{j}\\right)$. 
Each triplet includes a subject page object $O_{i}$, an object page object $O_{j}$, and a relation type $\\boldsymbol{r}_{i j} \\in \\Phi$. Based on the categories of $O_{i}$ and $O_{j}$, we define the following three relationship types: 1) Text Region Reading Order Relationship between main body text regions, 2) Graphical Region Relationship between caption, footnote and graphical page objects, i.e., tables or figures; 3) Table of Contents Relationship between section heading regions.\n\nThe combination of page objects and hierarchical relationships is sufficient to reconstruct the hierarchical tree $H$ for a document, as illustrated in Fig. 2. Conversely, the hierarchy tree $H$ can be used to extract various hierarchical relationships as needed, further emphasizing its importance in the process of hierarchical document structure analysis. For instance, the reading order sequence can be obtained by performing a preorder traversal on the hierarchical tree $H$. Based on the problem description and objectives of hierarchical document structure analysis, we divide it into the following three distinct sub-tasks, which correspond to our proposed three-stage framework:\n\n- Page Object Detection (Detect stage) aims to identify each individual page object $O_{i}$ (e.g., text regions, images, tables) within each page of the document $D$ and assign a logical role to each detected page object (e.g., section headings, captions, footnotes).\n- Reading Order Prediction (Order stage) focuses on determining the reading sequence of detected page objects based on their spatial arrangement within the document $D$. The reading order is represented as a permutation of the indices of the detected page objects.\n- Table of Contents Extraction (Construct stage) aims to extract the table of contents within document $D$, which involves constructing a hierarchy tree that summarizes the overall hierarchical structure $H$. 
The hierarchy tree comprises a list of section headings and their hierarchical levels.\n\nBy integrating the results from all three sub-tasks, the hierarchical document structure tree $H$ can be effectively reconstructed, offering a more comprehensive understanding of complex documents
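As noted above, the reading order sequence can be recovered by a preorder traversal of the hierarchy tree $H$. A minimal Python sketch; the dict-of-children encoding and the node names are hypothetical illustrations, not a format defined here.

```python
def preorder(tree, node="root"):
    """Preorder traversal of a hierarchy tree: visit a node, then each of
    its children in order. The synthetic "root" node itself is skipped."""
    order = [] if node == "root" else [node]
    for child in tree.get(node, []):
        order += preorder(tree, child)
    return order

# Toy hierarchy: two section headings, each owning paragraph nodes.
doc = {
    "root": ["S1", "S2"],
    "S1": ["P1", "P2"],
    "S2": ["P3"],
}
```

Running `preorder(doc)` yields `["S1", "P1", "P2", "S2", "P3"]`, i.e., the natural reading sequence of the toy document.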
"text_token_count": 846
},
{
"title": "4. Methodology",
"line_num": 110,
"level": 2,
"text": "## 4. Methodology",
"text_token_count": 6273
},
{
"title": "4.1. Overview",
"line_num": 112,
"level": 3,
"text": "### 4.1. Overview\n\nOur newly proposed tree construction based approach for hierarchical document structure analysis, named Detect-Order-Construct, is illustrated in Fig. 1. This approach comprises three main components: 1) A Detect stage that identifies individual page objects within the document rendering and assigns a logical role to each detected page object (i.e., page object detection); 2) An Order stage responsible for determining the sequential order of the page objects (i.e., reading order prediction); and 3) A Construct stage that extracts the abstract hierarchy tree (i.e., table of contents extraction). By integrating the outputs from the previous tasks, we can effectively reconstruct a complete hierarchical document structure tree (i.e., hierarchical document structure reconstruction).\n\nIn our approach, we uniformly define the tasks of these three stages as relation prediction problems and present a type of multi-modal, transformer-based relation prediction models to tackle all tasks effectively. Our proposed relation prediction model approaches relation prediction as a dependency parsing task and incorporates structure-aware designs that align with the chain structure of reading order and the tree structure of table of contents. Utilizing our novel techniques and the proposed framework, we develop an effective end-to-end solution for hierarchical document structure analysis, which comprises three modules: the Detect module, the Order module, and the Construct module. We elaborate on the details of these three modules in Sections 4.2, 4.3, and 4.4, respectively.\n\n![img-2.jpeg](img-2.jpeg)\n\nFigure 3: The overall architecture of our Detect module.",
"text_token_count": 315
},
{
"title": "4.2. Detect Module",
"line_num": 122,
"level": 3,
"text": "### 4.2. Detect Module\n\nThe proposed Detect module consists of three primary components: 1) A shared visual backbone network designed to extract multi-scale feature maps from input document images; 2) A top-down graphical page object detection model for detecting graphical page objects, such as tables, figures, and displayed formulas; 3) A bottom-up text region detection model that groups text-lines located outside graphical page objects into text regions, based on the intra-region reading order, and identifies the logical role of each text region. The overall architecture of the Detect module is illustrated in Fig. 3. In our conference paper [10], we selected a ResNet-50 network as the backbone network to generate multi-scale feature maps and the DINO [69] as the top-down graphical page object detector to localize these graphical objects. However, any suitable visual backbone network and object detection or instance segmentation model can be readily incorporated into our Detect module. In this paper, we primarily concentrate on the details of the newly proposed *Bottom-up Text Region Detection Model*.\n\nA text region is a semantic unit of writing that comprises a group of text-lines arranged in natural reading order and associated with a logical label, such as paragraph, list/list-item, title, section heading, header, footer, footnote, and caption. Given a document page rendering $D_{i}$ composed of $n$ text-lines $[t_1, t_2, ..., t_n]$, the objective of our bottom-up text region detection model is to group these text-lines into distinct text regions according to the intra-region reading order and to recognize the logical role of each text region. In this study, we assume that the bounding boxes and textual contents of text-lines have already been provided by a PDF parser or OCR engine. 
Based on the detection results of the top-down graphical page object detection model, we initially filter out those text-lines located inside graphical page objects and then\n\n![img-3.jpeg](img-3.jpeg)\n\nFigure 4: A schematic view of the proposed bottom-up text region detection model.\n\nutilize the remaining text-lines as input. As depicted in Fig. 4, our bottom-up text region detection model consists of a multi-modal feature extraction module, a multi-modal feature enhancement module, and two prediction heads, i.e., an intra-region reading order relation prediction head and a logical role classification head. The detailed illustrations of the multi-modal feature enhancement module and the two prediction heads can be found in Fig. 5.",
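Once intra-region successor links are predicted, grouping text-lines into regions reduces to following successor chains. A minimal sketch, assuming each line carries a predicted successor index with a self-pointer marking the last line of its region (the convention defined for the relation head in Section 4.2.3):

```python
def group_regions(successor):
    """Group text-lines into regions from predicted successor links.

    successor[i] is the index of line i's succeeding line within its
    region; successor[i] == i marks the last (or only) line of a region.
    Returns one list of line indices per region, in chain order.
    """
    n = len(successor)
    # A line is a region head if no other line points to it.
    is_head = [True] * n
    for i, j in enumerate(successor):
        if j != i:
            is_head[j] = False
    regions = []
    for i in range(n):
        if not is_head[i]:
            continue
        chain, cur = [i], i
        while successor[cur] != cur:  # follow the chain to the self-pointer
            cur = successor[cur]
            chain.append(cur)
        regions.append(chain)
    return regions
```

For example, predicted successors `[1, 2, 2, 4, 4]` yield the regions `[[0, 1, 2], [3, 4]]`.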
"text_token_count": 2924
},
{
"title": "4.2.1. Multi-modal Feature Extraction Module",
"line_num": 134,
"level": 4,
"text": "#### 4.2.1. Multi-modal Feature Extraction Module\n\nIn this module, we extract the visual embedding, text embedding, and 2D Positional Embedding for each text-line.\n\n**Visual Embedding.** As shown in Fig. 4, we first resize $C_4$ and $C_5$ to the size of $C_3$ and then concatenate these three feature maps along the channel axis, which are fed into a $3 \\times 3$ convolutional layer to generate a feature map $C_{fuse}$ with 256 channels. For each text-line $t_i$, we adopt the RoIAlign algorithm [6] to extract $7 \\times 7$ feature maps from $C_{fuse}$ based on its bounding box $b_{t_i} = (x^1_i, y^1_i, x^2_i, y^2_i)$, where $(x^1_i, y^1_i)$, $(x^2_i, y^2_i)$ represent the coordinates of its upper left and bottom right corners, respectively. The final visual embedding $V_{t_i}$ of $t_i$ can be represented as:\n\n$$V_{t_i} = LN(ReLU(FC(ROIAlign(C_{fuse}, b_{t_i})))),\\tag{1}$$\n\nwhere FC is a fully-connected layer with 1,024 nodes and LN represents Layer Normalization [70].\n\n**Text Embedding.** We leverage the pre-trained language model BERT [71] to extract the text embedding of each text-line. Specifically, we first serialize all the text-lines in a document image into a 1D sequence by reading them in a top-left to bottom-right order and tokenize the text-line sequence into a\n\n![img-4.jpeg](img-4.jpeg)\n\nFigure 5: Illustration of (a) Multi-modal Feature Enhancement Module; (b) Logical Role Classification Head; (c) Reading Order Relation Prediction Head in bottom-up text region detection model.\nsub-word token sequence, which is then fed into BERT to get the embedding of each token. After that, we average the embeddings of all the tokens in each text-line $t_{i}$ to obtain its text embedding $T_{t_{i}}$, followed by a fully-connected layer with 1,024 nodes to make the dimension the same as that of $V_{t_{i}}$ :\n\n$$\nT_{t_{i}}=L N\\left(\\operatorname{ReLU}\\left(F C\\left(T_{t_{i}}\\right)\\right)\\right)\n$$\n\n2D Positional Embedding. 
For each text-line $t_{i}$, we encode its bounding box and size information as its 2D Positional Embedding $B_{t_{i}}$ :\n\n$$\nB_{t_{i}}=L N\\left(M L P\\left(x_{i}^{1} / W, y_{i}^{1} / H, x_{i}^{2} / W, y_{i}^{2} / H, w_{i} / W, h_{i} / H\\right)\\right)\n$$\n\nwhere $\\left(w_{i}, h_{i}\\right)$ and $(W, H)$ represent the width and height of $b_{t_{i}}$ and the input image, respectively. MLP consists of 2 fully-connected layers with 1,024 nodes, each of which is followed by ReLU.\n\nFor each text-line $t_{i}$, we concatenate its visual embedding $V_{t_{i}}$, text embeddings $T_{t_{i}}$, and 2D Positional Embedding $B_{t_{i}}$ to obtain its multi-modal representation $U_{t_{i}}$.\n\n$$\nU_{t_{i}}=F C\\left(\\operatorname{Concat}\\left(V_{t_{i}}, T_{t_{i}}, B_{t_{i}}\\right)\\right)\n$$\n\nwhere FC is a fully-connected layer with 1,024 nodes.",
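A minimal numpy sketch of the concatenate-then-project fusion above: random arrays stand in for the actual RoIAlign, BERT, and MLP outputs, and the weight matrix is untrained, so only the shapes and the structure of the computation are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 1024  # 4 text-lines; 1,024-d embeddings as in the paper

def layer_norm(x, eps=1e-5):
    # Normalize each embedding to zero mean / unit variance (no learned scale).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Hypothetical stand-ins for the per-line visual (RoIAlign), text (BERT),
# and 2D positional (MLP) embeddings, each already LN(ReLU(FC(.)))-shaped.
V = layer_norm(np.maximum(rng.normal(size=(n, d)), 0))  # V_{t_i}
T = layer_norm(np.maximum(rng.normal(size=(n, d)), 0))  # T_{t_i}
B = layer_norm(rng.normal(size=(n, d)))                 # B_{t_i}

# Final FC: concatenate the three modalities and project back to d dims.
W = rng.normal(scale=0.02, size=(3 * d, d))
U = np.concatenate([V, T, B], axis=-1) @ W  # multi-modal representation U_{t_i}
```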
"text_token_count": 846
},
{
"title": "4.2.2. Multi-modal Feature Enhancement Module",
"line_num": 171,
"level": 4,
"text": "#### 4.2.2. Multi-modal Feature Enhancement Module\n\nAs shown in Fig. 5, we use a lightweight Transformer encoder to further enhance the multi-modal representations of text-lines by modeling their interactions with a self-attention mechanism. Each text-line is treated as a token of the Transformer encoder and its multi-modal representation is taken as the input embedding:\n\n$$\nF_{t}=\\text { TransformerEncoder }\\left(U_{t}\\right)\n$$\n\nwhere $U_{t}=\\left[U_{t_{1}}, U_{t_{2}}, \\ldots, U_{t_{n}}\\right]$ and $F_{t}=\\left[F_{t_{1}}, F_{t_{2}}, \\ldots, F_{t_{n}}\\right]$ are the input and output embeddings of the Transformer encoder, and $n$ is the number of input text-lines. To save computation, we use only a 1-layer Transformer encoder, where the number of heads, the hidden state dimension, and the feed-forward network dimension are set to 12, 768, and 2,048, respectively.",
"text_token_count": 232
},
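The core of this enhancement step is scaled dot-product self-attention over the text-line tokens. A minimal single-head sketch (residual connection, feedforward network, multi-head split, and LayerNorm omitted; weight matrices are placeholders):

```python
import numpy as np

def enhance_textlines(U, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over text-line
    embeddings U (n x d): every output row is a convex combination of
    the value projections of all text-lines."""
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention rows sum to 1
    return A @ V
```

Each text-line thus aggregates context from every other text-line on the page, which is what lets the later heads reason about relations rather than isolated boxes.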
{
"title": "4.2.3. Intra-region Reading Order Relation Prediction Head",
"line_num": 183,
"level": 4,
"text": "#### 4.2.3. Intra-region Reading Order Relation Prediction Head\n\nWe propose to use a relation prediction head to predict intra-region reading order relationships between text-lines. Given a text-line $t_{i}$, if a text-line $t_{j}$ is its succeeding text-line in the same text region, we define that there exists an intra-region reading order relationship $\\left(t_{i} \\rightarrow t_{j}\\right)$ pointing from text-line $t_{i}$ to text-line $t_{j}$. If text-line $t_{i}$ is the last (or only) text-line in a text region, its succeeding text-line is considered to be itself. Unlike many previous methods that consider relation prediction as a binary classification task $[42,45]$, we treat relation prediction as a dependency parsing task and use a softmax cross-entropy loss to replace the standard binary cross-entropy loss during optimization by following [72]. Moreover, we adopt a spatial compatibility feature introduced in [73] to effectively model spatial interactions between text-lines for relation prediction.\n\nSpecifically, we use a multi-class (i.e., $n$-class) classifier to calculate a score $s_{i j}$ to estimate how likely $t_{j}$ is the succeeding text-line of $t_{i}$ as follows:\n\n$$\n\\begin{gathered}\nf_{i j}=F C_{q}\\left(F_{t_{i}}\\right) \\circ F C_{k}\\left(F_{t_{j}}\\right)+\\operatorname{MLP}\\left(r_{b_{t_{i}}, b_{t_{j}}}\\right) \\\\\ns_{i j}=\\frac{\\exp \\left(f_{i j}\\right)}{\\sum_{N} \\exp \\left(f_{i j}\\right)}\n\\end{gathered}\n$$\n\nwhere each of $F C_{q}$ and $F C_{k}$ is a single fully-connected layer with 2,048 nodes to map $F_{t_{i}}$ and $F_{t_{j}}$ into different feature spaces; o denotes dot product operation; MLP consists of 2 fully-connected layers with 1,024 nodes and 1 node respectively; $r_{b_{t_{i}}, b_{t_{j}}}$ is a spatial compatibility feature vector between $b_{t_{i}}$ and $b_{t_{j}}$, which is a concatenation of three 6-d vectors:\n\n$$\nr_{b_{t_{i}}, b_{t_{j}}}=\\left(\\Delta\\left(b_{t_{i}}, b_{t_{j}}\\right), 
\\Delta\\left(b_{t_{i}}, b_{t_{i j}}\\right), \\Delta\\left(b_{t_{j}}, b_{t_{i j}}\\right)\\right)\n$$\n\nwhere $b_{t_{i j}}$ is the union bounding box of $b_{t_{i}}$ and $b_{t_{j}}$; $\\Delta(\\cdot, \\cdot)$ represents the box delta between any two bounding boxes. Taking $\\Delta\\left(b_{t_{i}}, b_{t_{j}}\\right)$ as an example, $\\Delta\\left(b_{t_{i}}, b_{t_{j}}\\right)=\\left(d_{i j}^{x_{\\text {ctr }}}, d_{i j}^{y_{\\text {ctr }}}, d_{i j}^{w}, d_{i j}^{h}, d_{j i}^{x_{\\text {ctr }}}, d_{j i}^{y_{\\text {ctr }}}\\right)$, where each dimension is given by:\n\n$$\n\\begin{aligned}\nd_{i j}^{x_{\\text {ctr }}} & =\\left(x_{i}^{\\text {ctr }}-x_{j}^{\\text {ctr }}\\right) / w_{i}, & d_{i j}^{y_{\\text {ctr }}} & =\\left(y_{i}^{\\text {ctr }}-y_{j}^{\\text {ctr }}\\right) / h_{i} \\\\\nd_{i j}^{w} & =\\log \\left(w_{i} / w_{j}\\right), & d_{i j}^{h} & =\\log \\left(h_{i} / h_{j}\\right) \\\\\nd_{j i}^{x_{\\text {ctr }}} & =\\left(x_{j}^{\\text {ctr }}-x_{i}^{\\text {ctr }}\\right) / w_{j}, & d_{j i}^{y_{\\text {ctr }}} & =\\left(y_{j}^{\\text {ctr }}-y_{i}^{\\text {ctr }}\\right) / h_{j}\n\\end{aligned}\n$$\n\nwhere $\\left(x_{i}^{\\text {ctr }}, y_{i}^{\\text {ctr }}\\right)$ and $\\left(x_{j}^{\\text {ctr }}, y_{j}^{\\text {ctr }}\\right)$ are the center coordinates of $b_{t_{i}}$ and $b_{t_{j}}$, respectively.\n\n![img-5.jpeg](img-5.jpeg)\n\nFigure 6: Architecture of our proposed Order module for reading order prediction.\n\nWe select the highest score from scores $\\left[s_{i j}, j=1,2, \\ldots, n\\right]$ and output the corresponding text-line as the succeeding text-line of $t_{i}$. To achieve higher relation prediction accuracy for the intra-region reading order relationship, which has a chain structure, we employ an additional relation prediction head to further identify the preceding text-line for each text-line. The prediction results from both relation prediction heads are then combined to obtain the final results. 
Based on the predicted intra-region reading order relationships, we group text-lines into text regions using a Union-Find algorithm. The bou
"text_token_count": 1247
},
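The spatial compatibility feature of Section 4.2.3 is purely geometric, so it can be sketched directly from the definitions above (function names are illustrative; boxes are `(x1, y1, x2, y2)` tuples):

```python
import math

def box_delta(bi, bj):
    """6-d box delta Delta(b_i, b_j): relative center offsets in both
    directions plus log width/height ratios."""
    xi, yi = (bi[0] + bi[2]) / 2, (bi[1] + bi[3]) / 2   # center of b_i
    xj, yj = (bj[0] + bj[2]) / 2, (bj[1] + bj[3]) / 2   # center of b_j
    wi, hi = bi[2] - bi[0], bi[3] - bi[1]
    wj, hj = bj[2] - bj[0], bj[3] - bj[1]
    return ((xi - xj) / wi, (yi - yj) / hi,
            math.log(wi / wj), math.log(hi / hj),
            (xj - xi) / wj, (yj - yi) / hj)

def spatial_compatibility(bi, bj):
    """18-d feature r_{b_i,b_j}: deltas of the two boxes against each
    other and against their union bounding box, concatenated."""
    union = (min(bi[0], bj[0]), min(bi[1], bj[1]),
             max(bi[2], bj[2]), max(bi[3], bj[3]))
    return box_delta(bi, bj) + box_delta(bi, union) + box_delta(bj, union)
```

Normalizing offsets by each box's own size and taking log size ratios makes the feature invariant to page scale, which is why it transfers across resolutions.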
{
"title": "4.2.4. Logical Role Classification Head",
"line_num": 218,
"level": 4,
"text": "#### 4.2.4. Logical Role Classification Head\n\nGiven the enhanced multi-modal representations of text-lines $F_{t}=\\left[F_{t_{1}}, F_{t_{2}}, \\ldots, F_{t_{n}}\\right]$, we add a multi-class classifier to predict a logical role label for each text-line and determine the logical role of each text region by the plurality voting of all its constituent text-lines.",
"text_token_count": 90
},
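The region-level decision in Section 4.2.4 is a plurality vote over the per-text-line predictions, which can be written in a few lines (a sketch; ties fall back to first-seen order here, an assumption):

```python
from collections import Counter

def region_logical_role(line_roles):
    """Plurality vote: the region's role is the most frequent role
    among its constituent text-lines."""
    return Counter(line_roles).most_common(1)[0][0]
```

So a region whose lines are predicted `["paragraph", "paragraph", "caption"]` is labeled `paragraph`, making the region label robust to isolated line-level mistakes.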
{
"title": "4.3. Order Module",
"line_num": 222,
"level": 3,
"text": "### 4.3. Order Module\n\nThe Order module focuses on determining the reading sequence of graphical page objects and text regions identified by the Detect module within document $D$. Similar to the bottom-up text region detection model employed in the Detect module, we also utilize our proposed multi-modal, transformer-based relation prediction model to predict the inter-region reading order relationships among the recognized page objects. The Order module processes the detected page objects as input and employs an attention-based approach to integrate the features of text-lines belonging to the same text region, thereby achieving a more efficient\n\nfeature representation of the text region. Furthermore, we define two categories of inter-region reading order relationships: (1) Text region reading order relationships between main body text regions, (2) Graphical region reading order relationships between captions/footnotes and graphical page objects such as tables and figures. Consequently, we incorporate an additional inter-region reading order relation classification head to predict relation types. A detailed illustration of the Order module can be found in Fig. 6.",
"text_token_count": 1387
},
{
"title": "4.3.1. Multi-modal Feature Extraction Module",
"line_num": 228,
"level": 4,
"text": "#### 4.3.1. Multi-modal Feature Extraction Module\n\nFollowing Eqs. (1) and (3) as described in Section 4.2.1, we fuse the visual embedding and the 2D positional embedding to obtain a multi-modal representation $U_{O_{m}}$ for each graphical page object $O_{m}$ in a similar manner. For each detected text region page object $O_{n}$ consisting of text-lines $\\left[t_{n_{1}}, t_{n_{2}}, \\ldots, t_{n_{k}}\\right]$, we propose an attention fusion model to integrate the features of text-lines $\\left[F_{t_{n_{1}}}, F_{t_{n_{2}}}, \\ldots, F_{t_{n_{k}}}\\right]$ produced by Eq. (5), thereby forming a multi-modal representation $U_{O_{n}}$ for this text region as follows:\n\n$$\n\\begin{gathered}\n\\alpha_{t_{n_{j}}}=F C_{1}\\left(\\operatorname{tanh}\\left(F C_{2}\\left(F_{t_{n_{j}}}\\right)\\right)\\right) \\\\\nw_{t_{n_{j}}}=\\frac{\\exp \\alpha_{t_{n_{j}}}}{\\sum_{j} \\exp \\alpha_{t_{n_{j}}}} \\\\\nU_{O_{n}}=\\sum_{j} w_{t_{n_{j}}} F_{t_{n_{j}}}\n\\end{gathered}\n$$\n\nwhere both $F C_{1}$ and $F C_{2}$ are single fully-connected layers with 1,024 and 1 nodes, respectively. Furthermore, for each page object, we derive a region type embedding for each page object as follows:\n\n$$\nR_{O_{i}}=L N\\left(\\operatorname{ReLU}\\left(F C\\left(\\operatorname{Embedding}\\left(r_{O_{i}}\\right)\\right)\\right)\\right)\n$$\n\nwhere Embedding is an embedding layer with 1,024 hidden dimension and $r_{O_{i}}$ is the logical role of the page object $O_{i}$.\n\nLastly, we concatenate each page object's multi-modal representation $U_{O_{i}}$ and region type embedding $R_{O_{i}}$ to obtain its final representation $\\hat{U}_{O_{i}}$ as follows:\n\n$$\n\\hat{U}_{O_{i}}=F C\\left(\\operatorname{Concat}\\left(U_{O_{i}}, R_{O_{i}}\\right)\\right)\n$$\n\nwhere $F C$ is a fully-connected layer with 1,024 nodes.",
"text_token_count": 576
},
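The attention fusion that pools text-line features into one region vector follows the three equations above directly. A minimal sketch (weights `W2`, `b2`, `w1`, `b1` stand in for the learned FC2/FC1 parameters):

```python
import numpy as np

def attention_pool(F, W2, b2, w1, b1=0.0):
    """Attention fusion of Section 4.3.1: a scalar score per text-line
    via FC1(tanh(FC2(f_j))), softmax over the region's lines, then a
    weighted sum -> one region representation."""
    alpha = np.tanh(F @ W2 + b2) @ w1 + b1   # one score per text-line
    alpha -= alpha.max()                      # numerical stability
    w = np.exp(alpha)
    w /= w.sum()                              # softmax weights
    return w @ F
```

With all scores equal the pooling degenerates to a plain average of the line features; training the score layers lets salient lines (e.g. a region's first line) dominate the region vector.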
{
"title": "4.3.2. Multi-modal Feature Enhancement Module",
"line_num": 256,
"level": 4,
"text": "#### 4.3.2. Multi-modal Feature Enhancement Module\n\nAs illustrated in Fig. 6, we adopt a similar approach to previous multi-modal feature enhancement module in the Group stage. In this case, we utilize a three-layer Transformer encoder to further improve the multi-modal representations of page objects by modeling their interactions using a self-attention mechanism. Each page object is treated as a token of the Transformer encoder, and its multi-modal representation serves as the input embedding:\n\n$$\nF_{O}=\\operatorname{TransformerEncoder}\\left(\\hat{U}_{O}\\right)\n$$\n\n![img-6.jpeg](img-6.jpeg)\n\nFigure 7: Illustration of the Construct module.\nwhere $\\hat{U}_{O}=\\left[\\hat{U}_{O_{1}}, \\hat{U}_{O_{2}}, \\ldots, \\hat{U}_{O_{n}}\\right]$ and $F_{O}=\\left[F_{O_{1}}, F_{O_{2}}, \\ldots, F_{O_{n}}\\right]$ represent the input and output embeddings of the Transformer encoder, and $n$ is the number of the input page objects. The hyperparameters of the transformer encoder are consistent with those in the Detect module, except for the layer number.",
"text_token_count": 264
},
{
"title": "4.3.3. Inter-region Reading Order Relation Prediction Head",
"line_num": 269,
"level": 4,
"text": "#### 4.3.3. Inter-region Reading Order Relation Prediction Head\n\nOwing to the similarity between the inter-region reading order task of the Order module and the intraregion reading order task of the Detect module, we employ an identical structure for the inter-region reading order relation prediction head in both modules. Further details about this head can be found in Section 4.2.3.",
"text_token_count": 80
},
{
"title": "4.3.4. Inter-region Reading Order Relation Classification Head",
"line_num": 273,
"level": 4,
"text": "#### 4.3.4. Inter-region Reading Order Relation Classification Head\n\nWe employ a multi-class classifier to compute the probability distribution across various classes in order to determine the relation type between page object $O_{i}$ and page object $O_{j}$. It works as follows:\n\n$$\n\\begin{gathered}\np_{i j}=\\operatorname{BiLinear}\\left(F C_{q}\\left(F_{O_{i}}\\right), F C_{k}\\left(F_{O_{j}}\\right)\\right) \\\\\nc_{i j}=\\operatorname{argmax}\\left(p_{i j}\\right)\n\\end{gathered}\n$$\n\nwhere both $F C_{q}$ and $F C_{k}$ represent single fully-connected layers with 2,048 nodes, which are used to map $F_{O_{i}}$ and $F_{O_{j}}$ into distinct feature spaces; BiLinear signifies the bilinear classifier; and argmax refers to identifying the index $c_{i j}$ of the maximum value within the given probability distribution $p_{i j}$ as the predicted relation type.\n\n![img-7.jpeg](img-7.jpeg)\n\nFigure 8: Illustration of TOC Relation Prediction Head.",
"text_token_count": 262
},
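The bilinear classifier above scores every relation type `c` as a quadratic form between the two projected object features. A sketch with an explicit per-class weight tensor (the FC_q/FC_k projections are assumed to have been applied already, and the bias term is an assumption):

```python
import numpy as np

def relation_type(Fq, Fk, W, b):
    """Bilinear relation classification: p[c] = Fq^T W[c] Fk + b[c],
    with the predicted type c_ij = argmax_c p[c].
    W has shape (num_classes, d, d); Fq, Fk have shape (d,)."""
    logits = np.einsum('i,cij,j->c', Fq, W, Fk) + b
    return int(np.argmax(logits))
```

Unlike a dot product, the bilinear form lets each relation type learn its own interaction pattern between the two feature spaces.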
{
"title": "4.4. Construct Module",
"line_num": 290,
"level": 3,
"text": "### 4.4. Construct Module\n\nGiven the detected section headings $\\left[s e c_{1}, s e c_{2}, \\ldots, s e c_{k-1}, s e c_{k}\\right]$ arranged according to the predicted reading order sequence for document $D$, the goal of the Construct module is to generate a tree structure representing the hierarchical table of contents. As illustrated in Fig. 7, we extract the multi-modal representation $F_{S_{i}}$ of each section heading $s e c_{i}$ from all page objects' multi-modal representation $F_{O}$ based on the logical role. Subsequently, we input all section headings' representation $U_{S}=\\left[U_{S_{1}}, U_{S_{2}}, \\ldots, U_{S_{k}}\\right]$ into a transformer encoder to further enhance the representations. However, unlike the transformer encoder employed in the Detect module and the Order module, both of which are order-agnostic, the input sequence $U_{S}$ has the correct reading order predicted by the Order module, allowing us to add a positional encoding to convey the reading order information. To incorporate the relative position in the reading order sequence and accommodate a larger scale of page numbers in the document, we utilize an efficient positional encoding method called Rotary Positional Embedding (RoPE) [74]. RoPE encodes the absolute position using a rotation matrix and simultaneously includes the explicit relative position dependency in the self-attention formulation. Following the Multi-modal Feature Enhancement Module, we generate the enhanced representations $F_{S}=\\left[F_{S_{1}}, F_{S_{2}}, \\ldots, F_{S_{k}}\\right]$ for section headings. Finally, we introduce a tree-aware TOC relation prediction head to predict the TOC relationships among these section headings. The specially designed relation prediction head is illustrated in Fig. 8.",
"text_token_count": 1640
},
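RoPE's key property is that rotating each coordinate pair by a position-dependent angle makes attention dot products depend only on relative position. An illustrative sketch (not the paper's implementation):

```python
import math

def rope(x, pos, base=10000.0):
    """Rotary Positional Embedding: rotate each pair (x[2i], x[2i+1])
    of a d-dimensional vector by the angle pos * base**(-2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved, and `dot(rope(x, m), rope(y, n))` depends only on the offset `n - m` — exactly the relative-position dependency the Construct module wants in its self-attention.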
{
"title": "4.4.1. TOC Relation Prediction Head",
"line_num": 294,
"level": 4,
"text": "#### 4.4.1. TOC Relation Prediction Head\n\nDuring the generation of the ordered tree for Table of Contents, solely relying on the relationship features between child and parent nodes has proven to be insufficient. Some prior studies [7, 62, 64] have already\n\nobserved that incorporating information from sibling nodes can lead to an improved generation of the TOC. Inspired by these works, we propose two types of TOC relationships between section heads to further enhance the TOC generation process: parent-child relationships and sibling relationships.\n\nThe parent-child relationship is relatively straightforward: when a section heading $s e c_{i}$ serves as the parent node for another section heading $s e c_{j}$ within the TOC tree structure, we define a parent-child relationship $\\left(s e c_{j} \\rightarrow s e c_{i}\\right)$ that points from $s e c_{j}$ to $s e c_{i}$. Sibling relationships in a TOC tree are established as follows: if section heading $s e c_{i}$ acts as the left sibling of section heading $s e c_{j}$, then a sibling relationship $\\left(s e c_{j} \\rightarrow s e c_{i}\\right)$ is present. In cases where a section heading lacks a parent node or left sibling node, its parent-child or sibling relationship is defined as pointing to itself. This approach aims to provide a more comprehensive representation of the relationships among section heads, ultimately leading to a more accurate and robust TOC generation.\n\nAs illustrated in Fig. 8, our proposed TOC Relation Prediction Head comprises two distinct relation prediction heads for the parent-child and sibling relationships, respectively. Both relation prediction heads in our proposed module employ the same network structure. To elaborate, we use the relation prediction head for the parent-child relationship as an example. 
Specifically, we implement a multi-class (k-class) classifier to compute a score $s_{i j}^{p}$, which estimates the likelihood of $s e c_{j}$ being the parent node of $s e c_{i}$. The calculation is as follows:\n\n$$\n\\begin{gathered}\nf_{i j}=F C_{q}\\left(F_{S_{i}}\\right) \\circ F C_{k}\\left(F_{S_{j}}\\right) \\\\\ns_{i j}^{p}=\\frac{\\exp \\left(f_{i j}\\right)}{\\sum_{j} \\exp \\left(f_{i j}\\right)}\n\\end{gathered}\n$$\n\nwhere each of $F C_{q}$ and $F C_{k}$ represents a single fully-connected layer with 2,048 nodes to map $F_{S_{i}}$ and $F_{S_{j}}$ into distinct feature spaces; $\\circ$ denotes the dot product operation. Similarly, we can obtain the score $s_{i j}^{s}$ to estimate the likelihood of $s e c_{j}$ being the defined sibling node of $s e c_{i}$. This unified network structure allows for efficient and effective prediction of relationships between section heads, contributing to the overall TOC generation process.\n\nIn a manner similar to the previously proposed reading order relation prediction head in Section 4.2.3, we treat relation prediction as a dependency parsing task and employ a softmax cross-entropy loss instead of the standard binary cross-entropy loss during the training phase. During the testing phase, we utilize serial decoding to integrate the outputs of the two relation prediction heads and introduce a tree structure constraint to enhance the final prediction results. Specifically, assuming that $\\left[s e c_{1}, s e c_{2}, \\ldots, s e c_{k}\\right]$ has been sorted according to the predicted reading order, we initialize a tree T containing only one root node, $R O O T$. Subsequently, we devise a tree insertion algorithm, as detailed in Algorithm 1, to insert each section heading in order, ultimately generating a complete table of contents tree. 
This approach ensures that the predicted relationships between section headings are consistent with the hierarchical tree structure, resulting in a more accurate and coherent TOC.\n\nAlgorithm 1 Tree Insertion Algorithm\n\nRequire: Empty Tree $T=\\{R O O T\\}$, Ordered Section Headings $\\left[s e c_{1}, s e c_{2}, \\ldots, s e c_{k}\\right]$,\nParent Score Matrix $\\mathbf{s}^{\\mathbf{p}}$, Sibling Score Matrix $\\mathbf{s}^{\\mathbf{s}}$\
"text_token_count": 1243
},
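Since the Algorithm 1 listing is truncated in this extraction, the serial decoding it describes can be sketched as follows, under the stated tree constraint that each heading may only attach to a node on the rightmost path of the partial tree (index conventions below are assumptions: `sp[i][j]` scores heading `j` as parent of `i`, `sp[i][i]` means "child of ROOT"; `ss[i][j]` scores `j` as left sibling of `i`, `ss[i][i]` means "no left sibling"):

```python
def build_toc_tree(sp, ss):
    """Insert headings 0..k-1 in reading order; for each, pick the
    attachment point on the rightmost path that maximizes the combined
    parent + sibling score. Returns parent[i] (-1 denotes ROOT)."""
    k = len(sp)
    parent = []
    children = {-1: []}          # node -> ordered list of children
    path = [-1]                  # rightmost path, ROOT first
    for i in range(k):
        best_p, best = -1, float('-inf')
        for p in path:
            kids = children.get(p, [])
            s = sp[i][i] if p == -1 else sp[i][p]
            s += ss[i][kids[-1]] if kids else ss[i][i]
            if s > best:
                best, best_p = s, p
        parent.append(best_p)
        children.setdefault(best_p, []).append(i)
        children[i] = []
        # new rightmost path: ancestors of i, then i itself
        path = path[:path.index(best_p) + 1] + [i]
    return parent
```

Restricting candidates to the rightmost path is what enforces consistency with reading order: a new heading can only become a deeper child or a sibling at some ancestor level, never attach into an already-closed subtree.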
{
"title": "5. Experiments",
"line_num": 330,
"level": 2,
"text": "## 5. Experiments",
"text_token_count": 8320
},
{
"title": "5.1. Datasets and Evaluation Protocols",
"line_num": 332,
"level": 3,
"text": "### 5.1. Datasets and Evaluation Protocols\n\nIn our conference paper [10], we conducted experiments on two widely-recognized large-scale document layout analysis benchmarks, namely PubLayNet [8] and DocLayNet [9] to validate the effectiveness of our proposed Detect module. In this paper, we carry out extensive experiments on a high-quality public hierarchical document structure reconstruction benchmark, HRDoc [7], to validate the effectiveness of our proposed tree construction based framework. It is important to note that HRDoc solely provides annotations and benchmarks for the logical role classification task and the overall hierarchical structure reconstruction task. However, each sub-task plays a crucial role in hierarchical document structure analysis. Consequently, during the performance evaluation phase, conducting a thorough and rigorous assessment of each involved sub-task is essential.\n\nTo address this issue, we expand upon the foundation of HRDoc and develop a comprehensive benchmark called Comp-HRDoc for hierarchical document structure analysis, which simultaneously evaluates page object detection, reading order prediction, table of contents extraction, and hierarchical document structure reconstruction. It is worth noting that the logical role classification in HRDoc is actually text-line-level, which may not be a fair performance evaluation for top-down approaches. Therefore, we replace it with a more popular and significant subtask, termed page object detection, in our proposed benchmark. To the best of our knowledge, Comp-HRDoc is the first benchmark designed to assess such a diverse array of document structure analysis subtasks. 
Our proposed model has been rigorously evaluated on this benchmark, further demonstrating the superiority of our approach.\n\nPubLayNet [8] is a large-scale dataset for document layout analysis released by IBM that contains 340,391, 11,858, and 11,983 document pages for training, validation, and testing, respectively. All the documents in this dataset are scientific papers publicly available on PubMed Central, and all the ground-truths are automatically generated by matching the XML representations and the content of the corresponding PDF files. It predefines 5 types of page objects, including Text (i.e., Paragraph), Title, List, Figure, and Table. The evaluation metric for PubLayNet is the COCO-style mean average precision (mAP) at multiple intersection over union (IoU) thresholds between 0.50 and 0.95 with a step of 0.05.\n\nDocLayNet [9] is a challenging human-annotated document layout analysis dataset newly released by IBM that contains 69,375, 6,489, and 4,999 document pages for training, testing, and validation, respectively. It covers a variety of document categories, including financial reports, patents, manuals, laws, tenders, and scientific papers. It predefines 11 types of page objects, including Caption, Footnote, Formula, Listitem, Page-footer, Page-header, Picture, Section-header, Table, Text (i.e., Paragraph), and Title. The evaluation metric for DocLayNet is also the COCO-style mean average precision (mAP), consistent with that of PubLayNet.\n\nHRDoc [7] is a human-annotated dataset specifically designed to facilitate hierarchical document structure reconstruction. It features line-level annotations and cross-page relations, aiming to recover the semantic structure of PDF documents. In order to accommodate various layout types, the HRDoc dataset is divided into two parts. The first part, HRDoc-Simple (HRDS), consists of 1,000 documents exhibiting similar layouts. 
The second part, HRDoc-Hard (HRDH), encompasses 1,500 documents with diverse layouts. This heterogeneous collection of documents offers researchers an extensive resource to develop and assess algorithms for hierarchical document structure reconstruction in PDF documents.\n\nTwo evaluation tasks are associated with HRDoc, including semantic unit classification (i.e., logical role classification) and hierarchical structure reconstruction. For the semantic unit classification task, the F1
"text_token_count": 1401
},
{
"title": "5.2. Implementation Details",
"line_num": 352,
"level": 3,
"text": "### 5.2. Implementation Details\n\nWe implement our approach using PyTorch v1.10, and all experiments are conducted on a workstation equipped with 8 Nvidia Tesla V100 GPUs ( 32 GB memory). It is crucial to mention that in PubLayNet, a list constitutes an entire object containing multiple list items with labels that are inconsistent with those of text or titles. To minimize ambiguity, we treat all lists as specific graphical page objects.\n\nSince only the task of page object detection needs to be evaluated on PubLayNet and DocLayNet datasets, we only trained the Detect stage in our framework on these two datasets. In our experiments with PubLayNet and DocLayNet, we leverage three multi-scale feature maps $\\left\\{C_{3}, C_{4}, C_{5}\\right\\}$ from the backbone network, along with the DINO-based graphical page object detection model, to identify graphical objects. In training, the parameters of the CNN backbone network are initialized with a ResNet-50 model [77] pretrained on the ImageNet classification task, while the parameters of the text embedding extractor are initialized with the pretrained $\\mathrm{BERT}_{\\text {BASE }}$ model [71]. We optimize the models using the AdamW [78] algorithm with a batch size of 16 and trained for 12 epochs on PubLayNet and 24 epochs on DocLayNet.\n\nThe learning rate and weight decay are set to $1 \\mathrm{e}-5$ and $1 \\mathrm{e}-4$ for the CNN backbone network, and $2 \\mathrm{e}-5$ and $1 \\mathrm{e}-2$ for $\\mathrm{BERT}_{\\text {BASE }}$, respectively. The learning rate is divided by 10 at the $11^{\\text {th }}$ epoch for PubLayNet and $20^{\\text {th }}$ epoch for DocLayNet. Other hyperparameters of AdamW, including betas and epsilon, are set to ( 0.9 , 0.999 ) and $1 \\mathrm{e}-8$, respectively. We also adopt a multi-scale training strategy, randomly rescaling the shorter side of each image to lengths chosen from [512, 640, 768], ensuring the longer side does not exceed 800. 
During the testing phase, we set the shorter side of the input image to 640.\n\nFor HRDoc and Comp-HRDoc, we utilize four multi-scale feature maps $\\left\\{C_{2}, C_{3}, C_{4}, C_{5}\\right\\}$ from the backbone network, in conjunction with the Mask2Former-based graphical page object detection model, to identify graphical objects. Given that hierarchical document structure analysis requires processing dozens of document pages, we choose the ResNet-18 model as the CNN backbone network to reduce GPU memory requirements. The parameters of the text embedding extractor are also initialized with the pretrained $\\mathrm{BERT}_{\\text {BASE }}$ model. The models are optimized using the AdamW [78] algorithm with a batch size of 1 and trained for 20 epochs on HRDoc and Comp-HRDoc. The initial learning rate and weight decay are set to $2 \\mathrm{e}-4$ and $1 \\mathrm{e}-2$ for the CNN backbone network, and $4 \\mathrm{e}-5$ and $1 \\mathrm{e}-2$ for $\\mathrm{BERT}_{\\text {BASE }}$, respectively. After a 2-epoch warmup period, during which the learning rate increases linearly from 0 to its initial value, it then decreases linearly back to 0 over the remaining epochs. For the multi-scale training strategy, the shorter side of each image is randomly rescaled to a length chosen from $[320,416,512,608,704,800]$, ensuring that the longer side does not exceed 1024. During the testing phase, we set the shorter side of the input image to 512.",
"text_token_count": 833
},
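The warmup-then-linear-decay schedule described above can be sketched as a simple function (stepping the schedule at this granularity, and the function name, are assumptions for illustration):

```python
def learning_rate(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr over warmup_steps, then linear
    decay back to 0 over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With `total_steps=20` epochs, `warmup_steps=2`, and `base_lr=2e-4` (the HRDoc backbone setting), the rate peaks at epoch 2 and reaches exactly 0 at the final epoch.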
{
"title": "5.3. Comparisons with Prior Arts",
"line_num": 362,
"level": 3,
"text": "### 5.3. Comparisons with Prior Arts\n\nThe Detect module we proposed is a novel combination of top-down and bottom-up approaches for page object detection. Therefore, we first validate the effectiveness of our method on two large-scale document layout analysis datasets, i.e., DocLayNet and PubLayNet.\n\nDocLayNet. We compare our proposed Detect module with the other most competitive methods, including Mask R-CNN, Faster R-CNN, YOLOv5, and DINO on DocLayNet. As shown in Table 1, our approach substantially outperforms the closest method YOLOv5 by improving mAP from $76.8 \\%$ to $81.0 \\%$. Considering that DocLayNet is an extremely challenging dataset that covers a variety of document scenarios and contains a large number of text regions with fine-grained logical roles, the superior performance achieved by our proposed approach demonstrates the advantage of our approach.\n\nPubLayNet. We also compare our approach with several state-of-the-art vision-based and multimodal methods on PubLayNet. The experimental results are presented in Table 2 and Table 3. We can see that our approach outperforms all these methods regardless of whether textual features are used in our bottom-up text region detection model.\n\nTo further validate the effectiveness of our proposed tree construction based framework for hierarchical\n\nTable 1: Performance comparisons on the DocLayNet testing set (in \\%). The results of Mask R-CNN, Faster R-CNN, and YOLOv5 are obtained from [9].\n\n| | Human | Mask R-CNN | Faster R-CNN | YOLOv5 | DINO | Ours |\n| :-- | :--: | :--: | :--: | :--: | :--: | :--: |\n| Caption | $84-89$ | 71.5 | 70.1 | 77.7 | $\\mathbf{8 5 . 5}$ | 83.2 |\n| Footnote | $83-91$ | 71.8 | 73.7 | $\\mathbf{7 7 . 2}$ | 69.2 | 69.7 |\n| Formula | $83-85$ | 63.4 | 63.5 | $\\mathbf{6 6 . 2}$ | 63.8 | 63.4 |\n| List-item | $87-88$ | 80.8 | 81.0 | 86.2 | 80.9 | $\\mathbf{8 8 . 6}$ |\n| Page-footer | $93-94$ | 59.3 | 58.9 | 61.1 | 54.2 | $\\mathbf{9 0 . 
0}$ |\n| Page-header | $85-89$ | 70.0 | 72.0 | 67.9 | 63.7 | $\\mathbf{7 6 . 3}$ |\n| Picture | $69-71$ | 72.7 | 72.0 | 77.1 | $\\mathbf{8 4 . 1}$ | 81.6 |\n| Section-header | $83-84$ | 69.3 | 68.4 | 74.6 | 64.3 | $\\mathbf{8 3 . 2}$ |\n| Table | $77-81$ | 82.9 | 82.2 | $\\mathbf{8 6 . 3}$ | 85.7 | 84.8 |\n| Text | $84-86$ | 85.8 | 85.4 | $\\mathbf{8 8 . 1}$ | 83.3 | 84.8 |\n| Title | $60-72$ | 80.4 | 79.9 | 82.7 | 82.8 | $\\mathbf{8 4 . 9}$ |\n| mAP | $82-83$ | 73.5 | 73.4 | 76.8 | 74.3 | $\\mathbf{8 1 . 0}$ |\n\ndocument structure analysis, we performed experiments with our method on both HRDoc and Comp-HRDoc datasets and made thorough comparisons with previous approaches.\n\nHRDoc. As demonstrated in Table 4 and Table 5, we conducted separate performance evaluations for the two tasks in HRDoc, specifically semantic unit classification and hierarchical structure reconstruction. For semantic unit classification, it is evident that our proposed method achieves superior performance in the majority of categories, particularly in the Fstl (Firstline) and Footn (Footnote) classes, where our approach significantly surpasses previous methods. Although the DSPS Encoder is also a multimodal technique that integrates visual and linguistic information, its performance in the Mail category is notably inferior to that of Sentence-BERT. However, on HRDoc-Hard, our method attains an F1 score nearly $5 \\%$ higher than the DSPS Encoder in this category. Regarding hierarchical structure reconstruction, our proposed tree construction based method markedly outperforms the DSPS Encoder. On HRDoc-Hard, we exceed its performance by $16.63 \\%$ and $15.77 \\%$ in Micro-STEDS and Macro-STEDS, respectively. Similarly, on HRDoc-Simple, we surpass the DSPS Encoder by $13.61 \\%$ and $13.36 \\%$ in Micro-STEDS and Macro-STEDS, respectively. 
It is important to highlight that our proposed method evaluates the performance based on the predicted reading order sequence, whereas the DSPS Encoder directly takes advantage of the ground-truth reading order.\n\nComp-HRDoc. As presented in Table 6, we conduct a compre
"text_token_count": 2186
},
{
"title": "5.4. Ablation Studies",
"line_num": 415,
"level": 3,
"text": "### 5.4. Ablation Studies\n\nWe conducted a series of ablation experiments based on Comp-HRDoc to verify the impact of using different modules and modalities.\n\nEffectiveness of the hybrid strategy and multimodality in the Detect module. In this section,\n\nTable 3: Performance comparisons on the PubLayNet test set (in %). Vision and Text stand for using visual and textual features, respectively.\n\n| Method | Modality | Text | Title | List | Table | Figure | mAP |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Faster R-CNN [8] | Vision | 91.3 | 81.2 | 88.5 | 94.3 | 94.5 | 90.0 |\n| Mask R-CNN [8] | | 91.7 | 82.8 | 88.7 | 94.7 | 95.5 | 90.7 |\n| DocInsightAI [32] | | 94.5 | 88.3 | 94.8 | 95.8 | 97.5 | 94.2 |\n| SCUT [32] | | 94.3 | 89.7 | 94.3 | 96.6 | 97.7 | 94.5 |\n| SRK [32] | | 94.7 | 90.0 | 95.1 | 97.2 | 98.0 | 95.0 |\n| SiliconMinds [32] | | 96.2 | 89.8 | 94.6 | 97.0 | 97.6 | 95.0 |\n| VSR [32] | Vision+Text | 96.7 | 92.3 | 94.6 | 97.0 | 97.9 | 95.7 |\n| Ours | Vision | 95.0 | 96.4 | 95.2 | 97.0 | 97.8 | 96.3 |\n| Ours | Vision+Text | 95.0 | 96.6 | 95.7 | 97.3 | 97.7 | 96.5 |\n\nwe first evaluate the effectiveness of the proposed hybrid strategy in the Detect module. To this end, we train two baseline models: 1) a Mask2Former baseline to detect both graphical page objects and text regions and 2) a hybrid model (denoted as Hybrid (V)) that leverages Mask2Former for graphical object detection and only uses visual and 2D position features for bottom-up text region detection. As shown in the first two rows of Table 7, compared with the Mask2Former-R50 model, the Hybrid-R18 (V) model can achieve comparable graphical page object detection results but much higher text region detection accuracy on Comp-HRDoc, leading to a 9.86% improvement in terms of segmentation-based mAP. In particular, the Hybrid-R18 (V) model can significantly improve small-scale text region detection performance, e.g., 84.67% vs. 68.97% for Page-footnote, 95.08% vs. 
59.01% for Page-header and 95.93% vs. 62.68% for Page-footer. These experimental results clearly demonstrate the effectiveness of the proposed hybrid strategy that combines the best of both top-down and bottom-up methods. In addition, we also conducted an ablation experiment to explore the effectiveness of the text modality in the Detect module, as depicted in the last two rows of Table 7. We find that the hybrid model with text modality (denoted as Hybrid (V+T)) achieves much better performance in semantically sensitive categories, such as Author, Mail, and Affiliate, leading to a 4.66% improvement in terms of segmentation-based mAP. Notably, we have observed many cases of inconsistent paragraph annotations in HRDoc, which might be one of the reasons for the relatively lower performance in the Para (Paragraph) category. More ablation studies in the Detect module can be found in our conference paper [10].\n\nEffectiveness of multimodality in the Construct module. In this study, we conducted an ablation experiment to investigate the effects of different modalities, specifically text and image modalities. To study the impact of section numbers on the task of table of contents extraction, we removed the section numbers from the text content of section headings and examined the resulting influence on the extraction performance.\n\nTable 4: Comparison results of different baseline models in the semantic unit classification task on HRDoc (in %). F1 means F1-score. The results of Cascade-RCNN, ResNet+RoIAlign, Sentence-BERT and DSPS Encoder are all obtained from [7].\n\n| Method | Title | Author | Mail | Affili | Sect | Fstl | Paral | Table | Fig | Cap | Equ | Foot | Head | Footn | Avg. F1 (Micro) | Avg. F1 (Macro) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade-RCNN | 81.50 | 49.77 | 33.39 | 49.34 | 75.92 | 64.96 | 77.86 | 69.96 | 72.22 | 43.72 | 68.84 | 70.91 | 71.00 | 52.67 | 73.37 |\n| ResNet+RoIAlign | 82.",
"text_token_count": 3363
},
{
"title": "5.5. Limitations of Our Approach",
"line_num": 495,
"level": 3,
"text": "### 5.5. Limitations of Our Approach\n\nWhile our proposed end-to-end system demonstrates outstanding performance in a majority of tasks, as corroborated by prior experiments, it is not without limitations. For instance, we presume that the section headers supplied to the Construct module from previous stages are accurately recognized. Consequently, the recognition performance of section headings accounts for part of the Construct module's bottleneck. Moreover, the information regarding section numbers is vital for harnessing the semantics of section headings\n\nTable 8: Ablation studies of various modalities in the Construct module on Comp-HRDoc.\n\n| Modality | | | Micro-STEDS | Macro-STEDS |\n| --- | --- | --- | --- | --- |\n| Text | | Image | | |\n| w/o Section Number | with Section Number | | | |\n| $\\checkmark$ | | | 0.6409 | 0.6834 |\n| | $\\checkmark$ | | 0.8341 | 0.8528 |\n| | | $\\checkmark$ | 0.8477 | 0.8685 |\n| $\\checkmark$ | | $\\checkmark$ | 0.8436 | 0.8640 |\n| | $\\checkmark$ | $\\checkmark$ | 0.8605 | 0.8788 |\n\nTable 9: Ablation studies of various components in TOC Relation Prediction Head on Comp-HRDoc.\n\n| Method | Level | Table of Contents Extraction | |\n| --- | --- | --- | --- |\n| | | | |\n| | | Micro-STEDS | Macro-STEDS |\n| Ours | Document | $\\mathbf{0 . 8 6 0 5}$ | $\\mathbf{0 . 8 7 8 8}$ |\n| - Sibling Finding | Document | 0.8545 | 0.8712 |\n| - Tree Insert Algorithm | Document | 0.7111 | 0.7652 |\n| - Softmax Cross Entropy Loss | Document | 0.7002 | 0.7475 |\n\nwithin our proposed system. Therefore, for documents lacking section numbers, our approach may not exhibit adequate robustness. Several failure examples are depicted in Fig. 9, with red boxes indicating incorrect predictions and green boxes signifying correct predictions. Note that these difficulties are common challenges faced by other state-of-the-art methods. Finding practical solutions to these problems will be the focus of our future work.",
"text_token_count": 530
},
{
"title": "6. Conclusion and Future Work",
"line_num": 524,
"level": 2,
"text": "## 6. Conclusion and Future Work\n\nIn this study, we perform a thorough examination of various aspects of hierarchical document structure analysis (HDSA) and propose a tree construction based approach, named Detect-Order-Construct, to simultaneously address multiple crucial subtasks in HDSA. To showcase the effectiveness of this novel framework, we design an effective end-to-end solution and uniformly define the tasks of these three stages as relation prediction problems. Moreover, to comprehensively assess the performance of different approaches, we introduce a new benchmark, termed Comp-HRDoc, which concurrently evaluates page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction. As a result, our proposed end-to-end system attains state-of-the-art performance on two large-scale document layout analysis datasets (i.e., PubLayNet and DocLayNet), a hierarchical document structure reconstruction dataset (i.e., HRDoc), and our comprehensive benchmark (i.e., Comp-HRDoc).\n\n![img-8.jpeg](img-8.jpeg)\n(a) Failure case due to incorrect recognition of section\n(b) Failure case due to the lack of section number. headings.\n\nFigure 9: Some typical failure cases of Table of Contents extraction.\n\nIn future research, we aim to broaden the scope of our framework to encompass a wider range of real-life scenarios, including contracts, financial reports, and handwritten documents. Additionally, we recognize the importance of addressing documents with graph-based logical structures for more general applications. As such, we plan to explore more robust and effective approaches to handle these complex scenarios. Our ongoing efforts are dedicated to finding a comprehensive and universal document structure analysis solution.",
"text_token_count": 333
},
{
"title": "References",
"line_num": 536,
"level": 2,
"text": "## References\n\n[1] J. Kreich, A. Luhn, G. Maderlechner, An experimental environment for model based document analysis, in: Proceedings of the International Conference on Document Analysis and Recognition, 1991, pp. 50-58.\n[2] S. Tsujimoto, H. Asada, Understanding multi-articled documents, in: Proceedings of the International Conference on Pattern Recognition, 1990, pp. 551-556.\n[3] A. Yamashita, A model based layout understanding method for the document recognition system, in: Proceedings of the International Conference on Document Analysis and Recognition, 1991, pp. 130-140.\n[4] M. Krishnamoorthy, G. Nagy, S. Seth, M. Viswanathan, Syntactic segmentation and labeling of digitized pages from technical journals, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (7) (1993) 737-747.\n[5] J. Rausch, O. Martinez, F. Bissig, C. Zhang, S. Feuerriegel, Docparser: Hierarchical document structure parsing from renderings, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 4328-4338.\n\n[6] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 2961-2969.\n[7] J. Ma, J. Du, P. Hu, Z. Zhang, J. Zhang, H. Zhu, C. Liu, Hrdoc: Dataset and baseline method toward hierarchical reconstruction of document structures, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2023, pp. 18701877 .\n[8] X. Zhong, J. Tang, A. J. Yepes, Publaynet: largest dataset ever for document layout analysis, in: Proceedings of the International Conference on Document Analysis and Recognition, 2019, pp. 1015-1022.\n[9] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, P. W. Staar, Doclaynet: A large human-annotated dataset for documentlayout analysis, arXiv preprint arXiv:2206.01062 (2022).\n[10] Z. Zhong, J. Wang, H. Sun, K. Hu, E. Zhang, L. Sun, Q. 
Huo, A hybrid approach to document layout analysis for heterogeneous document images, in: Proceedings of the International Conference on Document Analysis and Recognition, 2023, pp. 189-206.\n[11] S. Mao, A. Rosenfeld, T. Kanungo, Document structure analysis algorithms: a literature survey, in: Proceedings of Document Recognition and Retrieval X, 2003, pp. 197-207.\n[12] Y. Y. Tang, S.-W. Lee, C. Y. Suen, Automatic document processing: a survey, Pattern Recognition 29 (12) (1996) 1931-1952.\n[13] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, C. Lee Giles, Learning to extract semantic structure from documents using multimodal fully convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 5315-5324.\n[14] M. Li, Y. Xu, L. Cui, S. Huang, F. Wei, Z. Li, M. Zhou, DocBank: A benchmark dataset for document layout analysis, in: Proceedings of the International Conference on Computational Linguistics, 2020, pp. 949-960.\n[15] L. Gao, X. Yi, Z. Jiang, L. Hao, Z. Tang, ICDAR2017 competition on page object detection, in: Proceedings of the International Conference on Document Analysis and Recognition, 2017, pp. 1417-1422.\n[16] X. Yi, L. Gao, Y. Liao, X. Zhang, R. Liu, Z. Jiang, CNN based page object detection in document images, in: Proceedings of the International Conference on Document Analysis and Recognition, Vol. 1, 2017, pp. 230-235.\n[17] D. A. B. Oliveira, M. P. Viana, Fast CNN-based document layout analysis, in: Proceedings of the International Conference on Computer Vision Workshops, 2017, pp. 1173-1180.\n[18] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.\n[19] R. Girshick, Fast R-CNN, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 1440-1448.\n[20] S. Ren, K. He, R. 
Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 91-99.\n[21] Z. Cai, N. Vasconcel",
"text_token_count": 4698
}
]