HiPS: Hierarchical PDF Segmentation of Doctrinal Legal Books
Abstract
PDF parsers have recently improved on page-level layout understanding. However, recovering a document-global section hierarchy with reliable boundaries remains brittle for deeply structured books: many systems expose only page-local heading roles, assume shallow depth, or rely on high-quality PDF tags or Table of Contents (TOC) metadata, and public gold-standard data for deep book hierarchies is scarce. We present HiPS for hierarchical PDF segmentation of doctrinal legal books and make two main contributions. First, we release a gold-standard benchmark of 49 open-access law books with 9,812 manually curated headings, hierarchy levels, and page anchors, enabling evaluation of title detection, hierarchy reconstruction, and section boundary assignment. Second, we introduce complementary segmentation pipelines: a TOC-based parser for books with reliable outline metadata and a TOC-free LLM-refined pipeline that combines OCR whitespace cues, XML typography, and local context. Across a broad comparison against open-source parsers and multimodal/LLM baselines, the TOC-based pipeline is strongest when metadata is complete, while the LLM-refined pipeline improves heading precision, deep-level recovery, and boundary quality when metadata is missing or noisy.
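To make the TOC-based idea concrete, the following is a minimal sketch, not the authors' implementation, of turning a flat outline into a section tree with page boundaries. It assumes the outline is already available as `(level, title, start_page)` tuples in reading order. The names `Section` and `build_hierarchy` are hypothetical, as is the convention that a section's `end_page` is the page on which the next heading at the same or a shallower level begins (inclusive, since a heading may start mid-page).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Section:
    level: int
    title: str
    start_page: int
    end_page: Optional[int] = None
    children: List["Section"] = field(default_factory=list)


def build_hierarchy(
    entries: List[Tuple[int, str, int]], last_page: int
) -> List[Section]:
    """Build a section tree from flat (level, title, start_page) entries.

    Assumption: entries are in reading order and level 1 is the top level.
    """
    roots: List[Section] = []
    stack: List[Section] = []  # currently open sections, outermost first

    for level, title, page in entries:
        node = Section(level, title, page)
        # A new heading closes every open section at the same or deeper level.
        while stack and stack[-1].level >= level:
            closed = stack.pop()
            closed.end_page = page
        # Attach to the nearest open ancestor, or treat as a root.
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)

    # Sections still open at the end run to the last page of the book.
    for open_section in stack:
        open_section.end_page = last_page
    return roots


# Illustrative outline (made-up titles and pages).
outline = [
    (1, "Part I", 1),
    (2, "Chapter 1", 3),
    (3, "A. Scope", 5),
    (2, "Chapter 2", 10),
    (1, "Part II", 20),
]
tree = build_hierarchy(outline, last_page=30)
```

The stack keeps only the chain of currently open ancestors, so each heading is pushed and popped at most once. A TOC-free pipeline would need to infer the `level` values first, which is the harder problem the abstract describes.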