jXBW: A Compressed Index for Structure-Aware JSONL Retrieval in Structured RAG

Abstract

Providing structured information to large language models (LLMs) improves multi-step reasoning and factual grounding, and recent retrieval-augmented generation (RAG) systems therefore reconstruct structure from retrieved text on every query. When the corpus is already structured -- as in JSON Lines (JSONL), a popular format for LLM prompts, chemical compounds, and geospatial records -- this per-query rebuilding can be replaced by direct structural retrieval. The core primitive is substructure search: finding all JSON objects in a collection that contain a given query pattern. Existing approaches index each document separately, so both index space and query time grow with the total collection size; XML-based engines add conversion overhead and semantic mismatches. We propose jXBW, a compressed index for fast substructure search over JSONL, combining three innovations: (i) a merged tree representation that consolidates repeated structures across objects, (ii) a succinct tree index based on the eXtended Burrows--Wheeler Transform (XBW), and (iii) a newly developed three-phase substructure search algorithm that runs on this index. Together they achieve query-dependent complexity: the cost is determined by query characteristics rather than collection size, in compressed space. Experiments on seven real-world datasets, including PubChem (106 compounds) and OpenStreetMap (6.6 × 106 objects), show that jXBW outperforms the strongest tree-based baseline by 16× on the smallest dataset and by up to 2,800× on the largest, and is more than 2 × 106× faster than the XQuery engine Saxon. jXBW thus brings structural retrieval over million-record JSONL collections into the sub-millisecond range.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…