Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

Abstract

The rapid advancement of Large Language Models has transformed scientific research workflows, including enabling the automated extraction of data directly from published literature. Most existing efforts, however, focus on extracting simple labeled key-value entities, whereas many scientific applications require more complex, hierarchically structured data. A representative example is Quantum Cascade Lasers, whose device architectures are defined by tens of interdependent parameters organized in nested layer sequences. In this work we propose a JSON-Schema Guided Information Extraction Pipeline (JSG-IE) that enables reliable extraction of deeply structured device data without model fine-tuning. By transforming extraction into a schema-constrained generation task, our approach significantly improves structural consistency and accuracy. Across 12 state-of-the-art LLMs, a properly designed JSON Schema improves performance by 5.7\% over conventional prompting, with the highest F1 score up to 83.4\%, achieved by the reasoning-enabled Kimi-k2-thinking model. Importantly, this performance enhancement is most significant for mid-tier and open-source models, where F1 gains reach as high as 24.1\%, effectively enabling these widely accessible models to achieve extraction fidelity previously restricted to much larger architectures. This framework provides a scalable path toward automated construction of high-fidelity device databases, accelerating data-driven optoelectronic design.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…