Evaluating LLM-generated code for domain-specific languages: molecular dynamics with LAMMPS

Alejandro Strachan

Abstract

Large language models (LLMs) are changing the way researchers interact with code and data in scientific computing. While their ability to generate general-purpose code is well established, their effectiveness in producing scientifically valid scripts for domain-specific languages (DSLs) remains largely unexplored. We propose an evaluation procedure that enables domain experts to assess the validity of LLM-generated input files for LAMMPS, a widely used molecular dynamics (MD) code, without requiring deep familiarity with its syntax. The evaluation procedure combines a normalization step that produces canonical input files with an extensible parser for syntax analysis, followed by a reduced-cost execution stage and accuracy checks that isolate common errors before running costly simulations. We apply the pipeline to eight state-of-the-art LLMs across three prompts of increasing complexity. The parser pass rate has improved from 74% to 91% over the past year, but scientific accuracy on coupled multi-step workflows remains limited. Across all 80 scripts evaluated on the most complex prompt, only one was fully correct as generated. We further package the automated stages as a reusable agentic skill that LLMs can invoke during script generation; in a small-scale demonstration, this skill helped two models produce five fully correct scripts out of six across the same three prompts, including the hardest one. The pipeline highlights both the limitations of current LLMs in generating scientific DSLs and a practical path toward integrating them into domain-specific computational ecosystems.
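To make the first two stages concrete, the following is a minimal sketch of what a normalization step and a first-pass syntax check might look like. It is not the paper's implementation. It assumes only two well-known features of LAMMPS input syntax: a trailing `&` continues a command on the next line, and `#` begins a comment. The function names `normalize_lammps` and `unknown_commands`, and the small `KNOWN_COMMANDS` set, are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: function names and the command set are
# assumptions, not the pipeline described in the abstract.

# A deliberately small subset of real LAMMPS command names.
KNOWN_COMMANDS = {
    "units", "atom_style", "lattice", "region", "create_box",
    "create_atoms", "mass", "pair_style", "pair_coeff", "velocity",
    "fix", "thermo", "timestep", "run",
}


def normalize_lammps(text: str) -> list[str]:
    """Return canonical command lines.

    Continuation lines (trailing '&') are joined, comments are dropped,
    whitespace is collapsed, and empty lines are removed.
    """
    joined = []
    buffer = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("&"):
            # Keep accumulating until a line without a trailing '&'.
            buffer += line[:-1] + " "
            continue
        joined.append(buffer + line)
        buffer = ""
    if buffer:
        joined.append(buffer)

    canonical = []
    for line in joined:
        # Naive comment stripping: does not handle '#' inside quotes.
        line = line.split("#", 1)[0]
        tokens = line.split()
        if tokens:
            canonical.append(" ".join(tokens))
    return canonical


def unknown_commands(lines: list[str], known: set[str]) -> list[str]:
    """Return the leading token of every line whose command is not in `known`."""
    return [line.split()[0] for line in lines if line.split()[0] not in known]


script = "units metal  # unit system\npair_coeff * * &\n   Cu.eam\nrunn 100\n"
canonical = normalize_lammps(script)
flagged = unknown_commands(canonical, KNOWN_COMMANDS)
```

Here `canonical` holds one whitespace-normalized command per entry, and `flagged` contains the misspelled `runn`. A check like this only catches unrecognized command names; validating arguments, command ordering, and physical correctness would need the later stages the abstract describes.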
