Executing as You Generate: Hiding Execution Latency in LLM Code Interpreters

Abstract

Current LLM systems are increasingly equipped with a code interpreter that executes generated code to obtain results. This works serially: the model first generates the complete code, then an interpreter executes it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. Our key observation is that an LLM, unlike a human developer, emits code tokens left to right and does not backtrack over what it has already written. This makes it possible to start executing a piece of code while later tokens are still being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present EAGER, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate EAGER across four benchmarks, seven LLMs, and three execution environments. The overlap mechanism hides almost all execution behind generation, reducing the non-overlapped portion of execution time by up to 99.8% and cutting end-to-end latency by up to 37.3% on error-free runs.
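The idea of executing code while later tokens are still being generated can be illustrated with a minimal Python sketch. This is not EAGER's implementation: the function names `completed_chunks` and `run_streaming` are hypothetical, and the chunking rule is an assumption made for illustration. The rule treats a top-level statement as complete only once a later top-level statement has started on a following line, which covers cases such as an `else:` branch arriving after an `if` body. Dynamic batching, gated execution, and early error interruption are omitted.

```python
import ast
import queue
import threading


def completed_chunks(token_stream):
    """Yield source chunks as soon as a top-level statement is known to be complete."""
    buffer = ""    # all text received so far
    emitted = 0    # number of source lines already yielded
    for token in token_stream:
        buffer += token
        if "\n" not in token:
            continue  # only re-parse when a line has just been completed
        # Ignore a trailing partial line; it may still change.
        complete = buffer[: buffer.rfind("\n") + 1]
        pending = complete.splitlines(keepends=True)[emitted:]
        try:
            body = ast.parse("".join(pending)).body
        except SyntaxError:
            continue  # a statement is still open (e.g. a def without a body)
        start = 0
        # Assumed heuristic: the last parsed statement might still grow
        # (for example, an else: clause), so only earlier ones are final.
        for stmt, nxt in zip(body, body[1:]):
            if stmt.end_lineno < nxt.lineno:
                yield "".join(pending[start:stmt.end_lineno])
                start = stmt.end_lineno
        emitted += start
    # Generation has finished: flush whatever is left.
    rest = "".join(buffer.splitlines(keepends=True)[emitted:])
    if rest.strip():
        yield rest


def run_streaming(token_stream):
    """Execute chunks on a worker thread while the stream is still being consumed."""
    namespace = {}
    work = queue.Queue()

    def worker():
        # None is the end-of-stream sentinel.
        while (chunk := work.get()) is not None:
            exec(chunk, namespace)  # error handling is omitted in this sketch

    thread = threading.Thread(target=worker)
    thread.start()
    for chunk in completed_chunks(token_stream):
        work.put(chunk)
    work.put(None)
    thread.join()
    return namespace
```

Passing any iterable of token strings to `run_streaming` returns the namespace produced by the executed chunks. Because each chunk is handed to the worker as soon as it is detected, execution of early statements can proceed while later tokens are still arriving, which is the overlap the abstract describes.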
