LATS-RCA: Language Agent Tree Search for Root Cause Analysis in Microservices

Abstract

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice systems (MSS). However, existing approaches typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose the Language Agent Tree Search for RCA (LATS-RCA) in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search over root-cause hypotheses, where multiple agents iteratively analyze logs and metrics to collect evidence, and reflection scores guide the search toward the most likely root causes for abnormal services. We evaluate LATS-RCA on the open benchmark (LO2), achieving 91.3\% diagnostic accuracy and analyzing the associated computational cost. Variation among the frontier-tier LLMs (Claude Sonnet 4.5, GPT-5, and Gemini 3 Pro) is small, between 89.7\% and 91.3\%, demonstrating our approach is model-agnostic. We also conduct an exploratory study by evaluating LATS-RCA on real-world incidents from a web-hosting company's (Zoner Oy) production MSS that serves over 300,000 websites across Europe. We find that LATS-RCA correctly diagnoses 65.1\% of the production incidents on average over multiple runs. This reveals key challenges of real-world RCA, including multi-factor root causes, large-scale system complexity, and incomplete observability, which are absent from open benchmarks. Future work should develop more realistic open datasets for RCA and validate LATS-RCA with additional datasets. Our replication package is available at https://github.com/kottinov/lats-rca.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…