Asymptotically Optimal Sequential Testing with Heterogeneous LLMs

Abstract

We study a Bayesian binary sequential hypothesis testing problem with multiple large language models (LLMs). Each LLM $j$ has a per-query cost $c_j > 0$, a random waiting time with mean $\mu_j > 0$ and sub-Gaussian tails, and asymmetric accuracies: the probability of returning the correct label depends on the true hypothesis $\theta \in \{A, B\}$ and need not be the same under $A$ and $B$. This asymmetry induces two distinct information rates $(I_{j,A}, I_{j,B})$ per LLM, one under each hypothesis. The decision-maker chooses LLMs sequentially, observes their noisy binary answers, and stops when the posterior probability of one hypothesis exceeds $1-\alpha$. The objective is to minimize the sum of expected query cost and expected waiting cost, $\mathbb{E}[C_\pi] + \mathbb{E}[g(W_\pi)]$, where $C_\pi$ is the total query cost and $W_\pi$ the total waiting time under policy $\pi$, and $g$ is a polynomial function (e.g., $g(x) = x^\rho$ with $\rho \ge 1$). We prove that as the error tolerance $\alpha \to 0$, the optimal policy is asymptotically equivalent to one that uses at most two LLMs. Under asymmetric accuracies, a single-LLM policy is not generically optimal: optimality requires exploiting a two-dimensional tradeoff between information under $A$ and information under $B$. Any admissible policy induces an expected information-allocation vector in $\mathbb{R}_+^2$, and we show that, when $\alpha$ is sufficiently small, the optimal allocation lies at an extreme point of the associated convex set and hence uses at most two LLMs. We construct belief-dependent policies that first mix between two LLMs while the posterior is ambiguous, and then switch to a single "specialist" LLM once the posterior is sufficiently close to one of the hypotheses. These policies match the universal lower bound up to a $(1+o(1))$ factor as $\alpha \to 0$.
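The posterior update and stopping rule described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the tuple layout `(acc_a, acc_b, cost)` for an LLM, the reading of the information rates as Bernoulli KL divergences between the answer distributions under the two hypotheses, and the choice to query a single fixed LLM. The sketch ignores waiting times and does not implement the paper's two-LLM mixing policy. Accuracies are assumed to lie strictly between 0 and 1.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def info_rates(acc_a, acc_b):
    """Assumed per-query information rates (I_A, I_B) for one LLM.

    acc_a = P(answer is 'A' | theta = A), acc_b = P(answer is 'B' | theta = B).
    Interpreting the rates as KL divergences is an assumption of this sketch.
    """
    i_a = kl_bernoulli(acc_a, 1 - acc_b)  # answer law under A vs. under B
    i_b = kl_bernoulli(acc_b, 1 - acc_a)  # answer law under B vs. under A
    return i_a, i_b


def update_posterior(post_a, answer, acc_a, acc_b):
    """Bayes update of P(theta = A) after one noisy binary answer."""
    like_a = acc_a if answer == "A" else 1 - acc_a
    like_b = 1 - acc_b if answer == "A" else acc_b
    num = post_a * like_a
    return num / (num + (1 - post_a) * like_b)


def run_test(theta, llm, alpha, rng, prior_a=0.5, max_queries=100_000):
    """Query one fixed LLM until some posterior exceeds 1 - alpha.

    llm is a hypothetical (acc_a, acc_b, cost) tuple; max_queries is a
    safety cap added for this sketch only.
    """
    acc_a, acc_b, cost = llm
    post_a, total_cost, n = prior_a, 0.0, 0
    while max(post_a, 1 - post_a) <= 1 - alpha and n < max_queries:
        correct = rng.random() < (acc_a if theta == "A" else acc_b)
        if correct:
            answer = theta
        else:
            answer = "B" if theta == "A" else "A"
        post_a = update_posterior(post_a, answer, acc_a, acc_b)
        total_cost += cost
        n += 1
    decision = "A" if post_a >= 0.5 else "B"
    return decision, n, total_cost


if __name__ == "__main__":
    rng = random.Random(0)
    llm = (0.9, 0.6, 2.0)  # accurate under A, weaker under B (made-up numbers)
    print(info_rates(llm[0], llm[1]))
    print(run_test("A", llm, alpha=0.01, rng=rng))
```

With asymmetric accuracies such as the made-up values above, `info_rates` returns two different numbers, which is the two-dimensional tradeoff the abstract refers to: an LLM that is informative under one hypothesis may be comparatively uninformative under the other.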
