Col-Bandit: Query-Time Top-K Estimation for Late-Interaction Retrieval

Abstract

Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art quality, but their query-time cost is dominated by exhaustively computing token-level MaxSim interactions for every candidate document. The MaxSim scores of N candidates against T query tokens form an N×T matrix whose row sums are the late-interaction scores, and identifying the top-K rarely requires every entry. We introduce Col-Bandit, a query-time estimator of the exhaustive-MaxSim top-K. It reveals matrix entries in batches, maintains a finite-population Bernstein-Serfling confidence interval on each candidate's score, and permanently drops any document whose upper bound falls below the K-th largest lower bound, so it computes only the cells needed to separate the top-K. A single relaxation knob α_ef ∈ (0, 1] tunes the compute-fidelity trade-off. We deploy α_ef = 0.2, while α_ef = 1 admits a δ-PAC guarantee under a simplified radius. On BEIR and REAL-MM-RAG, Col-Bandit preserves ≥ 90% fidelity to the exhaustive top-5 on every corpus while cutting MaxSim FLOPs by up to 8×, yielding up to 13× single-thread CPU speedups on both x86 and ARM. As a drop-in reranking layer, it requires no retraining or index changes.
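The elimination loop described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's estimator. The following are all hypothetical choices made for illustration: the cell oracle `cell(i, t)`, the function names, the batch size, the constant inside the logarithm, the hard value range [lo, hi] = [-1, 1] (cosine similarities of normalized embeddings), and the assumption that α_ef simply scales the confidence radius. The sketch reveals cells in random token order per document, bounds each row sum by combining a Bernstein-Serfling-style radius with the hard range, and drops any document whose upper bound falls below the K-th largest lower bound.

```python
import math
import numpy as np

def bs_radius(values, T, delta, width, alpha):
    """Hypothetical Bernstein-Serfling-style radius on a row's mean cell value,
    from n cells sampled without replacement out of T (constants illustrative)."""
    n = len(values)
    sigma = float(np.std(values))          # empirical std of revealed cells
    log_term = math.log(3.0 / delta)       # assumed constant, not from the paper
    rho = 1.0 - (n - 1) / T                # Serfling finite-population factor
    r = sigma * math.sqrt(2.0 * rho * log_term / n) + width * log_term / n
    return alpha * r                       # assumption: alpha_ef scales the radius

def col_bandit_topk(cell, N, T, K, batch=2, delta=0.05, alpha=0.2,
                    lo=-1.0, hi=1.0, seed=0):
    """Successive elimination over an N x T MaxSim matrix revealed lazily.
    cell(i, t) returns the MaxSim value (assumed in [lo, hi]) for doc i and
    query token t. Returns (top-K doc ids, number of cells computed)."""
    K = min(K, N)
    rng = np.random.default_rng(seed)
    order = [rng.permutation(T) for _ in range(N)]  # random token order per doc
    seen = [[] for _ in range(N)]                   # revealed values per doc
    active = set(range(N))
    computed = 0
    while True:
        # 1. Reveal the next batch of cells for every active, incomplete doc.
        for i in active:
            n = len(seen[i])
            for t in order[i][n:n + batch]:
                seen[i].append(float(cell(i, int(t))))
                computed += 1
        # 2. Bound each active doc's row sum S_i = sum_t M[i, t].
        lcb, ucb = {}, {}
        for i in active:
            vals = np.asarray(seen[i])
            rest = T - len(vals)
            known = float(vals.sum())
            if rest == 0:                          # fully revealed: exact score
                lcb[i] = ucb[i] = known
                continue
            mean = float(vals.mean())
            r = bs_radius(vals, T, delta, hi - lo, alpha)
            # Clip the statistical bound with the hard [lo, hi] cell range.
            lcb[i] = known + rest * max(lo, mean - r)
            ucb[i] = known + rest * min(hi, mean + r)
        # 3. Drop docs whose upper bound is below the K-th largest lower bound.
        tau = sorted(lcb.values(), reverse=True)[K - 1]
        active = {i for i in active if ucb[i] >= tau}
        done = all(len(seen[i]) == T for i in active)
        if len(active) <= K or done:
            break
    # Rank survivors by their point estimate of the row sum.
    est = {i: float(np.mean(seen[i])) * T for i in active}
    top = sorted(active, key=lambda i: est[i], reverse=True)[:K]
    return top, computed
```

For example, with a synthetic 50×32 matrix in which five rows are clearly stronger, the sketch recovers those five rows after revealing only a fraction of the 1,600 cells. Docs that survive until every cell is revealed are ranked by their exact score. Because at least K documents always have upper bounds at or above τ, the active set never shrinks below K.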
