Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

Xing Sun

Abstract

LLM agents often solve complex tasks by composing skills, making skill retrieval a front-end component of agent systems. Unlike document retrieval, top-K correctness in skill retrieval depends not only on the relevance of each query-skill pair, but also on whether the retrieved skills can work together under the query. This query-conditioned "skill compatibility" cannot be recovered from independent relevance alone. However, LLM-based synthesis pipelines already produce a useful signal for it: the LLM's own rejection decisions, which specify which skills should not be retrieved together for a given query, but are usually discarded as low-quality data. We propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) benchmark for agent skill routing. R3-Skill covers four language directions and uses LLM-rewritten queries that better approximate user requests; its test-set ground truth is verified by multiple experts. It contains 10,246 skills grouped into 8 thematic super-domains, 41,592 accepted queries, and 32,828 LLM-rejected annotations, further organized into an 8-class rejection-reason taxonomy. R3-Skill keeps this normally discarded rejection signal and uses it as compatibility supervision. On R3-Skill, we train a two-stage retriever consisting of R3-Embedding and R3-Reranker. Gradient analysis explains why this query-conditional signal is weak when injected into the tested bi-encoder objective under bilateral balancing, while a cross-encoder can use it as graded ranking supervision; R3-Skill ablations support this split. The R3-Embedding + R3-Reranker pipeline reaches Hit@1 = 0.7521, NDCG@10 = 0.8173 and Set-Compat = 0.3188 on R3-Skill. The dataset, model weights, and evaluation scripts will be open-sourced.
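The abstract reports Hit@1 and NDCG@10 for the R3-Embedding + R3-Reranker pipeline. As a rough illustration of how these two standard ranking metrics are typically computed for a single query, here is a minimal sketch. It assumes binary relevance (a skill is either in the ground-truth set or not), which may differ from the paper's own evaluation scripts. The skill identifiers and function names are hypothetical. Set-Compat is not defined in the abstract, so it is not sketched here.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any ground-truth skill appears in the top-k, else 0.0."""
    return 1.0 if any(skill in relevant for skill in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k under binary relevance (an assumption, not the paper's definition)."""
    # Discounted gain: a relevant skill at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    # Ideal ranking places all relevant skills first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical skill ids for one query; the retriever's ranked output comes first.
ranked_skills = ["pdf_reader", "table_extractor", "web_search"]
ground_truth = {"table_extractor"}

hit1 = hit_at_k(ranked_skills, ground_truth, k=1)
ndcg10 = ndcg_at_k(ranked_skills, ground_truth, k=10)
```

In this toy query the only ground-truth skill sits at rank 2, so Hit@1 is 0 while NDCG@10 still gives partial credit. Corpus-level figures such as those in the abstract would be averages of per-query values like these.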
