Optimal Top-k Document Retrieval

Abstract

Let 𝒟 be a collection of D documents, which are strings over an alphabet of size σ, of total length n. We describe a data structure that uses linear space and reports the k most relevant documents that contain a query pattern P, which is a string of length p, in time O(p/log_σ n + k). This is optimal in the RAM model in the general case where log D = Θ(log n), and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures... [clip] When log D = o(log n), we show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits... [clip] We also consider the dynamic scenario, where documents can be inserted into and deleted from the collection. We obtain linear space and query time O(p(log log n)^2/log_σ n + log n + k log log k), whereas insertions and deletions require O(log^(1+ε) n) time per symbol, for any constant ε > 0. Finally, we consider an extended static scenario where an extra parameter par(P,d) is defined, and the query must retrieve only documents d such that par(P,d) ∈ [τ1,τ2], where this range is specified at query time. We solve these queries using linear space and O(p/log_σ n + log^(1+ε) n + k log^ε n) time, for any constant ε > 0. Our technique is to translate these top-k problems into multidimensional geometric search problems. As a bonus, we describe some improvements to those problems.
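
To make the query semantics concrete, here is a minimal brute-force sketch of a top-k query. It is not the data structure described in the abstract: it scans every document on each query, so its cost grows with the total length n rather than matching the stated query bound. It also assumes term frequency (the number of occurrences of P in a document) as the relevance measure, which is only one possible choice. The names `count_occurrences` and `top_k_documents` are hypothetical and chosen for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (frequency, doc_id) pairs for documents containing pattern.

    Assumption: relevance is term frequency. Ties are broken by smaller doc_id.
    This is a naive baseline that scans the whole collection per query.
    """
    scored = []
    for doc_id, text in enumerate(docs):
        tf = count_occurrences(text, pattern)
        if tf > 0:  # only documents that contain the pattern qualify
            scored.append((tf, doc_id))
    # Highest frequency first; among equal frequencies, lowest doc_id first.
    return heapq.nsmallest(k, scored, key=lambda s: (-s[0], s[1]))
```

For example, `top_k_documents(["abracadabra", "banana", "cabbage", "aaa"], "a", 2)` ranks document 0 first and then document 1, which ties with document 3 on frequency and wins on the smaller identifier.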
