Dynamic Enumeration of Similarity Joins
Abstract
This paper considers enumerating answers to similarity-join queries under dynamic updates: given two sets A, B of n points in ℝ^d, a metric φ(·), and a distance threshold r > 0, report all pairs of points (a, b) ∈ A × B with φ(a, b) ≤ r. Our goal is to store A and B in a dynamic data structure that, whenever asked, can enumerate all result pairs with a worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from A or B. We propose several efficient data structures for answering similarity-join queries in low dimensions. For exact enumeration of similarity join, we present near-linear-size data structures for the ℓ_1 and ℓ_∞ metrics with log^{O(1)} n update time and delay. We show that such a data structure is not feasible for the ℓ_2 metric for d ≥ 4. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for the ℓ_p metric, with log^{O(1)} n delay and update time. In high dimensions, we present an efficient data structure with a worst-case delay guarantee using locality-sensitive hashing (LSH).
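To make the problem statement concrete, the following is a minimal brute-force sketch of the similarity join itself, not the paper's data structure. The names `lp_distance` and `similarity_join` are hypothetical, chosen only for illustration, and the sketch assumes points are given as equal-length tuples of floats.

```python
from itertools import product
from typing import Iterable, Iterator, Sequence, Tuple

Point = Sequence[float]


def lp_distance(a: Point, b: Point, p: float) -> float:
    """l_p distance between two points; p = float('inf') gives l_infinity."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def similarity_join(
    A: Iterable[Point], B: Iterable[Point], r: float, p: float = 2.0
) -> Iterator[Tuple[Point, Point]]:
    """Yield every pair (a, b) in A x B whose l_p distance is at most r.

    Illustrative baseline only: it scans all |A| * |B| candidate pairs.
    """
    for a, b in product(A, B):
        if lp_distance(a, b, p) <= r:
            yield a, b


if __name__ == "__main__":
    A = [(0.0, 0.0), (5.0, 5.0)]
    B = [(1.0, 1.0), (5.0, 6.0)]
    for pair in similarity_join(A, B, r=2.0, p=1.0):
        print(pair)
```

This baseline offers no useful delay guarantee: when few pairs qualify, it may scan on the order of n² candidates between two consecutive reported pairs, and it keeps no state that could absorb an insertion or deletion. Those are the two properties the abstract asks a dynamic data structure to provide.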