Massively Parallel Algorithms and Hardness for Single-Linkage Clustering Under ℓ_p-Distances
Abstract
We present massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of n input d-dimensional vectors under Hamming, ℓ_1, ℓ_2 and ℓ_∞ distances. All our algorithms run in O(log n) rounds of MPC for any fixed d and achieve (1+ε)-approximation for all distances (except Hamming, for which we show an exact algorithm). We also show constant-factor inapproximability results for o(log n)-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). Efficiency of implementation of our algorithms in Apache Spark is demonstrated through experiments on a variety of datasets exhibiting speedups of several orders of magnitude.