Almost Linear Size Edit Distance Sketch
Abstract
Edit distance is an important measure of string similarity. It counts the number of insertions, deletions and substitutions one has to make to a string x to get a string y. In this paper we design an almost linear-size sketching scheme for computing edit distance up to a given threshold k. The scheme consists of two algorithms, a sketching algorithm and a recovery algorithm. The sketching algorithm depends on the parameter k, takes as input a string x and a public random string, and computes a sketch sk(x;k), which is a digested version of x. The recovery algorithm is given two sketches sk(x;k) and sk(y;k) as well as the public random string used to create the two sketches. With high probability, if the edit distance ED(x,y) between x and y is at most k, it outputs ED(x,y) together with an optimal sequence of edit operations that transforms x to y, and if ED(x,y) > k, it outputs LARGE. The size of the sketch output by the sketching algorithm on input x is $k \cdot 2^{O(\sqrt{\log(n)\log\log(n)})}$ (where n is an upper bound on the length of x). The sketching and recovery algorithms both run in time polynomial in n. The dependence of sketch size on k is information-theoretically optimal and improves over the quadratic dependence on k in schemes of Kociumaka, Porat and Starikovskaya (FOCS'2021), and Bhattacharya and Koucký (STOC'2023).
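To make the recovery algorithm's output contract concrete, here is a minimal Python sketch of a reference procedure that, given both strings in full, returns ED(x,y) with an optimal edit script when ED(x,y) ≤ k and LARGE otherwise. This is not the paper's sketching scheme: it reads x and y directly rather than their sketches, uses the textbook quadratic-time dynamic program, and the names `bounded_edit_distance` and `LARGE`, as well as the tuple format of the edit operations, are assumptions made here for illustration.

```python
LARGE = "LARGE"  # hypothetical sentinel mirroring the abstract's LARGE output


def bounded_edit_distance(x: str, y: str, k: int):
    """Return (ED(x, y), edit_script) if ED(x, y) <= k, otherwise LARGE.

    Illustrative baseline only: it works on the full strings, not on sketches.
    """
    n, m = len(x), len(y)
    # A length gap larger than k already forces more than k insertions/deletions.
    if abs(n - m) > k:
        return LARGE

    # dp[i][j] = edit distance between the prefixes x[:i] and y[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete x[i-1]
                dp[i][j - 1] + 1,         # insert y[j-1]
                dp[i - 1][j - 1] + cost,  # match or substitute
            )

    if dp[n][m] > k:
        return LARGE

    # Trace back one optimal sequence of edit operations.
    # Positions refer to indices in the original string x.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no operation needed
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, y[j - 1]))
            j -= 1
    ops.reverse()
    return dp[n][m], ops


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```

The point of the sketching scheme described in the abstract is to produce the same kind of answer from two short sketches instead of from x and y themselves, which this baseline does not attempt.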