Efficient Computation of Sequence Mappability

Abstract

In the (k,m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose i-th entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k=O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k,m)-mappability tables for a fixed m and all k ∈ {0,…,m} or a fixed k and all m ∈ {k,…,n}. Finally, we show that, for k,m = Θ(log n), the (k,m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper that was presented at SPIRE 2018.
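
To make the problem definition concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms from the paper: it compares every pair of length-m substrings directly, so it takes O(n^2 · m) time, and it is useful only as a reference for checking faster implementations on small inputs. The function name `mappability_naive` and the 0-based indexing are assumptions made for this illustration.

```python
def mappability_naive(text, m, k):
    """Brute-force (k, m)-mappability table (0-indexed), O(n^2 * m) time.

    Entry i counts the positions j != i whose length-m substring differs
    from the length-m substring at i in at most k positions.
    """
    starts = len(text) - m + 1  # number of length-m substrings
    if m <= 0 or starts <= 0:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            # Hamming distance between text[i:i+m] and text[j:j+m].
            mismatches = sum(a != b for a, b in zip(text[i:i + m], text[j:j + m]))
            if mismatches <= k:
                # The relation is symmetric, so credit both positions.
                table[i] += 1
                table[j] += 1
    return table


print(mappability_naive("abab", 2, 0))  # [1, 0, 1]
```

For "abab" with m=2 and k=0, the substrings are "ab", "ba", "ab"; only the two copies of "ab" match each other, which gives the table shown in the comment.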

