On the Complexity of Finding Approximate LCS of Multiple Strings

Abstract

Finding an Approximate Longest Common Substring (ALCS) within a given set $S = \{s_1, s_2, \ldots, s_m\}$ of $m \geq 2$ strings is a key problem in computational biology, with applications such as identifying related mutations across multiple genetic sequences. We study several variants of the ALCS problem that, given integers $k$ and $t \leq m$, seek the longest string $u$ (or the longest substring $u$ of any string in $S$) that lies within distance $k$ of at least one substring in each of $t$ distinct strings from $S$. While the general problems are NP-hard, we present efficient algorithms for restricted cases under Hamming and edit distances using the LCPk and k-errata tree data structures. Our methods achieve run times of $O(N^2)$, $O(k \ell N^2)$, and $O(mN \log^k \ell)$, where $\ell$ is the length of the longest string and $N$ is the sum of the lengths of all the strings in $S$. We also establish conditional lower bounds under the Strong Exponential Time Hypothesis and extend our study to indeterminate strings.
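To make the problem statement concrete, the following is a minimal brute-force sketch of the restricted variant in which $u$ must be a substring of some string in $S$ and distance is Hamming distance. It is not the LCPk or k-errata tree method referred to in the abstract, and it runs far slower than the stated bounds. The function names (`hamming`, `within_k`, `brute_force_alcs`) are hypothetical and chosen only for illustration.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def within_k(u: str, s: str, k: int) -> bool:
    """True if some substring of s with length len(u) is within Hamming distance k of u."""
    n = len(u)
    return any(hamming(u, s[i:i + n]) <= k for i in range(len(s) - n + 1))


def brute_force_alcs(strings: List[str], k: int, t: int) -> str:
    """Longest substring u of any input string that is within Hamming
    distance k of a substring in at least t of the input strings.

    Naive enumeration for illustration only; ties are broken by the
    order in which candidates are visited.
    """
    best = ""
    for src in strings:
        for i in range(len(src)):
            # Only consider candidates strictly longer than the current best.
            for j in range(i + len(best) + 1, len(src) + 1):
                u = src[i:j]
                hits = sum(within_k(u, s, k) for s in strings)
                if hits >= t:
                    best = u
    return best


if __name__ == "__main__":
    S = ["abcde", "xbcdz", "qqcde"]
    print(brute_force_alcs(S, k=0, t=3))
    print(brute_force_alcs(S, k=1, t=3))
```

With $k = 0$ and $t = m$ this reduces to the exact longest common substring of all strings; raising $k$ or lowering $t$ relaxes the requirement and can only lengthen the answer.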
