LCSk++: Practical similarity metric for long strings

Mile Šikić


Abstract

In this paper we present LCSk++, a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. the large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail because of their demanding computational complexity. Recently, Benson et al. defined a similarity metric named LCSk. By relaxing the requirement that the k-length substrings should not overlap, we extend their definition into a new metric. We present an efficient algorithm that computes LCSk++ with complexity O((|X|+|Y|) log(|X|+|Y|)) for strings X and Y under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute LCSk as well, which improves on the O(|X||Y|) algorithm presented in the original LCSk paper.
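To make the metric concrete, here is a minimal Python sketch of a straightforward quadratic dynamic program. It is not the efficient algorithm the abstract refers to. It assumes, since the abstract does not spell out the definition, that LCSk++ is the maximum total length of common substrings of X and Y that appear in the same order in both strings, do not overlap, and each have length at least k. The function name `lcsk_plus_plus` and its table names are hypothetical.

```python
def lcsk_plus_plus(x: str, y: str, k: int) -> int:
    """Quadratic-time sketch of LCSk++ under the assumed definition above.

    Assumption: the score is the largest total length of ordered,
    non-overlapping common substrings, each of length at least k (k >= 1).
    """
    n, m = len(x), len(y)
    # suffix[i][j]: length of the longest common suffix of x[:i] and y[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    # best[i][j]: best score achievable using only x[:i] and y[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                suffix[i][j] = suffix[i - 1][j - 1] + 1
            # Option 1: skip the last character of x or of y.
            score = max(best[i - 1][j], best[i][j - 1])
            # Option 2: end a matched block here. Block lengths k..2k-1
            # are enough, because any longer block can be split into
            # consecutive blocks that each still have length >= k.
            longest = min(suffix[i][j], 2 * k - 1)
            for length in range(k, longest + 1):
                score = max(score, best[i - length][j - length] + length)
            best[i][j] = score

    return best[n][m]
```

With k = 1 this sketch reduces to the classic LCS recurrence. Larger k rejects isolated single-character matches, which is the behaviour that makes k-length matching attractive for long, noisy strings such as genomes.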
