Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings

Abstract

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated based on the average length of exact common substrings. In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position that have occurred since two sequences have evolved from their last common ancestor, can be estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…