Impossibility of phylogeny reconstruction from k-mer counts

Abstract

We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence length tends to infinity, we show that, for any fixed k, no statistically consistent phylogeny estimation is possible from k-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of k-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from 1 as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of k-mer count information, such as the block techniques developed in previous theoretical work.
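
To make the objects in the statement concrete, the following is a minimal Python sketch, not the paper's construction. It assumes leaf sequences are strings over the two states "0" and "1", and it takes a k-mer count to mean the number of occurrences of each length-k window over the whole sequence. The function names `kmer_counts` and `total_variation` are hypothetical, introduced here only for illustration. The second function is the standard total variation distance between two discrete distributions, the quantity the abstract bounds away from 1.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> dict[str, int]:
    """Count every length-k window of a two-state (0/1) sequence.

    Hypothetical helper: the summary keeps how often each k-mer occurs
    over the full sequence and discards where it occurs.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    windows = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 2**k possible k-mers, including those never observed.
    all_kmers = ("".join(p) for p in product("01", repeat=k))
    return {kmer: windows.get(kmer, 0) for kmer in all_kmers}


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions,
    given as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Illustrative leaf sequence (made up, not data from the paper).
leaf = "0110100110"
counts = kmer_counts(leaf, 2)
tv = total_variation({"a": 0.5, "b": 0.5}, {"a": 1.0})
```

A total variation distance of 1 would mean the two distributions can be told apart perfectly. The abstract's claim is that, for count vectors of this kind computed on the leaves of two distinct trees, the distance stays bounded away from 1 however long the sequences become.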
