Optimal Phylogenetic Reconstruction from Sampled Quartets
Abstract
Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called quartets, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory and biology. Given a random sample of $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\hat{T}$ that is close to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, the empirical risk minimizer corresponds to the NP-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier works in approximation algorithms gave $(1-\varepsilon)$-approximation schemes (PTAS) for dense instances with $m=\Theta(n^4)$ quartets, or for $m=\Theta(n^2 \log n)$ quartets randomly sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=\Theta(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $\Omega(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that not only achieves a $(1-\varepsilon)$-approximation but, most importantly, recovers a tree close to $T$ in quartet distance; second, we show a new $\Theta(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new Quartet-based Embedding and Detection procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.
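To make the problem setup concrete, the following is a minimal Python sketch of the input model and the error measure only; it is not the paper's reconstruction algorithm. Several choices here are assumptions made for illustration: trees are written as nested tuples of leaf names, a quartet's topology is read off with the four-point condition on edge-count distances, the noise model flips a label to one of the other two topologies uniformly at random with probability `eta`, and all function names are hypothetical.

```python
import itertools
import random


def leaf_paths(tree, path=()):
    """Yield (leaf, root-to-leaf path) pairs for a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        yield tree, path
        return
    for i, child in enumerate(tree):
        yield from leaf_paths(child, path + (i,))


def path_distances(tree):
    """Edge-count distance between every pair of leaves."""
    paths = dict(leaf_paths(tree))
    dist = {}
    for a, b in itertools.combinations(paths, 2):
        pa, pb = paths[a], paths[b]
        common = 0  # length of the shared prefix of the two root paths
        for x, y in zip(pa, pb):
            if x != y:
                break
            common += 1
        dist[a, b] = dist[b, a] = len(pa) + len(pb) - 2 * common
    return dist


def quartet_topology(dist, a, b, c, d):
    """Return 0, 1 or 2 for the splits ab|cd, ac|bd, ad|bc.

    Four-point condition: the split induced by the tree is the pairing
    with the smallest sum of within-pair distances.
    """
    sums = (
        dist[a, b] + dist[c, d],
        dist[a, c] + dist[b, d],
        dist[a, d] + dist[b, c],
    )
    return sums.index(min(sums))


def sample_noisy_quartets(tree, m, eta, rng):
    """Draw m random quartets labeled by `tree`, each corrupted with prob. eta.

    Assumed noise model: a corrupted label is replaced by one of the two
    other topologies, chosen uniformly at random.
    """
    dist = path_distances(tree)
    leaves = sorted({leaf for pair in dist for leaf in pair})
    sample = []
    for _ in range(m):
        quartet = tuple(sorted(rng.sample(leaves, 4)))
        label = quartet_topology(dist, *quartet)
        if rng.random() < eta:
            label = rng.choice([t for t in range(3) if t != label])
        sample.append((quartet, label))
    return sample


def quartet_distance(tree1, tree2):
    """Fraction of all quartets on which two trees (same leaf set) disagree."""
    d1, d2 = path_distances(tree1), path_distances(tree2)
    leaves = sorted({leaf for pair in d1 for leaf in pair})
    quartets = list(itertools.combinations(leaves, 4))
    disagreements = sum(
        quartet_topology(d1, *q) != quartet_topology(d2, *q) for q in quartets
    )
    return disagreements / len(quartets)
```

For example, `sample_noisy_quartets(T, m=20, eta=0.1, rng=random.Random(0))` would produce the kind of training set the abstract describes, and `quartet_distance(T, T_hat)` would score a candidate output against the ground truth. The brute-force distance enumerates all quartets, so it is only meant for small illustrative trees.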