On Minimizers of Minimum Density
Abstract
Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size σ, a minimizer is defined by two integers k, w ≥ 2 and a linear order on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length w+k-1, its minimal k-mer with respect to this order. A key characteristic of the minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary string. Minimizers of smaller density are preferred as they produce smaller samples with the same guarantee: each window is represented by a k-mer. The problem of finding a minimizer of minimum density for given input parameters (σ, k, w) has a huge search space of size (σ^k)! and is representable by an ILP of size on the order of σ^(k+w), which has worst-case solution time that is doubly-exponential in (k+w) under standard complexity assumptions. We solve this problem in w · 2^(σ^k + O(k)) time and provide several additional tricks reducing the practical runtime and search space. As a by-product, we describe an algorithm computing the average density of a minimizer within the same time bound. Then we propose a novel method of studying minimizers via regular languages and show how to find, via the eigenvalue/eigenvector analysis over finite automata, minimizers with the minimal density in the asymptotic case w → ∞. Implementing our algorithms, we compute the minimum density minimizers for (σ, k) ∈ {(2,2), (2,3), (2,4), (2,5), (4,2)} and all w ≥ 2. The obtained densities are compared against the average density and the theoretical lower bounds, including the new bound presented in this paper.
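To make the definitions concrete, here is a minimal Python sketch of the sliding window scheme and an empirical density estimate. It is an illustration only, not the algorithm of the paper: the function names, the choice of the lexicographic order, the leftmost tie-breaking rule, and the use of a finite random string to approximate the density are all assumptions made for this example.

```python
import random
from itertools import product


def lexicographic_rank(alphabet, k):
    """Map every k-mer over `alphabet` to its rank in lexicographic order.

    The lexicographic order is only one of the (sigma^k)! possible orders;
    it is used here purely as an example.
    """
    kmers = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    return {kmer: r for r, kmer in enumerate(kmers)}


def minimizer_positions(s, k, w, rank):
    """Return the sorted start positions of the k-mers chosen in `s`.

    Each window holds w consecutive k-mers (w + k - 1 characters). The k-mer
    of smallest rank is chosen; ties go to the leftmost occurrence, which is
    an assumption of this sketch.
    """
    n_kmers = len(s) - k + 1
    chosen = set()
    for start in range(n_kmers - w + 1):
        best = min(
            range(start, start + w),
            key=lambda i: (rank[s[i:i + k]], i),
        )
        chosen.add(best)
    return sorted(chosen)


def estimate_density(alphabet, k, w, rank, length=100_000, seed=0):
    """Estimate the density as (chosen k-mers) / (all k-mers) on one
    pseudo-random string. This approximates, but does not equal, the
    expected frequency over a random infinite string."""
    rng = random.Random(seed)
    s = "".join(rng.choice(alphabet) for _ in range(length))
    return len(minimizer_positions(s, k, w, rank)) / (length - k + 1)


if __name__ == "__main__":
    rank = lexicographic_rank("01", 2)
    print(minimizer_positions("0110", 2, 2, rank))
    print(estimate_density("01", 2, 3, rank))
```

The window loop is deliberately naive, costing time proportional to w per window, so that it mirrors the definition line by line. Since every window contributes a chosen k-mer and one k-mer can serve at most w windows, the estimate should fall between roughly 1/w and 1.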