Faster Iterative φ Queries on the Positional BWT
Abstract
The Positional Burrows-Wheeler Transform (PBWT) is a fundamental data structure for the efficient representation and analysis of large-scale haplotype panels. For a panel of $h$ sequences $\{S_1, \ldots, S_h\}$ over $m$ sites, a key operation is the $\phi_j(i)$ query, which returns the index of the haplotype immediately preceding $S_i$ in co-lexicographic order at site $j$. Efficient support for $k$ iterative queries $\phi^1, \ldots, \phi^k$ is essential for haplotype matching and variation analysis. In this work, we introduce a simple and novel decomposition scheme that splits each haplotype row into sub-intervals, called refined segments, within which the haplotype's co-lexicographic predecessor remains unchanged across sites. We show that refined segments satisfy two key properties: (i) each segment $[b, e]$ associated with $S_i$ overlaps with at most a constant number of segments of $S_{\phi_e(i)}$, and (ii) the total number of segments is bounded by $O(r + h)$, where $r$ denotes the number of runs in the PBWT. Building on this decomposition, we present two space-time tradeoffs for supporting $k$ iterative $\phi$ queries: (i) a structure using $O((r + h) \log n)$ bits of space that answers $k$ iterative queries in $O(\log\log_w \min(m, h) + k)$ time, where $n = m \cdot h$, and (ii) a more compact structure using $O(r \log h + h \log n)$ bits of space that supports queries in $O(k \log\log_w h)$ time. Prior to our work, supporting these queries required $O((r + h) \log n)$ bits of space and $O(k \cdot \log\log_w m)$ time. Our second tradeoff is expected to be effective in practice for modern genomic datasets, where the number $h$ of haplotypes is typically much smaller than the number $m$ of sites.
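To make the φ query concrete, the following is a minimal, naive sketch and not the structure proposed in the paper. It assumes 0-based indices, a biallelic panel given as strings of "0" and "1", that the order at site j sorts haplotypes by their reversed prefix ending at site j (inclusive) with ties broken by haplotype index, and that iterative queries apply φ repeatedly at the same site. The names `prefix_arrays`, `phi`, and `iterate_phi` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Optional


def prefix_arrays(panel: List[str]) -> List[List[int]]:
    """Return arrays[j]: haplotype indices in co-lexicographic order of
    panel[i][:j+1], built by a stable partition at each site."""
    h = len(panel)
    m = len(panel[0]) if h else 0
    order = list(range(h))
    arrays = []
    for j in range(m):
        # Stable partition: haplotypes with allele 0 at site j come first,
        # and the previous relative order is kept inside each group.
        zeros = [i for i in order if panel[i][j] == "0"]
        ones = [i for i in order if panel[i][j] == "1"]
        order = zeros + ones
        arrays.append(order)
    return arrays


def phi(arrays: List[List[int]], j: int, i: int) -> Optional[int]:
    """Naive phi_j(i): the haplotype just before i in the order at site j,
    or None if i is first. Takes O(h) time because of the linear scan."""
    order = arrays[j]
    pos = order.index(i)
    return order[pos - 1] if pos > 0 else None


def iterate_phi(arrays: List[List[int]], j: int, i: int, k: int) -> List[int]:
    """Apply phi at site j up to k times, collecting each predecessor.
    Stops early when the top of the order is reached."""
    result = []
    current: Optional[int] = i
    for _ in range(k):
        current = phi(arrays, j, current)
        if current is None:
            break
        result.append(current)
    return result


if __name__ == "__main__":
    panel = ["0101", "1100", "0110", "1001"]  # toy panel: h = 4, m = 4
    arrays = prefix_arrays(panel)
    print(arrays[3])                     # order at the last site
    print(phi(arrays, 3, 0))             # predecessor of haplotype 0
    print(iterate_phi(arrays, 3, 0, 2))  # two iterative queries
```

This sketch stores all m orderings explicitly and scans linearly per query, so it uses far more space and time than the bounds stated above; it only fixes the meaning of the query that those structures answer.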