Constructing Antidictionaries in Output-Sensitive Space

Abstract

A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{y_1\#\ldots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\ldots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{y_1\#\ldots\#y_N}\| = o(n)$ for all $N \in [1,k]$. For instance, in the human genome, $n \approx 3 \times 10^9$ but $\|M^{12}_{y_1\#\ldots\#y_k}\| \approx 10^6$. We consider a constant-sized alphabet for stating our results. We show that all of $M^{\ell}_{y_1}, \ldots, M^{\ell}_{y_1\#\ldots\#y_k}$ can be computed in $O\big(kn + \sum_{N=1}^{k} \|M^{\ell}_{y_1\#\ldots\#y_N}\|\big)$ total time using $O(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MaxOut} = \max\{\|M^{\ell}_{y_1\#\ldots\#y_N}\| : N \in [1,k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
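To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the output-sensitive algorithm of the paper: it stores every factor of length at most the bound, so its space usage is not output-sensitive. The function names `factors_up_to` and `minimal_absent_words` are hypothetical, chosen only for this illustration. The sketch relies on the observation that a word $aub$ (with letters $a$, $b$) is a minimal absent word exactly when $au$ and $ub$ both occur but $aub$ does not; factors are collected per word, so none spans a separator $\#$.

```python
def factors_up_to(word, max_len):
    """Return all non-empty factors (substrings) of `word` of length <= max_len."""
    n = len(word)
    return {
        word[i:j]
        for i in range(n)
        for j in range(i + 1, min(i + max_len, n) + 1)
    }


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len for a collection.

    Illustrative only: this keeps every short factor in memory, unlike the
    output-sensitive approach described in the abstract.
    """
    # Factors are gathered word by word, so no factor crosses a '#' boundary.
    occurring = set()
    for w in words:
        occurring |= factors_up_to(w, max_len)

    # Length-1 case: a letter that never occurs (its only proper factor is
    # the empty word, which always occurs).
    maws = {a for a in alphabet if a not in occurring}

    # Length >= 2: extend an occurring factor `left` = au by one letter b.
    # The candidate aub is minimal absent iff it is absent and ub occurs.
    for left in occurring:
        if len(left) >= max_len:
            continue
        for b in alphabet:
            candidate = left + b
            if candidate not in occurring and candidate[1:] in occurring:
                maws.add(candidate)
    return maws


if __name__ == "__main__":
    # Sets for the growing collection y_1, y_1#y_2, recomputed from scratch.
    collection = ["ab", "ba"]
    for N in range(1, len(collection) + 1):
        print(N, sorted(minimal_absent_words(collection[:N], "ab", 3)))
```

For the single word `abaab` over the alphabet {a, b} with length bound 4, this sketch returns the four words `aaa`, `aaba`, `bab`, and `bb`.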
