Various improvements to text fingerprinting
Abstract
Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet of size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s, in order to answer certain questions about this set efficiently. A substring s_i .. s_j is a maximal location for a fingerprint f in F (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations in s is L (it is easy to see that |L| ≤ nσ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are called copies, and the quotient set of L according to the copy relation is denoted by LC. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters of the alphabet, to answer whether f represents a fingerprint in F; (3) given f, to find all maximal locations of f in s.
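To make the definitions concrete, the following is a minimal Python sketch of a naive quadratic-time enumeration of maximal locations. It is not one of the algorithms presented in the paper; it only illustrates what F, L, and the three problems mean. The function names `maximal_locations`, `is_fingerprint`, and `locations_of` are hypothetical, and positions are reported 1-based and inclusive to match the <i,j> notation.

```python
def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to the list of its
    maximal locations <i,j>, 1-based and inclusive. Naive O(n^2) scan."""
    n = len(s)
    table = {}
    for i in range(n):
        # Character just before the candidate start, if it exists.
        left = s[i - 1] if i > 0 else None
        seen = set()
        for j in range(i, n):
            if s[j] == left:
                # From here on the left neighbour would be in the fingerprint,
                # so no substring starting at i can be maximal any more.
                break
            seen.add(s[j])
            # Right-maximal when there is no next character or it is new.
            if j + 1 == n or s[j + 1] not in seen:
                table.setdefault(frozenset(seen), []).append((i + 1, j + 1))
    return table


def is_fingerprint(table, f):
    """Problem (2): does the character set f occur as a fingerprint?"""
    return frozenset(f) in table


def locations_of(table, f):
    """Problem (3): all maximal locations of f (empty list if none)."""
    return table.get(frozenset(f), [])


table = maximal_locations("abca")   # problem (1): F is the set of keys
print(sorted("".join(sorted(f)) for f in table))
print(locations_of(table, "a"))     # the two copies of "a"
print(is_fingerprint(table, "ab"), is_fingerprint(table, "bd"))
```

For s = abca (n = 4, σ = 3), this sketch finds 7 fingerprints and 8 maximal locations, which is consistent with the bound |L| ≤ nσ = 12. The locations <1,1> and <4,4> both spell "a", so they are copies in the sense defined above.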