Various improvements to text fingerprinting

Mathieu Raffinot

Abstract

Let s = s_1 .. s_n be a text (or sequence) over a finite alphabet of size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s, in order to efficiently answer certain questions about this set. A substring s_i .. s_j is a maximal location for a fingerprint f in F (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1} and s_{j+1}, if defined, are not in f. The set of maximal locations in s is denoted by L (it is easy to see that |L| ≤ nσ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are called copies, and the quotient set of L under the copy relation is denoted by LC. We present new efficient exact and approximate algorithms and data structures for the following three problems: (1) compute F; (2) given a set f of distinct characters of the alphabet, decide whether f is a fingerprint in F; (3) given f, find all maximal locations of f in s.
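To make the definitions concrete, the following is a minimal brute-force sketch in Python. It is not one of the algorithms the paper presents: it simply enumerates every substring in quadratic time and applies the definition of a maximal location directly. The function names `maximal_locations` and `is_fingerprint` are hypothetical, positions are 1-indexed to match the <i,j> notation, and the empty fingerprint is ignored.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to the list of its
    maximal locations <i,j>, 1-indexed. Brute force, for illustration only."""
    n = len(s)
    locations = defaultdict(list)
    for i in range(n):
        current = set()  # alphabet of s[i..j] as j grows
        for j in range(i, n):
            current.add(s[j])
            # <i,j> is maximal if neither neighbouring character is in the fingerprint
            left_ok = i == 0 or s[i - 1] not in current
            right_ok = j + 1 == n or s[j + 1] not in current
            if left_ok and right_ok:
                locations[frozenset(current)].append((i + 1, j + 1))
    return dict(locations)


def is_fingerprint(locations, f):
    """Problem (2): every fingerprint has at least one maximal location,
    so membership in F reduces to a dictionary lookup."""
    return frozenset(f) in locations


locs = maximal_locations("abac")
print(len(locs))                   # problem (1): number of fingerprints in F
print(is_fingerprint(locs, "ca"))  # problem (2)
print(locs[frozenset("a")])        # problem (3): maximal locations of {a}
```

For s = abac, the fingerprint {a} has the two maximal locations <1,1> and <3,3>. Both spell the substring "a", so they are copies and count once in LC. The total number of maximal locations here is 7, within the bound nσ = 4 · 3 = 12.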
