Even Better Framework for min-wise Based Algorithms
Abstract
In a recent paper from SODA11 [kminwise], the authors introduced a general framework for exponential time improvement of min-wise based algorithms by defining and constructing an almost k-min-wise independent family of hash functions. Here we take it a step further and reduce the space and the degree of independence needed for representing the functions, by defining and constructing a d-k-min-wise independent family of hash functions. Surprisingly, in most cases only 8-wise independence is needed for an exponential improvement in time and space. Moreover, we bypass the Ω(log(1/ε)) independence lower bound for approximately min-wise functions [patrascu10kwise-lb], as we use an alternative definition. In addition, since the degree of independence is a small constant, the functions can be implemented efficiently. Informally, under this definition, all subsets of size d of any fixed set X have an equal probability of having hash values among the minimal k values in X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, as a truly random function clearly satisfies the definition for d=k=|X|. We define and give a time- and space-efficient construction of an approximately d-k-min-wise independent family of hash functions. The degree of independence required is optimal, i.e. only O(d) for 2 ≤ d < k = O(d/ε²), where ε ∈ (0,1) is the desired error bound. This construction can be used to improve many min-wise based algorithms, such as [sizeEstimationFramework, Datar02estimatingrarity, NearDuplicate, SimilaritySearch, DBLP:conf/podc/CohenK07], as will be discussed here. To our knowledge, such definitions for hash functions have never been studied, and no construction was given before.
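The informal definition above can be checked on a toy example. The following is only an illustrative sketch, not the construction described in this paper: it assumes the ideal case in which the hash function is a uniformly random ranking (permutation) of a small set X = {0, ..., n-1}, and it enumerates all rankings to confirm that every subset of size d lands among the k minimal values with the same probability. The function names `subset_in_bottom_k_probabilities` and `ideal_probability` are hypothetical, and the closed form C(k,d)/C(n,d) is the standard counting value for a truly random ranking rather than a formula quoted from the text.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb


def subset_in_bottom_k_probabilities(n, k, d):
    """Exact probability, for each d-subset Y of X = {0..n-1}, that every
    element of Y hashes among the k smallest values of X.

    Assumption: the "hash function" is a uniformly random permutation of X,
    i.e. the truly random case, enumerated exhaustively (small n only).
    """
    counts = {subset: 0 for subset in combinations(range(n), d)}
    total = 0
    for ranks in permutations(range(n)):  # ranks[x] = rank of h(x) within X
        total += 1
        for subset in counts:
            if all(ranks[x] < k for x in subset):
                counts[subset] += 1
    return {subset: Fraction(c, total) for subset, c in counts.items()}


def ideal_probability(n, k, d):
    """Counting argument for a truly random ranking: C(k, d) / C(n, d)."""
    return Fraction(comb(k, d), comb(n, d))


if __name__ == "__main__":
    probs = subset_in_bottom_k_probabilities(n=5, k=3, d=2)
    # Every 2-subset of a 5-element set has the same probability.
    assert set(probs.values()) == {ideal_probability(5, 3, 2)}
    print(ideal_probability(5, 3, 2))  # 3/10
```

With d = k = |X| the probability is trivially 1, which matches the remark that a truly random function satisfies the definition in that case. An approximate family would only be required to match these values up to the error bound ε.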