The exact probability law for the approximated similarity from the Minhashing method

Abstract

We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman (RU) algorithm and of a modified version of it, denoted RUM. These algorithms aim at estimating the similarity index between very large texts in the context of the Web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the RUM algorithm. Some extensions are given.
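
To make the expectation claim concrete, here is a minimal sketch of generic minhashing in Python. It is not the RU or RUM algorithm studied in the paper: it assumes the idealized case of uniformly random permutations of the universe, and every name in it (`jaccard`, `minhash_similarity`, `num_perm`, the two sample texts) is hypothetical and chosen only for illustration. Under that assumption, the fraction of permutations on which two sets share the same minimum rank is a random quantity whose expectation is their Jaccard similarity.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_similarity(a, b, universe, num_perm, rng):
    """Estimate the similarity of two non-empty sets by minhashing.

    Assumption: each trial uses a uniformly random permutation of the
    universe (the idealized setting), not the hash functions used in
    practice. Returns the fraction of trials in which both sets attain
    the same minimum rank.
    """
    items = sorted(universe)  # fixed order so a seeded run is repeatable
    matches = 0
    for _ in range(num_perm):
        order = items[:]
        rng.shuffle(order)  # one random permutation of the universe
        rank = {x: i for i, x in enumerate(order)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            matches += 1
    return matches / num_perm


# Hypothetical toy texts, reduced to sets of words.
a = set("the quick brown fox".split())
b = set("the lazy brown dog".split())

rng = random.Random(0)
exact = jaccard(a, b)
estimate = minhash_similarity(a, b, a | b, 2000, rng)
print(f"exact = {exact:.4f}, minhash estimate = {estimate:.4f}")
```

With more permutations the estimate concentrates around the exact value; the paper's contribution is the exact probability law of such an estimate, which this sketch does not attempt to reproduce.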
