Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
Abstract
A fundamental problem arising in many applications in Web science and social network analysis is, given an arbitrary approximation factor c>1, to output a set S of nodes that with high probability contains all nodes of PageRank at least Δ, and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal, local algorithm for the problem with runtime complexity Õ(n/Δ) on networks with n nodes. We show that any algorithm for solving this problem must have runtime of Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm comes with two main technical contributions. The first is a multi-scale sampling scheme for a basic matrix problem that could be of interest on its own. In the abstract matrix problem it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost proportional to 1/ε, the query will return a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Δ, and omits any column whose sum is less than Δ/c. Our multi-scale sampling scheme solves this problem with cost Õ(n/Δ), while traditional sampling algorithms would take time Θ((n/Δ)²). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [JehW03, AndersenCL06] and is highly efficient particularly for networks with large in-degrees or out-degrees. Together with our multi-scale sampling scheme, we are able to optimally solve the SignificantPageRanks problem.
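To make the two-sided guarantee of the abstract matrix problem concrete, the following is a minimal Python sketch of a brute-force reference and an output checker. It is not the multi-scale sampling scheme described above: it reads every entry of a fully known matrix, whereas the paper's setting only allows approximate row queries. The function names (column_sums, is_valid_output) and the 4-by-4 example matrix are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set


def column_sums(matrix: List[List[float]]) -> List[float]:
    """Exact column sums of a square matrix, reading every entry."""
    sums = [0.0] * len(matrix)
    for row in matrix:
        for j, value in enumerate(row):
            sums[j] += value
    return sums


def is_valid_output(selected: Set[int], sums: List[float],
                    delta: float, c: float) -> bool:
    """Check the guarantee: every column with sum >= delta is selected,
    and no column with sum < delta / c is selected. Columns in between
    may be either included or omitted."""
    must_include = {j for j, s in enumerate(sums) if s >= delta}
    must_exclude = {j for j, s in enumerate(sums) if s < delta / c}
    return must_include <= selected and not (selected & must_exclude)


# Hypothetical right-stochastic matrix: every row sums to 1.
matrix = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.25, 0.0, 0.75, 0.0],
]
sums = column_sums(matrix)

# With delta = 1 and c = 3, column 2 must be reported, column 0 must not,
# and columns 1 and 3 fall in the gap where either answer is acceptable.
print(is_valid_output({2}, sums, delta=1.0, c=3.0))
print(is_valid_output({1, 2, 3}, sums, delta=1.0, c=3.0))
print(is_valid_output({0, 2}, sums, delta=1.0, c=3.0))
```

Because each row sums to 1, the column sums of an n-by-n right-stochastic matrix total n, so at most n/Δ columns can have sum at least Δ. The gap between Δ and Δ/c is what leaves room for an approximate, sampling-based answer.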