Chi-squared Amplification: Identifying Hidden Hubs
Abstract
We consider the following general hidden hubs model: an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs). Entries in rows outside $S$ are generated from the probability distribution $p_0 \sim N(0,\sigma_0^2)$; for each row in $S$, some $k$ of its entries are generated from $p_1 \sim N(0,\sigma_1^2)$, $\sigma_1 > \sigma_0$, and the rest of the entries from $p_0$. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a $k \times k$ submatrix. There are two well-known barriers: if $k \ge c\sqrt{n \ln n}$, just the row sums are sufficient to find $S$ in the general model. For the submatrix problem, this can be improved by a $\sqrt{\ln n}$ factor to $k \ge c\sqrt{n}$ by spectral methods or combinatorial methods. In the variant with $p_0 = \pm 1$ (with probability $1/2$ each) and $p_1 \equiv 1$, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^{0.5-\delta}$ for some $\delta > 0$, when $\sigma_1^2 > 2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least as large as $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k = n^{0.5-\delta}$ hidden hubs for any $\delta > 0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2 = 2\sigma_0^2$, we show that the general hidden hubs problem can be solved for $k \ge c\sqrt{n}(\ln n)^{1/4}$, improving on the naive row sum-based method.
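As an illustration of the model and of the naive baseline mentioned above, the following is a minimal sketch, not the algorithm of this paper. It samples a matrix from the general hidden hubs model and ranks rows by the sum of their squared entries. Using squared entries is an assumption made here because the entries are mean-zero; the function names `sample_hidden_hubs` and `top_rows_by_energy` and the parameter values are hypothetical, with `k` chosen well above the $\sqrt{n \ln n}$ barrier so that the baseline can be expected to succeed.

```python
import random


def sample_hidden_hubs(n, k, sigma0, sigma1, rng):
    """Sample an n x n matrix from the hidden hubs model.

    Returns (A, hubs), where hubs is the sorted list of special row indices.
    """
    # Background: every entry is drawn from N(0, sigma0^2).
    A = [[rng.gauss(0.0, sigma0) for _ in range(n)] for _ in range(n)]
    # Choose the k hub rows uniformly at random.
    hubs = sorted(rng.sample(range(n), k))
    for i in hubs:
        # Each hub row has its own set of k planted columns (general model,
        # not restricted to a k x k submatrix).
        for j in rng.sample(range(n), k):
            A[i][j] = rng.gauss(0.0, sigma1)
    return A, hubs


def top_rows_by_energy(A, k):
    """Baseline (assumed form): return the k rows with largest sum of squares."""
    energy = [sum(x * x for x in row) for row in A]
    order = sorted(range(len(A)), key=lambda i: energy[i], reverse=True)
    return sorted(order[:k])


rng = random.Random(0)
n, k = 400, 120  # illustrative values only
A, hubs = sample_hidden_hubs(n, k, sigma0=1.0, sigma1=2.0, rng=rng)
guess = top_rows_by_energy(A, k)
recovered = len(set(hubs) & set(guess))
print(f"recovered {recovered} of {k} hubs")
```

Shrinking `k` toward $\sqrt{n}$ in this sketch gives a feel for where the baseline starts to fail, which is the regime the paper's algorithm targets.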