Optimal bounds for p sensitivity sampling via 2 augmentation
Abstract
Data subsampling is one of the most natural methods to approximate a massively large data set by a small representative proxy. In particular, sensitivity sampling received a lot of attention, which samples points proportional to an individual importance measure called sensitivity. This framework reduces in very general settings the size of data to roughly the VC dimension d times the total sensitivity S while providing strong (1) guarantees on the quality of approximation. The recent work of Woodruff & Yasuda (2023c) improved substantially over the general O(-2 Sd) bound for the important problem of p subspace embeddings to O(-2 S2/p) for p∈[1,2]. Their result was subsumed by an earlier O(-2 Sd1-p/2) bound which was implicitly given in the work of Chen & Derezinski (2021). We show that their result is tight when sampling according to plain p sensitivities. We observe that by augmenting the p sensitivities by 2 sensitivities, we obtain better bounds improving over the aforementioned results to optimal linear O(-2( S+d)) = O(-2d) sampling complexity for all p ∈ [1,2]. In particular, this resolves an open question of Woodruff & Yasuda (2023c) in the affirmative for p ∈ [1,2] and brings sensitivity subsampling into the regime that was previously only known to be possible using Lewis weights (Cohen & Peng, 2015). As an application of our main result, we also obtain an O(-2μ d) sensitivity sampling bound for logistic regression, where μ is a natural complexity measure for this problem. This improves over the previous O(-2μ2 d) bound of Mai et al. (2021) which was based on Lewis weights subsampling.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.