Coresets for Multiple p Regression

Abstract

A coreset of a dataset with n examples and d features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and p linear regression with a single response are known in prior work. However, for multiple p regression where there can be m responses, there are no known constructions with size sublinear in m. In this work, we construct coresets of size O(-2d) for p<2 and O(-pdp/2) for p>2 independently of m (i.e., dimension-free) that approximate the multiple p regression objective at every point in the domain up to (1) relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an factor for all p>1. All of our bounds are nearly tight. We give two application of our results. First, we settle the number of uniform samples needed to approximate p Euclidean power means up to a (1+) factor, showing that (-2) samples for p = 1, (-1) samples for 1 < p < 2, and (1-p) samples for p>2 is tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for 1<p<2, every matrix has a subset of O(-1k) rows which spans a (1+)-approximately optimal k-dimensional subspace for p subspace approximation, which is also nearly optimal.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…