Numerically Stable Cholesky-QR on GPU via Mixed-Precision Randomized Preconditioning
Abstract
Cholesky-QR is among the fastest algorithms for computing the thin QR factorization of tall-and-skinny matrices on GPUs, relying entirely on BLAS-3 operations. However, it is numerically unstable: forming the Gram matrix squares the condition number, causing breakdown when $\kappa_2(A) \gtrsim 10^8$. We present MRCQR (Mixed-Precision Randomized Cholesky-QR), a stable GPU algorithm that addresses this limitation. MRCQR uses a subsampled randomized trigonometric transform to construct a preconditioner $R_s$ that reduces $\kappa_2(A R_s^{-1})$ to near unity with high probability, then applies Cholesky-QR in double precision to the preconditioned matrix. The key insight, supported by perturbation analysis, is that the preconditioner requires far less accuracy than the final result: single precision (FP32) suffices when $\kappa_2(A) \lesssim 10^8$, and half precision (FP16) when $\kappa_2(A) \lesssim 10^4$. MRCQR produces an explicit orthogonal factor $Q$ satisfying $\|I - Q^\top Q\|_2 = O(u)$, where $u \approx 10^{-16}$ is the double-precision unit roundoff, for condition numbers up to $10^{16}$, far beyond the $10^8$ limit of CholQR2. Experiments on an NVIDIA H100 GPU show that MRCQR (FP16) outperforms rand-cholQR by 1.4–1.8× across all tested column counts and is 1.8–13.5× faster than cuSOLVER geqrf, while the FP16 sketch (used when $\kappa_2(A) \lesssim 10^4$) is 2× cheaper than FP64 at no accuracy cost.
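The mechanism described in the abstract can be illustrated with a small CPU-only NumPy sketch. This is not the authors' GPU implementation, and it rests on several assumptions: a dense Gaussian sketch stands in for the subsampled randomized trigonometric transform, the oversampling factor of 4 is a hypothetical choice, the sketch is computed in FP32 only, a general solver replaces the triangular solve, and all function names are illustrative.

```python
import numpy as np

def cholqr(A):
    """Plain Cholesky-QR: A = Q R, with R taken from the Cholesky factor of A^T A."""
    G = A.T @ A                       # Gram matrix; kappa_2(G) = kappa_2(A)^2
    R = np.linalg.cholesky(G).T       # upper-triangular factor, G = R^T R
    # Q = A R^{-1}. A general solver is used here for brevity; a GPU code
    # would use a triangular solve instead.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R

def preconditioned_cholqr(A, rng, oversample=4, sketch_dtype=np.float32):
    """Randomized preconditioning followed by Cholesky-QR in double precision.

    Assumption: a Gaussian sketch replaces the paper's trigonometric transform,
    and `oversample` is a hypothetical parameter, not a value from the paper.
    """
    m, n = A.shape
    k = oversample * n
    S = rng.standard_normal((k, m)) / np.sqrt(k)
    # Low-precision part: form the sketch S A and take its R factor in FP32.
    SA = S.astype(sketch_dtype) @ A.astype(sketch_dtype)
    R_s = np.linalg.qr(SA, mode="r").astype(np.float64)
    # High-precision part: precondition, then run Cholesky-QR in FP64.
    A_pre = np.linalg.solve(R_s.T, A.T).T   # A R_s^{-1}, now well conditioned
    Q, R_c = cholqr(A_pre)
    return Q, R_c @ R_s                     # A = Q (R_c R_s)

def make_matrix(m, n, cond, rng):
    """Test matrix with log-spaced singular values and kappa_2 = cond."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)
    return (U * s) @ V.T

def orth_error(Q):
    """Loss of orthogonality ||I - Q^T Q||_2."""
    return np.linalg.norm(np.eye(Q.shape[1]) - Q.T @ Q, 2)

rng = np.random.default_rng(0)
A = make_matrix(2000, 20, 1e5, rng)

Q_plain, _ = cholqr(A)
Q_pre, R_pre = preconditioned_cholqr(A, rng)

err_plain = orth_error(Q_plain)
err_pre = orth_error(Q_pre)
print(f"plain CholQR          ||I - Q^T Q||_2 = {err_plain:.2e}")
print(f"preconditioned CholQR ||I - Q^T Q||_2 = {err_pre:.2e}")
```

The test matrix uses a moderate condition number so that plain Cholesky-QR still completes and the two orthogonality errors can be compared directly. The FP32 sketch only has to capture the conditioning of `A` approximately, because the final factorization is carried out in double precision on the preconditioned matrix.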