Lightweight Gaussian Process Inference in C++ on Metal and CUDA
Abstract
Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths (exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT) that together cover problem sizes from N=100 to N=500,000. On an Apple M4, LightGP CPU is 2.6 to 8.7× faster than GPyTorch CPU for exact GP and 1.5× faster for sparse GP at every scale tested. On an NVIDIA RTX 3060, LightGP CUDA is 2.3 to 6.7× faster than GPyTorch CUDA for exact GP up to N=2,048, with GPyTorch closing the gap at N=4,096. A fused matrix-free kernel-vector product on Metal achieves a 32× speedup over the explicit path at N=20,000 with O(N) memory, and an FFT-accelerated SKI matvec via Accelerate vDSP runs in sub-millisecond time at N=200,000. LightGP compiles as a single static library with zero external dependencies and is installable via pip install lightgp.
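To make the matrix-free conjugate-gradient idea concrete, the following is a minimal CPU-only sketch in C++17. It is not LightGP's implementation or API: the names rbf, kernel_matvec, and cg_solve, the 1-D inputs, the squared-exponential kernel, and the lengthscale and noise parameters are all assumptions chosen for illustration. It shows how (K + noise I) alpha = y can be solved while only ever holding vectors of length N.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Squared-exponential (RBF) kernel on 1-D inputs (illustrative choice).
inline double rbf(double a, double b, double lengthscale) {
    const double d = (a - b) / lengthscale;
    return std::exp(-0.5 * d * d);
}

// Returns (K + noise * I) v without storing the N x N matrix K.
// Kernel entries are recomputed on the fly, so extra memory is O(N).
Vec kernel_matvec(const Vec& x, const Vec& v, double lengthscale, double noise) {
    assert(x.size() == v.size());
    const std::size_t n = x.size();
    Vec out(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double acc = noise * v[i];
        for (std::size_t j = 0; j < n; ++j) {
            acc += rbf(x[i], x[j], lengthscale) * v[j];
        }
        out[i] = acc;
    }
    return out;
}

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain (unpreconditioned) conjugate gradients for (K + noise * I) alpha = y.
Vec cg_solve(const Vec& x, const Vec& y, double lengthscale, double noise,
             double tol = 1e-10, std::size_t max_iter = 1000) {
    assert(x.size() == y.size());
    const std::size_t n = y.size();
    Vec alpha(n, 0.0);
    Vec r = y;  // residual y - A * alpha, with alpha starting at zero
    Vec p = r;  // initial search direction
    double rs_old = dot(r, r);
    for (std::size_t it = 0; it < max_iter && std::sqrt(rs_old) > tol; ++it) {
        const Vec Ap = kernel_matvec(x, p, lengthscale, noise);
        const double step = rs_old / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            alpha[i] += step * p[i];
            r[i] -= step * Ap[i];
        }
        const double rs_new = dot(r, r);
        const double beta = rs_new / rs_old;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs_old = rs_new;
    }
    return alpha;
}
```

Each matvec in this sketch still costs O(N^2) kernel evaluations; what it avoids is the O(N^2) storage of an explicit kernel matrix. A GPU version would presumably fuse the kernel evaluation and the reduction into a single pass, but the details of how that is done on Metal or CUDA are not given in the abstract.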