Extrapolation Guarantees for Perturbation Modeling Under the Additive Latent Shift Assumption
Abstract
We consider the problem of modeling the effects of perturbations like gene knockouts on measurements such as single-cell RNA counts. Given data for some perturbations, we aim to predict the distribution of measurements for new combinations of perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. We formulate the data-generating process as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. We then prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to orthogonal transformation and use this to derive extrapolation guarantees for unseen perturbations that can be expressed as linear combinations of seen ones. To estimate the model from data, we propose the perturbation distribution autoencoder (PDAE), which is trained by maximizing the distributional similarity between true and simulated perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. In support of our theoretical results, we demonstrate through simulations that PDAE can accurately predict the effects of unseen but identifiable perturbations, and showcase the method on combinatorial gene perturbation data.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.