LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation

Abstract

Protein sequence generation for engineering requires samples that are biophysically plausible and, when targeting a family/domain, remain recognizable members while exploring within-family diversity. Current discrete generative models typically start from uniform or masked-token noise, which discards strong position-specific constraints induced by evolution and forces the model to reconstruct conserved residues from scratch, leading to weak family control and low plausibility. We propose LineageFlow, a Dirichlet flow-matching model that initializes generation from lineage priors derived from ancestral sequence reconstruction, turning generation into structured mutation from an evolved scaffold. Across diverse protein families, LineageFlow achieves family validity close to held-out natural sequences and improves predicted structural confidence over uniform-/mask-initialized baselines while maintaining substantial novelty and diversity. Finally, we introduce rerouting, a single intermediate-time mutate-select-amplify intervention that enables objective-guided sampling without per-step predictor guidance and yields further gains in plausibility, including a zero-shot enzyme generation case study. Code is available at https://github.com/Jinx-byebye/LineageFlow.
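
The two ideas named in the abstract, starting from a position-specific prior rather than uniform noise and a single mutate-select-amplify round, can be illustrated with a minimal Python sketch. This is not the LineageFlow implementation: the abstract does not specify the parameterization, so the function names, the `concentration` parameter, the toy prior, and the scoring function are all hypothetical, and the learned flow itself is omitted. For simplicity the rerouting step here acts on decoded sequences, whereas the paper applies it at an intermediate time of the generative process.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def init_from_prior(position_probs, concentration, rng):
    """Sample one simplex point per position, centered on a prior row.

    position_probs: list of length-20 probability rows (assumed here to
    stand in for ancestral-reconstruction posteriors). A uniform row gives
    the uninformative start used by uniform-noise baselines.
    """
    eps = 1e-3  # keeps every Dirichlet parameter strictly positive
    return [
        sample_dirichlet([concentration * p + eps for p in row], rng)
        for row in position_probs
    ]


def decode(state):
    """Argmax-decode a list of simplex points into a sequence string."""
    return "".join(
        ALPHABET[max(range(len(row)), key=row.__getitem__)] for row in state
    )


def reroute(sequences, score_fn, rng, n_mutants=4, keep=2):
    """One mutate-select-amplify round (hypothetical, sequence-level)."""
    pool = list(sequences)
    for seq in sequences:  # mutate: random single-site substitutions
        for _ in range(n_mutants):
            i = rng.randrange(len(seq))
            pool.append(seq[:i] + rng.choice(ALPHABET) + seq[i + 1:])
    best = sorted(pool, key=score_fn, reverse=True)[:keep]  # select
    # amplify: refill the batch to its original size from the survivors
    return [best[i % len(best)] for i in range(len(sequences))]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy prior that strongly favors M, K, V at positions 0, 1, 2.
    prior = [[0.905 if a == c else 0.005 for a in ALPHABET] for c in "MKV"]
    start = decode(init_from_prior(prior, concentration=50.0, rng=rng))
    # Toy objective (an assumption): count alanines.
    batch = reroute([start] * 4, lambda s: s.count("A"), rng)
    print(start, batch)
```

With a sharply peaked prior the initial state already decodes to the conserved residues, so downstream sampling only has to propose changes around that scaffold. The rerouting round keeps the original sequences in the candidate pool, so the best score in the batch cannot decrease.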
