From Scarce Functional Labels to Label-Aware Generation in Homologous Protein Families
Abstract
Accurately annotating and controlling protein function from sequence data remains a major challenge in protein engineering, especially when functional labels are scarce within large homologous families. Here, we study a two-stage light-supervision strategy for fine-grained functional annotation and label-aware sequence generation. First, we compare several sequence representations, including one-hot encodings, Restricted Boltzmann Machines (RBMs), and ESM2-based protein language model embeddings, for predicting intra-family specificity labels from limited supervision. By using train/test splits that explicitly reduce phylogenetic leakage, we show that ESM2-based representations do not systematically outperform family-specific RBM embeddings or even simple one-hot baselines in this regime. Second, we use the inferred annotations to train an annotation-aware RBM capable of generating artificial homologs conditioned on prescribed labels. Across several protein families, we quantify how the number and quality of available labels determine the reliability of conditional generation. Our results show that scarce annotations can support label-aware protein design when they are accurately propagated, while also highlighting the importance of phylogeny-aware evaluation for assessing functional annotation methods within homologous families.
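To make the evaluation idea concrete, here is a minimal sketch of a one-hot baseline encoding and a train/test split that keeps similar sequences on the same side. The abstract does not describe the actual splitting procedure, so everything here is an assumption for illustration: the greedy identity-threshold clustering, the function names (`one_hot`, `identity`, `cluster_split`), the 21-symbol alphabet, and the default threshold are hypothetical choices, not the authors' method.

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus the alignment gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {a: i for i, a in enumerate(ALPHABET)}


def one_hot(seqs):
    """Encode equal-length aligned sequences as an (N, L * 21) binary matrix."""
    n, length = len(seqs), len(seqs[0])
    x = np.zeros((n, length, len(ALPHABET)), dtype=np.float32)
    for i, s in enumerate(seqs):
        for j, a in enumerate(s):
            x[i, j, AA_INDEX[a]] = 1.0
    return x.reshape(n, -1)


def identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_split(seqs, threshold=0.8, test_fraction=0.3, seed=0):
    """Split indices so that whole similarity clusters land on one side.

    Greedy clustering: each sequence joins the first cluster whose
    representative it matches at >= threshold identity, otherwise it
    founds a new cluster. This reduces, but does not eliminate,
    close homologs being shared between train and test.
    """
    reps, clusters = [], []
    for i, s in enumerate(seqs):
        for c, r in enumerate(reps):
            if identity(s, seqs[r]) >= threshold:
                clusters[c].append(i)
                break
        else:
            reps.append(i)
            clusters.append([i])

    # Assign clusters in random order, filling the test set first.
    rng = np.random.default_rng(seed)
    target = test_fraction * len(seqs)
    train, test = [], []
    for c in rng.permutation(len(clusters)):
        (test if len(test) < target else train).extend(clusters[c])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    toy = ["ACDEA", "ACDEC", "WYWYW", "WYWYV", "KLMNP"]
    train_idx, test_idx = cluster_split(toy)
    print("train:", train_idx, "test:", test_idx)
    print("feature matrix shape:", one_hot(toy).shape)
```

A random split of the same sequences could place `ACDEA` in training and `ACDEC` in testing, which would let any representation score well by near-memorization. Assigning whole clusters is one simple way to make a comparison between one-hot, RBM, and ESM2 features less sensitive to that effect.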