CAKL: Commutative algebra k-mer learning of genomics

Abstract

Despite the availability of various sequence analysis models, comparative genomic analysis remains a challenge in genomics, genetics, and phylogenetics. Commutative algebra, a fundamental tool in algebraic geometry and number theory, has rarely been used in data and biological sciences. In this study, we introduce commutative algebra k-mer learning (CAKL) as the first nonlinear algebraic framework for analyzing genomic sequences. CAKL bridges commutative algebra, algebraic topology, combinatorics, and machine learning to establish a new mathematical paradigm for comparative genomic analysis. We evaluate its effectiveness on three tasks (genetic variant identification, phylogenetic tree analysis, and viral genome classification) that typically require alignment-based, alignment-free, and machine-learning approaches, respectively. Across eleven datasets, CAKL outperforms five state-of-the-art sequence analysis methods, particularly in viral classification, and maintains stable predictive accuracy as dataset size increases, underscoring its scalability and robustness. This work ushers in a new era in commutative algebraic data analysis and learning.
