Brain-Inspired Stochastic Joint Embedding Representation Learning
Abstract
Representation learning is one of the key research topics in machine learning, and the framework of self-supervised learning (SSL) has revolutionized computer vision. However, these approaches have not yet fully leveraged insights from biological visual processing systems. In this paper, we introduce PhiNet v2, a novel architecture that processes temporal visual input (i.e., sequences of images) without relying on strong data augmentation, enabling it to learn robust visual representations in a manner similar to human visual processing. Our learning objective is derived from variational inference. Through extensive experiments, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision representation models, including RSP and CropMAE, while retaining the ability to learn effectively from sequential input without strong data augmentation. This work represents a step toward more biologically plausible computer vision systems that process visual information in a manner more aligned with human cognitive processes.