End-to-End Facial Expression Detection in Long Videos

Abstract

Facial expression detection requires both spotting when expressions occur and recognizing which emotional category they belong to. Although the two tasks are closely related, existing approaches typically address them separately, which limits performance and robustness in real-world settings. In this work, we propose FEDN, a Facial Expression Detection Network that unifies spotting and recognition into a single detection task performed fully end-to-end. FEDN introduces two temporal attention modules: segment-level attention, which captures fine-grained local dynamics, and sliding window attention, which captures the broader temporal context. Their outputs are combined in a multi-scale temporal feature pyramid, which enables spotting of expressions of varying duration. This unified framework enables joint optimization and shared representation learning across tasks. FEDN outperforms strong baselines in both spotting and detection on three public benchmarks, demonstrating the effectiveness of unifying spotting and recognition across multiple temporal scales. Additionally, we uncover a previously unreported discrepancy between expert-annotated and self-reported emotion labels, highlighting a key challenge in expression benchmarking and motivating the development of more nuanced annotation protocols.
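
To make the two temporal ideas named above more concrete, the following is a minimal illustrative sketch, not the FEDN implementation. It assumes per-frame feature vectors are already available, uses a single attention head with no learned projections, and builds the pyramid by simple average pooling; the function names, the window size, and the pooling scheme are all assumptions chosen for illustration.

```python
import numpy as np


def sliding_window_mask(num_frames, window):
    """Boolean mask: frame i may attend to frame j only if |i - j| <= window."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def windowed_self_attention(x, window):
    """Single-head self-attention restricted to a local temporal window.

    x has shape (T, D). For brevity the features themselves serve as queries,
    keys, and values (an assumption; a real model would learn projections).
    """
    num_frames, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # Frames outside the window get -inf so they receive zero weight.
    scores = np.where(sliding_window_mask(num_frames, window), scores, -np.inf)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


def temporal_pyramid(x, levels):
    """Multi-scale features: each level halves the temporal resolution
    by averaging pairs of adjacent frames (hypothetical pooling choice)."""
    pyramid = [x]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        even = prev.shape[0] - prev.shape[0] % 2  # drop a trailing odd frame
        pooled = prev[:even].reshape(even // 2, 2, prev.shape[1]).mean(axis=1)
        pyramid.append(pooled)
    return pyramid


# Usage: 16 frames of 8-dimensional features (random stand-ins).
features = np.random.default_rng(0).normal(size=(16, 8))
attended = windowed_self_attention(features, window=2)
scales = temporal_pyramid(attended, levels=3)
```

Coarser pyramid levels summarize longer stretches of video per position, which is one plausible way to handle expressions of varying duration; the paper's actual design may differ.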
