EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Abstract

Do Video-LLMs maintain a consistent temporal understanding when videos capture the same event from different viewpoints? To study this question, we introduce EgoExo-Con(sistency), a benchmark of synchronized egocentric and exocentric video pairs with human-refined queries that ensure all concepts are visible in both viewpoints. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; and (2) when naively finetuned on synchronized videos of both viewpoints, models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates superior temporal understanding, especially in cross-view consistency. All resources are available at https://minjoong507.github.io/projects/EgoExo-Con/
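
To make the distinction between per-view correctness and cross-view consistency concrete, the following is a minimal sketch and not the benchmark's official scoring code. It assumes that a prediction on a synchronized pair counts as consistent only when it is correct in both the egocentric and the exocentric view, and that a grounding prediction is correct when its temporal IoU with the reference span reaches a threshold. The names `PairResult`, `temporal_iou`, `grounding_correct`, and `summarize`, as well as the 0.5 threshold, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class PairResult:
    """Correctness of one query on a synchronized ego/exo video pair."""
    ego_correct: bool
    exo_correct: bool


def temporal_iou(pred: Span, ref: Span) -> float:
    """Intersection-over-union of two time spans (0.0 if they do not overlap)."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_correct(pred: Span, ref: Span, threshold: float = 0.5) -> bool:
    """Assumed correctness rule for grounding: IoU at or above a threshold."""
    return temporal_iou(pred, ref) >= threshold


def summarize(results: List[PairResult]) -> Dict[str, float]:
    """Per-view accuracy plus the fraction of pairs correct in BOTH views."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "ego_acc": sum(r.ego_correct for r in results) / n,
        "exo_acc": sum(r.exo_correct for r in results) / n,
        # Assumed consistency definition: correct in both viewpoints.
        "consistency": sum(r.ego_correct and r.exo_correct for r in results) / n,
    }


if __name__ == "__main__":
    demo = [
        PairResult(True, True),
        PairResult(True, False),
        PairResult(False, True),
        PairResult(True, True),
    ]
    print(summarize(demo))
```

Under this assumed definition, the consistency score can never exceed the lower of the two single-view accuracies, which is one way to read the observation that consistency results fall well below single-view performance.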
