Triple Attention Network architecture for MovieQA

Abstract

Movie question answering, or MovieQA is a multimedia related task wherein one is provided with a video, the subtitle information, a question and candidate answers for it. The task is to predict the correct answer for the question using the components of the multimedia - namely video/images, audio and text. Traditionally, MovieQA is done using the image and text component of the multimedia. In this paper, we propose a novel network with triple-attention architecture for the inclusion of audio in the Movie QA task. This architecture is fashioned after a traditional dual attention network focused only on video and text. Experiments show that the inclusion of audio using the triple-attention network results provides complementary information for Movie QA task which is not captured by visual or textual component in the data. Experiments with a wide range of audio features show that using such a network can indeed improve MovieQA performance by about 7% relative to just using only visual features.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…