All in an Aggregated Image for In-Image Learning
Abstract
This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I2L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into an aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) in multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into language models, I2L consolidates all information into a single aggregated image and relies on the model's image processing, understanding, and reasoning abilities. This has several advantages: it reduces inaccurate textual descriptions of complex images, provides flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I2L-Hybrid, a method that combines the strengths of I2L with other ICL methods. Specifically, it uses an automatic strategy to select the most suitable method (I2L or another ICL method) for each task instance. We conduct extensive experiments to assess the effectiveness of I2L and I2L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations in the aggregated image affect the effectiveness of I2L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
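To make the idea of an aggregated image more concrete, the following is a minimal sketch of how demonstration tiles and a test query might be arranged on one canvas. It is not the authors' implementation: the abstract does not describe the construction, so the grid arrangement, the `Box` type, and the functions `grid_layout` and `place_query` are all hypothetical names chosen for illustration. The sketch only computes pixel positions and does not render anything.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Box:
    """Pixel rectangle (left, top, right, bottom) on the aggregated canvas."""
    left: int
    top: int
    right: int
    bottom: int


def grid_layout(num_tiles, columns, tile_w, tile_h, margin=0):
    """Return ((canvas_w, canvas_h), boxes) for tiles placed row by row.

    Assumption: every tile (demonstration or query) has the same size
    and tiles are separated by a uniform margin.
    """
    if num_tiles < 1 or columns < 1:
        raise ValueError("num_tiles and columns must be positive")
    rows = ceil(num_tiles / columns)
    canvas_w = columns * tile_w + (columns + 1) * margin
    canvas_h = rows * tile_h + (rows + 1) * margin
    boxes = []
    for i in range(num_tiles):
        row, col = divmod(i, columns)
        left = margin + col * (tile_w + margin)
        top = margin + row * (tile_h + margin)
        boxes.append(Box(left, top, left + tile_w, top + tile_h))
    return (canvas_w, canvas_h), boxes


def place_query(num_demos, query_position):
    """Order tile labels so the test query sits at index query_position."""
    if not 0 <= query_position <= num_demos:
        raise ValueError("query_position must be between 0 and num_demos")
    labels = [f"demo_{i}" for i in range(num_demos)]
    labels.insert(query_position, "query")
    return labels


# Three demonstrations plus the query, two tiles per row.
labels = place_query(num_demos=3, query_position=3)
canvas_size, boxes = grid_layout(len(labels), columns=2,
                                 tile_w=100, tile_h=80, margin=10)
```

Changing `columns`, the tile size, or `query_position` corresponds loosely to the factors the abstract says are studied (number of demonstrations per image, resolution, and demonstration position). With an imaging library such as Pillow, each box could then be used as the target region when pasting a tile onto a blank canvas.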