Large Language Models for Mobile GUI Text Input Generation: An Empirical Study
Abstract
Mobile apps have become essential, making quality assurance increasingly important. GUI testing is widely used for automated exploration, yet text-input components remain a major obstacle, as many UI pages require semantically appropriate text inputs before the exploration can proceed. Large Language Models (LLMs) have shown promise in generating context-aware text, but the effectiveness of different UI representations, feedback mechanisms, and human intervention remains unclear. This paper presents a large-scale empirical study addressing these gaps. We evaluate nine state-of-the-art LLMs across 115 real-world apps, comparing three UI-context prompting settings: extracted textual context, UI-hierarchy XML, and screenshot-based vision input. Results show that extracted context and XML achieve comparable page-pass-through rates (PPTRs) of 71.4% and 71.0%, while vision-based input reaches 65.1% but incurs substantially higher token costs. In bug-detection experiments with 37 real-world text-input bugs, LLM-generated invalid inputs detect about 51% of the issues across all evaluated models. A feedback-enhanced protocol, which incorporates execution outcomes into subsequent attempts, improves average PPTRs to 69.2-73.8% and raises bug-detection rates to 51.0-64.5%. Human testers can further refine the inputs, yielding additional gains. We integrate the process into DroidBot, augmenting its UI-exploration capabilities. We derive actionable insights on context selection, cost-effectiveness, feedback strategies, and human-LLM collaboration, advancing both knowledge and practice in Android testing.
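To make the feedback-enhanced protocol more concrete, the following is a minimal Python sketch of one way such a loop could be structured; it is not the authors' implementation. The names `generate_with_feedback`, `page_pass_through_rate`, `demo_generate`, and `demo_execute` are hypothetical, the LLM and the app are replaced by stub functions, and the PPTR is assumed here to be the percentage of pages that are passed, which the abstract does not define precisely.

```python
from typing import Callable, List, Optional, Tuple

# One (candidate input, observed outcome) pair per failed attempt.
History = List[Tuple[str, str]]


def generate_with_feedback(
    generate: Callable[[str, History], str],
    execute: Callable[[str], Tuple[bool, str]],
    ui_context: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry text generation, feeding each failed outcome back to the generator.

    `generate` stands in for an LLM call and `execute` for entering the text
    into the app and observing the result; both are assumptions of this sketch.
    Returns the accepted input (or None) and the number of attempts used.
    """
    history: History = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(ui_context, history)
        passed, outcome = execute(candidate)
        if passed:
            return candidate, attempt
        # Record the execution outcome so the next attempt can use it.
        history.append((candidate, outcome))
    return None, max_attempts


def page_pass_through_rate(results: List[bool]) -> float:
    """Assumed definition: percentage of pages passed out of all pages tried."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Stub generator: first guess is naive, later guesses react to the feedback.
def demo_generate(ui_context: str, history: History) -> str:
    return "abc" if not history else "user@example.com"


# Stub app: the page is passed only when the text looks like an email address.
def demo_execute(text: str) -> Tuple[bool, str]:
    if "@" in text:
        return True, "page advanced"
    return False, "error: invalid email address"


if __name__ == "__main__":
    print(generate_with_feedback(demo_generate, demo_execute, "Email field"))
    print(page_pass_through_rate([True, False, True, True]))
```

In this sketch the history of failed attempts is passed back to the generator unchanged; how the execution outcome is actually encoded in the prompt is a design choice the abstract leaves open.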