What Matters for Grocery Product Retrieval with Open Source Vision Language Models
Abstract
Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6\% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density (ϕ), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5\% Recall@5 but suffer a 17.5\% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.