Data-Efficient Training by Evolved Sampling
Abstract
Data selection is designed to accelerate learning with preserved performance. To achieve this, a fundamental thought is to identify informative data samples with significant contributions to the training. In this work, we propose Evolved Sampling (ES), a simple yet effective framework for dynamic sampling along the training process. This method conducts batch level data selection based on the dynamics of losses and augmented loss differences, which enables flexible frequency tuning, and hence significantly reduces the back propagation time with maintained model performance. Due to its conciseness, ES is also readily extensible to incorporate set level data selection (to form ES with pruning, ESWP) for further accelerations. As a plug-and-play framework, ES(WP) consistently achieves lossless training accelerations across various pre-training and post-training tasks, saving up to nearly 45\% wall-clock time. Our results motivate further investigations on the data efficiency aspect of modern large-scale machine learning.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.