Cache-Aware I/O Cost Modeling for Disk-Based Learned Indexes

Abstract

Learned indexes have shown attractive space-time trade-offs in main-memory settings, yet a principled I/O cost model for their disk-resident deployments, a prerequisite for index tuning and query optimization, is still missing. The page buffer used in practice makes the problem harder: under typical cache policies, many of the logical page references issued by the index are served by the buffer rather than reaching disk, so the effective physical I/O depends jointly on the workload, the cache policy, and the index configuration. In this paper, we propose CAM, the first cache-aware I/O cost model for learned indexes that takes practical cache eviction policies into consideration. CAM is not tied to a particular learned index design: for mainstream learned index designs, it estimates page access distributions without full trace replay and then combines them with I/O cost models to estimate effective physical I/Os. This formulation enables principled knob tuning by explicitly modeling the trade-off between index footprint and buffer capacity. We instantiate CAM for disk-based PGM-index and RMI, and further apply the same modeling principle to learned-index-based joins through a hybrid strategy that adaptively chooses point or range probes based on local key density. Extensive experiments on real benchmarks show that CAM provides accurate and efficient I/O estimation across diverse workloads: CAM-guided tuning improves PGM throughput by 1.17× over multicriteria PGM tuning and improves RMI throughput by 1.66× over CDFShop with I/O-related considerations. For learned-index-based joins, our hybrid strategy improves end-to-end performance by up to 8.8× over disk-based index nested-loop join.
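To make the idea of combining a page access distribution with a cache model concrete, the following minimal Python sketch estimates physical I/Os for an LRU buffer. It is an illustration only, not CAM's estimator: the abstract does not specify CAM's formulas, so the sketch swaps in the well-known Che approximation for LRU under an independent reference model. The function names, the steady-state assumption (cold misses ignored), and the example distributions are all hypothetical.

```python
import math


def che_characteristic_time(probs, capacity):
    """Solve sum_i (1 - exp(-p_i * T)) = capacity for T by bisection.

    probs: per-page access probabilities (assumed to sum to 1).
    capacity: buffer size in pages (assumed 0 < capacity < number of hot pages).
    """
    def occupancy(t):
        # Expected number of distinct pages resident after "time" t.
        return sum(1.0 - math.exp(-p * t) for p in probs)

    lo, hi = 0.0, 1.0
    while occupancy(hi) < capacity:  # grow the bracket until it contains T
        hi *= 2.0
    for _ in range(200):  # bisection; occupancy is increasing in t
        mid = 0.5 * (lo + hi)
        if occupancy(mid) < capacity:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lru_miss_ratio(probs, capacity):
    """Approximate steady-state LRU miss ratio (Che approximation)."""
    if capacity <= 0:
        return 1.0  # no buffer: every logical reference reaches disk
    hot_pages = sum(1 for p in probs if p > 0)
    if capacity >= hot_pages:
        return 0.0  # everything fits; cold misses are ignored here
    t = che_characteristic_time(probs, capacity)
    hit = sum(p * (1.0 - math.exp(-p * t)) for p in probs)
    return 1.0 - hit


def expected_physical_ios(probs, capacity, logical_refs):
    """Logical page references scaled by the estimated miss ratio."""
    return logical_refs * lru_miss_ratio(probs, capacity)


# Hypothetical example: 100 pages, uniform vs. skewed (Zipf-like) access.
uniform = [1.0 / 100] * 100
weights = [1.0 / rank for rank in range(1, 101)]
zipf = [w / sum(weights) for w in weights]

uniform_ios = expected_physical_ios(uniform, 50, 1000)
zipf_ios = expected_physical_ios(zipf, 50, 1000)
```

In this sketch the skewed distribution yields fewer estimated physical I/Os than the uniform one at the same buffer size, which is the kind of workload-dependent effect a cache-aware cost model needs to capture. A larger index footprint would spread accesses over more pages while leaving less memory for the buffer, which is the trade-off the abstract refers to.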
