CAFS: A Cache-Aware Frequency Sort for Low-Cardinality Integer Data on x86-64

Abstract

Integer sorts in OLAP engines often run on columns whose cardinality K is much smaller than the array length N. After a group-by stage the intermediate key column has K bounded by the number of distinct group keys, and even a column-store scan typically operates on dictionary-encoded categorical fields where K never exceeds a few thousand. A comparison sort on such a column still pays Θ(N log N) comparisons, and a radix sort still pays Θ(N · B/b) byte passes, irrespective of K. This paper describes CAFS, an integer sort that does exploit low cardinality on x86-64 with AVX2. The algorithm combines a SIMD bucket sized to one cache line, a Chao1 cardinality estimator over 1024 strided samples (kept in a heap-allocated 40 KB open-addressing table), and an adaptive dispatcher backed by a spill safety guard. The hot loop is branchless and uses AVX2 cmpeq together with movemask and tzcnt to locate the matching lane. We benchmarked CAFS on a full-factorial grid of 58 array sizes N from 10³ to 3 · 10⁷ with dense K schedules per N, producing 592770 timed runs against pdqsort, IPS4o, vqsort, skasort, and std::sort. In the K ≪ N band the throughput is 1.7 to 3.1x that of pdqsort, 1.7 to 3.5x that of IPS4o, and 1.2 to 2.3x that of vqsort. The operational crossover against pdqsort is at K ≈ 1.3 · 10⁵; against skasort, K ≈ 8.14 · 10⁵; against vqsort, K ≈ 6.7 · 10⁵; and against IPS4o the curves only converge near K = N. Of the five baselines, only vqsort actually overtakes CAFS once the crossover is passed, which makes the vqsort threshold at K ≈ 6.7 · 10⁵ the binding constraint on the operational range of CAFS.
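
To make the cardinality-estimation step concrete, the following is a minimal sketch of a Chao1 estimate over strided samples. It is not the paper's implementation: the function name chao1_estimate is hypothetical, a std::unordered_map stands in for the 40 KB open-addressing table, and the classical Chao1 form S_obs + f1²/(2·f2) is assumed (with the usual f1(f1 − 1)/2 fallback when no value is seen exactly twice), since the abstract does not say which variant CAFS uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the paper): estimates the number of distinct
// values in `data` from at most `samples` evenly strided elements.
double chao1_estimate(const std::vector<std::uint32_t>& data,
                      std::size_t samples = 1024) {
    if (data.empty() || samples == 0) return 0.0;

    // Take n samples at a fixed stride; stride >= 1 because n <= data.size().
    const std::size_t n = data.size() < samples ? data.size() : samples;
    const std::size_t stride = data.size() / n;

    // Assumption: a standard hash map replaces the open-addressing table.
    std::unordered_map<std::uint32_t, std::size_t> counts;
    for (std::size_t i = 0; i < n; ++i) ++counts[data[i * stride]];

    // f1 = values sampled exactly once, f2 = values sampled exactly twice.
    double f1 = 0.0, f2 = 0.0;
    for (const auto& kv : counts) {
        if (kv.second == 1) f1 += 1.0;
        else if (kv.second == 2) f2 += 1.0;
    }

    const double s_obs = static_cast<double>(counts.size());
    if (f2 > 0.0) return s_obs + (f1 * f1) / (2.0 * f2);  // classical Chao1
    return s_obs + f1 * (f1 - 1.0) / 2.0;                 // fallback, f2 == 0
}
```

A dispatcher could compare such an estimate against a crossover threshold to decide whether to take the frequency-sort path or fall back to a general-purpose sort; how CAFS actually makes that decision is not specified in the abstract.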
