Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters

Abstract

General matrix multiplication (GEMM) operations are fundamental building blocks of many computational domains, including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important, optimizing GEMM performance has become a key challenge. This paper introduces Stream-K++, an enhancement to the promising Stream-K GEMM scheduling algorithm for workload balancing. We expand Stream-K's scheduling policies from three to seven and implement an efficient solution selection mechanism using Bloom filters. Our approach rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. Implemented using the AMD Composable Kernel library and evaluated on AMD Instinct MI250X GPUs, Stream-K++ demonstrates significant performance gains (up to 43%) in select scenarios. It remains competitive (within 20% of optimal) for 60% to 97.6% of problem sizes. Our flexible framework, implemented in the Open-sieve C++ library, allows for easy adaptation to new problem sizes, scheduling policies, or additional tuning parameters, paving the way for future optimizations in GPU-based GEMM operations.
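To illustrate how Bloom filters can prune scheduling policies, the following is a minimal sketch, not the Open-sieve implementation. It assumes one filter per policy, populated offline with the problem sizes for which that policy performed acceptably. All names here (ProblemSize, BloomFilter, candidate_policies, mix64) and the filter parameters are hypothetical. Because a Bloom filter never reports "absent" for a key that was inserted, a policy rejected by its filter can be discarded safely; a policy that passes may still be a false positive and needs a further check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical key: a GEMM problem size (M, N, K).
struct ProblemSize {
    std::uint32_t m, n, k;
};

// splitmix64-style bit mixer, used here only to derive hash values.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

class BloomFilter {
public:
    BloomFilter(std::size_t num_bits, unsigned num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Set the bit selected by each of the num_hashes_ hash functions.
    void insert(const ProblemSize& p) {
        for (unsigned i = 0; i < num_hashes_; ++i) bits_[index(p, i)] = true;
    }

    // false: the key was definitely never inserted.
    // true:  the key was possibly inserted (false positives can occur).
    bool maybe_contains(const ProblemSize& p) const {
        for (unsigned i = 0; i < num_hashes_; ++i)
            if (!bits_[index(p, i)]) return false;
        return true;
    }

private:
    // Double hashing: the i-th bit index is (h1 + i * h2) mod num_bits.
    std::size_t index(const ProblemSize& p, unsigned i) const {
        std::uint64_t h1 = mix64((static_cast<std::uint64_t>(p.m) << 32) | p.n);
        h1 = mix64(h1 ^ p.k);
        std::uint64_t h2 = mix64(h1) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    unsigned num_hashes_;
};

// Assumption: filters[i] holds the problem sizes for which policy i was
// found suitable. Returns the indices of policies that survive pruning.
std::vector<std::size_t> candidate_policies(
    const std::vector<BloomFilter>& filters, const ProblemSize& p) {
    std::vector<std::size_t> survivors;
    for (std::size_t i = 0; i < filters.size(); ++i)
        if (filters[i].maybe_contains(p)) survivors.push_back(i);
    return survivors;
}
```

With seven filters, one per scheduling policy, a query for a given (M, N, K) costs a handful of bit lookups per policy, so most policies can be ruled out before any kernel is benchmarked or launched. The filter size and hash count trade memory against the false-positive rate and would need tuning for a real workload.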
