Improved Scaling for Fast Mode of Ozaki Scheme II

Abstract

Ozaki scheme II emulates high-precision matrix multiplication using low-precision integer matrix operations based on the Chinese remainder theorem (CRT). It first scales the high-precision matrices to convert them into integer matrices. For this scaling step, Ozaki scheme II provides two modes: accurate mode, which uses INT8 matrix multiplication to estimate scaling factors, and fast mode, which applies the Cauchy-Schwarz inequality at lower computational cost. We show that the existing fast-mode scaling formula lacks scale invariance: multiplying the input matrices by a constant changes the effective bit width of the integer matrices produced by the scaling step, causing accuracy degradation or CRT recovery failure. To address this, we propose a revised scaling formula derived from the CRT uniqueness condition via the Cauchy-Schwarz inequality. The proposed formula is scale-invariant by construction, guarantees that the CRT uniqueness condition is always satisfied, and introduces no additional overhead over the original fast mode. Experiments on an NVIDIA GH200 GPU show that the proposed method achieves accuracy comparable to that of accurate mode while maintaining throughput comparable to that of fast mode. In the accuracy-throughput trade-off, the proposed method overcomes the accuracy limitation of fast mode and the throughput constraint of accurate mode, offering superior accuracy and performance.
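The ideas named in the abstract (scaling to integers, a Cauchy-Schwarz bound, the CRT uniqueness condition, and scale invariance) can be illustrated with a small pure-Python sketch. It is not the paper's formula, which the abstract does not state. It assumes one global power-of-two scale per matrix, truncation toward zero, and four small prime moduli chosen only for illustration. The names `MODULI`, `scale_to_int`, `crt_matmul`, and `emulated_matmul` are hypothetical. The scale is chosen so that every row norm of the first matrix and every column norm of the second is at most T, with T^2 < P/2, so the Cauchy-Schwarz inequality gives |c_ij| < P/2 and the residues determine each entry uniquely.

```python
import math

# Illustrative pairwise-coprime moduli (distinct primes); not taken from the paper.
MODULI = [251, 241, 239, 233]
P = math.prod(MODULI)
# Largest integer T with T*T <= (P - 1) // 2, hence T*T < P / 2.
T = math.isqrt((P - 1) // 2)


def scale_to_int(M, by_rows):
    """Truncate M * 2**shift to integers so each row (or column) norm is <= T."""
    vectors = M if by_rows else list(zip(*M))
    norm = max(math.sqrt(sum(x * x for x in v)) for v in vectors)
    # frexp gives T / norm = m * 2**e with 0.5 <= m < 1, so 2**(e - 1) <= T / norm.
    shift = math.frexp(T / norm)[1] - 1 if norm > 0 else 0
    # Truncation never increases magnitudes, so the norm bound survives rounding.
    ints = [[math.trunc(math.ldexp(x, shift)) for x in row] for row in M]
    return ints, shift


def crt_matmul(A, B):
    """Integer product A @ B recovered from its residues modulo each p in MODULI."""
    cols = list(zip(*B))
    # Exact integer form of the Cauchy-Schwarz check 2 * max|c_ij| < P.
    row_sq = max(sum(a * a for a in row) for row in A)
    col_sq = max(sum(b * b for b in col) for col in cols)
    if 4 * row_sq * col_sq >= P * P:
        raise ValueError("CRT uniqueness condition not guaranteed")
    C = [[0] * len(cols) for _ in A]
    for p in MODULI:
        Mp = P // p
        weight = Mp * pow(Mp, -1, p)  # CRT basis: 1 mod p, 0 mod the other moduli
        for i, row in enumerate(A):
            for j, col in enumerate(cols):
                r = sum((a % p) * (b % p) for a, b in zip(row, col)) % p
                C[i][j] = (C[i][j] + r * weight) % P
    # Map residues in [0, P) to the symmetric range around zero.
    return [[c - P if c > P // 2 else c for c in row] for row in C]


def emulated_matmul(A, B):
    """Floating-point product A @ B computed through integer residues."""
    Ai, sa = scale_to_int(A, by_rows=True)
    Bi, sb = scale_to_int(B, by_rows=False)
    Ci = crt_matmul(Ai, Bi)
    return [[math.ldexp(float(c), -(sa + sb)) for c in row] for row in Ci]


A = [[1.5, -2.25], [0.5, 3.0]]
B = [[2.0, 0.25], [-1.0, 4.0]]
print(emulated_matmul(A, B))  # [[5.25, -8.625], [-2.0, 12.125]]

# Scale invariance: multiplying A by 8 changes only the shift, not the integers.
A8 = [[8 * x for x in row] for row in A]
print(scale_to_int(A, True)[0] == scale_to_int(A8, True)[0])  # True
```

Because the shift in this sketch is derived from the matrix norms themselves, a constant factor applied to the input is absorbed by the shift and the integer matrix keeps the same bit width. That is the property the abstract calls scale invariance. A real implementation would use per-row and per-column scales and INT8 matrix products on the GPU, which this sketch does not attempt.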
