A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi

J. Q. Lu

Abstract

We report end-to-end inference of Qwen3.6-35B-A3B, a 35-billion-parameter, 3B-active Mixture-of-Experts (MoE) model with a hybrid gated-delta-net / full-attention backbone, on a 2011 NVIDIA Tesla C2075 (Fermi, compute capability 2.0, 6 GB). This GPU predates tensor cores, native FP16 arithmetic, and the DP4A integer dot-product instruction, and it is no longer supported by any modern CUDA toolchain. Because the 4-bit model (≈10.5 GB) is roughly twice the size of device memory, we adopt a hybrid execution strategy: the GPU performs batched prompt prefill with expert weights streamed layer by layer from host RAM, while decode runs on the host CPU using a hand-written W4A8 integer GEMV built on the SSSE3 pmaddubsw instruction. The entire engine (GEMM, hybrid-attention recurrence, MoE routing, and a from-scratch vision tower) is written by hand for Fermi and compiled with the legacy CUDA 8.0 toolchain. On a 947-token prompt, expert pinning, single-pass prefill, and NUMA interleaving reduce prefill latency from 57.2 s to 37.5 s (a 34% reduction), and the integer-SIMD kernel raises decode throughput from 2.8 to 8.6 tokens/s (≈ 3×). A position-indexed snapshot cache for the recurrent (gated-delta-net) state restores prefix reuse on a recurrent architecture, cutting a repeated 78 s prefill to 0.5 s. We also report a set of negative results: offloading the language-model head to the idle GPU, enabling hyper-threading, and three GPU-kernel rewrites all fail to help, which together pin down the practical floor of this hardware. Our aim is not a speed record but a careful account of what it takes, and where the walls are, to run a contemporary frontier-class MoE on fourteen-year-old silicon.
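
To make the decode-side idea concrete, the following is a minimal sketch of a W4A8 dot product (one row of a GEMV) built on pmaddubsw, exposed in C as the SSSE3 intrinsic _mm_maddubs_epi16. The abstract does not describe the actual kernel, so everything specific here is an assumption for illustration: the nibble packing (element 2j in the low nibble of byte j, element 2j+1 in the high nibble), the zero point of 8, the omission of per-block scales, and the function names dot_w4a8_ref, dot_w4a8_ssse3, and dot_w4a8. Because pmaddubsw multiplies unsigned bytes by signed bytes, the sketch feeds the raw nibbles (0..15) as the unsigned operand and subtracts the zero-point term 8·a separately.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSSE3__
#include <tmmintrin.h> /* SSSE3: _mm_maddubs_epi16 (pmaddubsw) */
#endif

/* Assumption: each nibble stores w + 8, so the signed weight w is in [-8, 7]. */
#define W4_ZERO_POINT 8

/* Scalar reference: sum over i of (nibble_i - 8) * act_i.
 * Assumed packing: byte j holds element 2j (low nibble) and 2j+1 (high nibble). */
int32_t dot_w4a8_ref(const uint8_t *packed_w, const int8_t *act, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        uint8_t byte = packed_w[i / 2];
        int32_t nib = (i & 1) ? (byte >> 4) : (byte & 0x0F);
        acc += (nib - W4_ZERO_POINT) * (int32_t)act[i];
    }
    return acc;
}

#ifdef __SSSE3__
/* SSSE3 path: 16 elements per iteration; n must be a multiple of 16. */
int32_t dot_w4a8_ssse3(const uint8_t *packed_w, const int8_t *act, size_t n) {
    const __m128i nibble_mask = _mm_set1_epi8(0x0F);
    const __m128i zero_point = _mm_set1_epi8(W4_ZERO_POINT);
    const __m128i ones16 = _mm_set1_epi16(1);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        /* 8 packed bytes hold 16 nibbles. */
        __m128i b = _mm_loadl_epi64((const __m128i *)(packed_w + i / 2));
        __m128i lo = _mm_and_si128(b, nibble_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(b, 4), nibble_mask);
        /* Interleave low and high nibbles back into element order. */
        __m128i w = _mm_unpacklo_epi8(lo, hi);
        __m128i a = _mm_loadu_si128((const __m128i *)(act + i));
        /* pmaddubsw: unsigned w times signed a, adjacent pairs summed to
         * int16. With w <= 15 the pair sums cannot saturate. */
        __m128i wa = _mm_maddubs_epi16(w, a);
        /* Zero-point term 8 * a, computed the same way. */
        __m128i za = _mm_maddubs_epi16(zero_point, a);
        __m128i d = _mm_sub_epi16(wa, za);
        /* pmaddwd with ones widens the int16 pair sums to int32. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(d, ones16));
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
#endif

/* Dispatch: use the SIMD path when compiled with SSSE3 enabled and the
 * length allows it; otherwise fall back to the scalar reference. */
int32_t dot_w4a8(const uint8_t *packed_w, const int8_t *act, size_t n) {
#ifdef __SSSE3__
    if (n % 16 == 0) {
        return dot_w4a8_ssse3(packed_w, act, n);
    }
#endif
    return dot_w4a8_ref(packed_w, act, n);
}
```

Compiling with -mssse3 (GCC or Clang) enables the intrinsic path; without it the sketch falls back to the scalar loop, which computes the same integer result. A real quantized GEMV would additionally multiply each block's integer sum by the weight and activation scales, which this sketch leaves out.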
