Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs
Abstract
We study post-training quantization (PTQ) of Ideogram 4.0, a 9.3B flow-matching diffusion transformer (DiT) that realizes classifier-free guidance with two separate-weight copies of a single-stream backbone and is conditioned by a Qwen3-VL text encoder, targeting Ampere RTX~3090 GPUs, which lack FP8 tensor cores. Because Ideogram~4.0 is trained on structured JSON captions, we evaluate every variant under schema-valid JSON prompts produced by an LLM expander built to Ideogram's published caption specification, and score them with a battery spanning human-preference (HPSv2), CLIP, and PickScore for standalone quality; PP-OCR exact-match and edit distance for text; and PSNR/SSIM/LPIPS for fidelity to the FP8 reference (the highest-precision public checkpoint) output. On a 300-prompt benchmark with paired bootstrap confidence intervals, an INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and bf16 protection of a small high-fragility layer set) is statistically indistinguishable from FP8 on CLIP and PickScore (paired CIs include zero) and within ~0.004 HPSv2, and, at its 8-bit size, is the most faithful reproduction of the FP8 output (LPIPS 0.243 vs 0.277/0.306 for the half-size 4-bit baselines; the INT8-Q4K gap excludes zero). A GGUF Q4K quantization reaches the same standalone quality as the published NF4 baseline at the same on-disk size, making it the Pareto choice on the quality-memory frontier. We further show that under JSON prompts all four variants reach parity on standalone quality, the variants separate on fidelity and text rendering, not on aggregate image-quality scores, and that text legibility, near-zero when the model is prompted with raw strings, reaches 55% OCR exact-match under the JSON captions it expects. We release the INT8 W8A8 and GGUF Q4K quantized weights on Hugging Face under a gated, non-commercial license.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.