AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Abstract

Optimizing AscendC (Ascend C) operators for Ascend NPUs is difficult for two reasons. First, unlike CUDA, the ecosystem offers few public kernels to learn from. Second, performance depends on a coupled two-part implementation: a host-side tiling program that controls data movement and a kernel program that schedules and pipelines computation. We present AscendOptimizer, an episodic agent that builds missing optimization knowledge from execution itself. For kernel optimization, AscendOptimizer rewinds strong implementations by removing optimizations in a controlled way, then keeps the changes whose removal measurably hurts performance as reusable experience for later rewriting. For host-side optimization, it runs profiling-in-the-loop evolutionary search to find valid, fast tiling and data-movement configurations directly from hardware feedback. This combination lets the agent improve kernel structure and host-side scheduling together. On a benchmark of 101 real AscendC operators, AscendOptimizer achieves a 1.21x geometric-mean speedup over the open-source baseline, and 53.47% of operators run faster than their references. Given the same budget of evaluations per operator, AscendOptimizer consistently outperforms Best-of-N sampling and OpenEvolve in geometric-mean speedup, fast_p tail speedup ratios, and overall optimization progress across varying budgets.
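The headline numbers above are aggregate metrics over per-operator speedups. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes that a speedup is the reference time divided by the optimized time, and that the tail metric is the fraction of operators whose speedup strictly exceeds a threshold p. The function names and the timings are hypothetical. As a check on the arithmetic in the abstract, 53.47% of 101 operators corresponds to 54 operators (54/101 ≈ 0.5347).

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of per-operator speedups.

    Assumption: each speedup is reference_time / optimized_time,
    so values above 1.0 mean the optimized operator is faster.
    """
    if not speedups:
        raise ValueError("need at least one speedup")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def fast_p(speedups, p=1.0):
    """Fraction of operators whose speedup strictly exceeds p.

    Assumption: this mirrors the threshold-style tail metric named in
    the abstract; the paper may define it differently (for example,
    by also requiring correctness).
    """
    if not speedups:
        raise ValueError("need at least one speedup")
    return sum(1 for s in speedups if s > p) / len(speedups)


# Hypothetical timings in microseconds, NOT results from the paper.
reference_us = [100.0, 80.0, 50.0, 40.0]
optimized_us = [50.0, 80.0, 25.0, 80.0]

speedups = [ref / opt for ref, opt in zip(reference_us, optimized_us)]
gm = geometric_mean_speedup(speedups)
faster_than_reference = fast_p(speedups, p=1.0)

print(f"geometric-mean speedup: {gm:.3f}x")
print(f"fraction faster than reference: {faster_than_reference:.2%}")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out, which an arithmetic mean would not reflect.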
