Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer

Abstract

We study the algorithmic optimization and performance tuning of the Lattice QCD clover-fermion solver for the K computer. We implement Lüscher's SAP (Schwarz alternating procedure) preconditioner with sub-blocking, in which the lattice block in a node is further divided into several sub-blocks to extract enough parallelism for the 8-core SPARC64™ VIIIfx CPU of the K computer. To achieve better convergence, we use symmetric successive over-relaxation (SSOR) iteration with locally lexicographical ordering of the sub-blocks to obtain the block inverse. The SAP preconditioner is used in the single-precision BiCGStab solver of the nested BiCGStab solver. The single-precision parts of the computational kernel are written solely with SIMD-oriented intrinsics to achieve the best performance on the K computer. We benchmark the single-precision BiCGStab solver on three lattice sizes, $12^3 \times 24$, $24^3 \times 48$ and $48^3 \times 96$, with the local lattice size in a node fixed at $6^3 \times 12$. We observe ideal weak-scaling performance from 16 nodes to 4096 nodes. The performance of a computational kernel exceeds 50% efficiency, and the single-precision BiCGStab solver has 26% sustained efficiency.
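The weak-scaling setup can be checked with a little arithmetic. The sketch below assumes that each node holds exactly one non-overlapping $6^3 \times 12$ local lattice, so the node count is simply the global volume divided by the local volume in each direction. The function name `nodes_needed` and the constants are illustrative and not taken from the paper.

```python
from math import prod

# Local lattice per node, as stated in the abstract.
LOCAL = (6, 6, 6, 12)

def nodes_needed(global_dims, local_dims=LOCAL):
    """Number of nodes, assuming the global lattice is tiled evenly
    by one local lattice per node (an assumption of this sketch)."""
    if any(g % l for g, l in zip(global_dims, local_dims)):
        raise ValueError("global lattice is not an integer multiple of the local lattice")
    return prod(g // l for g, l in zip(global_dims, local_dims))

# The three benchmark lattices from the abstract.
LATTICES = [(12, 12, 12, 24), (24, 24, 24, 48), (48, 48, 48, 96)]
NODE_COUNTS = [nodes_needed(dims) for dims in LATTICES]

for dims, n in zip(LATTICES, NODE_COUNTS):
    print(dims, "->", n, "nodes")
```

Under this assumption the smallest and largest lattices map to 16 and 4096 nodes, matching the range quoted in the abstract, and the intermediate lattice falls at 256 nodes.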
