A performance evaluation of CCS QCD Benchmark on the COMA (Intel(R) Xeon Phi(TM), KNC) system
Abstract
The most computationally demanding part of Lattice QCD simulations is solving for quark propagators. Quark propagators are typically obtained with a linear equation solver running on HPC machines. The CCS QCD Benchmark is a benchmark program that solves for the Wilson-Clover quark propagator and is developed at the Center for Computational Sciences (CCS), University of Tsukuba. We optimized the benchmark program for an Intel Xeon Phi (Knights Corner, KNC) system named "COMA (PACS-IX)" at CCS Tsukuba under the Intel Parallel Computing Center program. A single-precision BiCGStab solver with the overlapped Restricted Additive Schwarz (RAS) preconditioner was implemented using SIMD intrinsics, OpenMP, and MPI in the offload mode. With the reverse-offloading technique, we were able to reduce the communication and offloading overheads. We observed a sustained performance of 200 GFlops for the Wilson-Clover hopping matrix multiplication on lattice sizes larger than 24^3 × 32 on a single card of the COMA system. Good weak-scaling performance was observed for local lattice sizes larger than 24^3 × 32.
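For readers unfamiliar with the solver named above, the following is a minimal sketch of a right-preconditioned BiCGStab iteration in plain Python. It is an illustration only and not the benchmark's implementation: it runs in double precision on a small dense test matrix instead of the Wilson-Clover operator, and the preconditioner defaults to the identity rather than RAS. The names `bicgstab`, `apply_A`, `precond`, and the 4x4 matrix are all hypothetical and chosen for this example.

```python
import math


def dot(u, v):
    """Inner product of two real vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a real vector."""
    return math.sqrt(dot(u, u))


def bicgstab(apply_A, b, precond=None, tol=1e-10, max_iter=200):
    """Solve A x = b with right-preconditioned BiCGStab.

    apply_A : function v -> A v (hypothetical stand-in for the operator)
    precond : function v -> M^{-1} v; identity if None (RAS is NOT implemented)
    Returns (x, iterations_used).
    """
    if precond is None:
        precond = lambda vec: list(vec)
    n = len(b)
    x = [0.0] * n
    r = list(b)          # initial residual for the zero initial guess
    r_hat = list(r)      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = norm(b)
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        if rho_new == 0.0:
            raise ArithmeticError("BiCGStab breakdown: rho is zero")
        beta = (rho_new / rho) * (alpha / omega)
        # Update the search direction: p = r + beta * (p - omega * v)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        y = precond(p)
        v = apply_A(y)
        alpha = rho_new / dot(r_hat, v)
        # Intermediate residual after the BiCG half-step
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if norm(s) <= tol * b_norm:
            x = [xi + alpha * yi for xi, yi in zip(x, y)]
            return x, it
        z = precond(s)
        t = apply_A(z)
        # Stabilizing step: omega minimizes the norm of s - omega * t
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * yi + omega * zi for xi, yi, zi in zip(x, y, z)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if norm(r) <= tol * b_norm:
            return x, it
        rho = rho_new
    raise ArithmeticError("BiCGStab did not converge within max_iter")


# Hypothetical nonsymmetric test matrix (diagonal plus antisymmetric coupling).
A = [
    [4.0, 1.0, 0.0, 0.0],
    [-1.0, 4.0, 1.0, 0.0],
    [0.0, -1.0, 4.0, 1.0],
    [0.0, 0.0, -1.0, 4.0],
]


def apply_A(vec):
    """Dense matrix-vector product with the test matrix above."""
    return [dot(row, vec) for row in A]


x_true = [1.0, 2.0, 3.0, 4.0]
b = apply_A(x_true)
x, iters = bicgstab(apply_A, b)
```

In a production code the matrix-vector product and the preconditioner dominate the cost, which is why the abstract reports performance for the hopping matrix multiplication. In this sketch they are passed in as callables so that either could be swapped out without touching the iteration itself.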