Distributed-memory H-matrix Algebra I: Data Distribution and Matrix-vector Multiplication
Abstract
We introduce a data distribution scheme for H-matrices and a distributed-memory algorithm for H-matrix-vector multiplication. Our data distribution scheme avoids an expensive $\Omega(P^2)$ scheduling procedure used in previous work, where $P$ is the number of processes, while data balance is well preserved. Based on this data distribution, our distributed-memory algorithm evenly distributes all computations among the $P$ processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is $O\left(\frac{N \log N}{P} + \alpha \log P + \beta \log^2 P\right)$ for H-matrices under the weak admissibility condition, where $N$ is the matrix size, $\alpha$ denotes the latency, and $\beta$ denotes the inverse bandwidth. Numerically, our algorithm is applied to both two- and three-dimensional problems of various sizes on various numbers of processes. Good parallel efficiency is still observed on thousands of processes.
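To make the three terms of the stated complexity concrete, the following minimal sketch evaluates the computation, latency, and bandwidth contributions for given N, P, α, and β. It is an illustration only: the function name `modeled_cost`, the unit constants in front of each term, and the base-2 logarithm are assumptions, since the asymptotic bound hides constant factors.

```python
import math


def modeled_cost(n, p, alpha, beta):
    """Evaluate N log N / P + alpha log P + beta log^2 P.

    Assumptions (not from the paper): every hidden constant is 1 and
    logarithms are base 2. n is the matrix size, p the number of
    processes, alpha the latency, beta the inverse bandwidth.
    """
    log_p = math.log2(p)
    compute = n * math.log2(n) / p   # work is split evenly over p processes
    latency = alpha * log_p          # tree communication: log P message rounds
    bandwidth = beta * log_p ** 2    # communicated volume term
    return compute + latency + bandwidth


if __name__ == "__main__":
    # Hypothetical machine parameters, chosen only for illustration.
    for p in (1, 16, 256, 4096):
        print(p, modeled_cost(n=2**20, p=p, alpha=100.0, beta=1.0))
```

Under this model the computation term shrinks in proportion to 1/P while the two communication terms grow only polylogarithmically in P, which is the behavior the complexity bound describes.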