Accelerating numerical algorithms for solving sparse linear systems on parallel architectures has attracted the attention of many researchers due to their applicability to many engineering and scientific problems. The solution of sparse systems often dominates the overall execution time of such problems and is mainly solved by iterative methods. Preconditioners are used to accelerate the convergence rate of these solvers and reduce the total execution time. Sparse Approximate Inverse (SAI) preconditioners are a popular class of preconditioners designed to improve the condition number of large sparse matrices and accelerate the convergence rate of iterative solvers for sparse linear systems. We propose a GPU accelerated SAI preconditioning technique called GSAI, which parallelizes the computation of this preconditioner on NVIDIA graphic cards. The preconditioner is then used to enhance the convergence rate of the BiConjugate Gradient Stabilized (BiCGStab) iterative solver on the GPU. The SAI preconditioner is generated on average 28 and 23 times faster on the NVIDIA GTX480 and TESLA M2070 graphic cards respectively compared to ParaSails (a popular implementation of SAI preconditioners on CPU) single processor/core results. The proposed GSAI technique computes the SAI preconditioner in approximately the same time as ParaSails generates the same preconditioner on 16 AMD Opteron 252 processors.
(Maryam Mehri Dehnavi, David Fernandez, Jean-Luc Gaudiot and Dennis Giannacopoulos: “Parallel Sparse Approximate Inverse Preconditioning on Graphic Processing Units”, IEEE Transactions on Parallel and Distributed Systems (to appear). [DOI])
In this work, we evaluate OpenCL as aprogramming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers’ optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels’ parameter space using search harness.
(Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra, “From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming”, Parallel Computing 38(8):391–407, Aug. 2012. [DOI] [early techreport])
A wide range of applications in engineering and scientific computing are involved in the acceleration of the sparse matrix vector product (SpMV). Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for GPUs have already appeared on the scene. This work is focused on the ELLR-T algorithm to compute SpMV on GPU architecture, its performance is strongly dependent on the optimum selection of two parameters. Therefore, taking account that the memory operations dominate the performance of ELLR-T, an analytical model is proposed in order to obtain the auto-tuning of ELLR-T for particular combinations of sparse matrix and GPU architecture. The evaluation results with a representative set of test matrices show that the average performance achieved by auto-tuned ELLR-T by means of the proposed model is near to the optimum. A comparative analysis of ELLR-T against a variety of previous proposals shows that ELLR-T with the estimated configuration reaches the best performance on GPU architecture for the representative set of test matrices.
(Francisco Vázquez and José Jesús Fernández and Ester M. Garzón: “Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach”, Parallel Computing 38(8), 408-420, Aug. 2012. [DOI])
A novel algorithm for computing the incomplete-LU and Cholesky factorization with 0 fill-in on a graphics processing unit (GPU) is proposed. It implements the incomplete factorization of the given matrix in two phases. First, the symbolic analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the numerical factorization phase obtains the resulting lower and upper sparse triangular factors by iterating sequentially across the constructed levels. The Gaussian elimination of the elements below the main diagonal in the rows corresponding to each single level is performed in parallel. The numerical experiments are also presented and it is shown that the numerical factorization phase can achieve on average more than 2.8x speedup over MKL, while the incomplete-LU and Cholesky preconditioned iterative methods can achieve an average of 2x speedup on GPU over their CPU implementation.
(Maxim Naumov. Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU, NVIDIA Technical Report NVR-2012-003. May 2012.)
We study the use of a GPU for the numerical approximation of the curvature dependent flows of graphs – the mean-curvature flow and the Willmore flow. Both problems are often applied in image processing where fast solvers are required. We approximate these problems using the complementary finite volume method combined with the method of lines. We obtain a system of ordinary differential equations which we solve by the Runge–Kutta–Merson solver. It is a robust solver with an automatic choice of the integration time step. We implement this solver on CPU but also on GPU using the CUDA toolkit. We demonstrate that the mean-curvature flow can be successfully approximated in single precision arithmetic with the speed-up almost 17 on the Nvidia GeForce GTX 280 card compared to Intel Core 2 Quad CPU. On the same card, we obtain the speed-up 7 in double precision arithmetic which is necessary for the fourth order problem – the Willmore flow of graphs. Both speed-ups were achieved without affecting the accuracy of the approximation. The article is structured in such way that the reader interested only in the implementation of the Runge–Kutta–Merson solver on the GPU can skip the sections containing the mathematical formulation of the problems.
(Oberhuber T., Suzuki A., Žabka V.: “The CUDA implementation of the method of lines for the curvature dependent flows”, Kybernetika 47(2):251–272, 2011. [PDF])
Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and many-core processors lead to a gap in the progression of algorithms and programming environments for these platforms — the parallel models for large clusters do not fully utilize the performance capability of the multi-core CPUs and especially of the GPUs. Software stack needs to run adequately on the next generation of computing devices in order to exploit the potential of these new systems. Moving numerical software from one platform to another becomes an important task since every parallel device has its own programming model and language. The greatest challenge is to provide new techniques for solving (non-)linear systems that combine scalability, portability, fine-grained parallelism and flexibility across the assortment of parallel platforms and programming models. The goal of this thesis is to provide new fine-grained parallel algorithms embedded in advanced sparse linear algebra solvers and preconditioners on the emerging multi-core and many-core technologies.
Read the rest of this entry »
SpeedIT 2.0 and the SpeedIT plugin to OpenFOAM have been released. New features include:
- One of the fastest Sparse Matrix Vector Multiplication worldwide.
- Faster Conjugate Gradient and BiConjugate Gradient solvers.
- State-of-the-art CMRS format for storing sparse matrices. The format requires less memory than CRS or HYB (from CUSPARSE and CUSP).
- Faster acceleration in OpenFOAM (Computational Fluid Dynamics).
More information is available at http://speed-it.vratis.com.
We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique based on cascades of sparse matrix-vector multiplication by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve a speedup of an average of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers can deliver a speedup of 3-5 depending on the element space, compared to simple Jacobi smoothing. This can even be enhanced to a factor of 7 when combining the usage of Approximate Inverse-based smoothers with clever sorting of the degrees of freedom. In total the FE-gMG solver can outperform a simple, (multicore-)CPU-based multigrid by a total factor of over 40.
(Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac and Stefan Turek: “Towards a complete FEM-based simulation toolkit on GPUs: Unstructured Grid Finite Element Geometric Multigrid solvers with strong smoothers based on Sparse Approximate Inverses”, accepted for publication in Computers and Fluids, 2011. [preprint])
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that signiﬁcantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU speciﬁc optimization techniques –allows us to easily remove performance oscillations associated with problem dimensions not divisible by ﬁxed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware conﬁguration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU speciﬁc acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration e!ect to higher level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.
(R. Nath, S. Tomov and J. Dongarra: “Accelerating GPU Kernels for Dense Linear Algebra”, VECPAR 2010. [PDF])
We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.
(R. Nath and S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])