You are here: Home » Archives for Numerical Algorithms
November 14th, 2011
Abstract:
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimization techniques –allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration e!ect to higher level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.
(R. Nath, S. Tomov and J. Dongarra: “Accelerating GPU Kernels for Dense Linear Algebra”, VECPAR 2010. [PDF])
Posted in Research | Tags: Dense Linear Algebra, Numerical Algorithms, NVIDIA CUDA, Papers | Write a comment
November 14th, 2011
Abstract:
We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.
(R. Nath and S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])
Posted in Research | Tags: Dense Linear Algebra, Numerical Algorithms, NVIDIA CUDA, NVIDIA FERMI, Papers | Write a comment
November 10th, 2011
EM Photonics has released CULA Sparse, a ready-to-integrate package for solving sparse linear systems. Features include:
- Interfaces: C, C++, Fortran, Matlab, Python
- Platforms: all CUDA platforms. including Linux, Windows, and OS X
- Solvers and preconditioners: BiCG, BiCGStab, CG, GMRES, MINRES and Jacobi, ILU(0)
- Data formats: COO, CSR, CSC in double precision real and complex floating point
- No CUDA programming experience required.
More information is available at http://www.culatools.com/sparse.
Posted in Business, Developer Resources | Tags: Libraries, Numerical Algorithms, NVIDIA CUDA, Sparse Linear Systems | Write a comment
September 4th, 2011
Abstract:
We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and {\em downdate} the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.
(Yuancheng Luo and Ramani Duraiswami, “Efficient Parallel Non-Negative Least Squares on Multicore Architectures”, SIAM Journal on Scientific Computing, accepted, Sep. 2011. [PDF] [Source code])
Posted in Research | Tags: Least-squares, Linear Algebra, Numerical Algorithms, NVIDIA CUDA, Papers | Write a comment
September 2nd, 2011
Abstract:
A Helmholtz equation in two dimensions discretized by a second order finite difference scheme is considered. Krylov methods such as Bi-CGSTAB and IDR(s) have been chosen as solvers. Since the convergence of the Krylov solvers deteriorates with increasing wave number, a shifted Laplace multigrid preconditioner is used to improve the convergence. The implementation of the preconditioned solver on CPU (Central Processing Unit) is compared to an implementation on GPU (Graphics Processing Units or graphics card) using CUDA (Compute Unified Device Architecture). The results show that preconditioned Bi-CGSTAB on GPU as well as preconditioned IDR(s) on GPU is about 30 times faster than on CPU for the same stopping criterion.
(H. Knibbe, C.W. Oosterlee and C. Vuik, “GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method”, accepted for publication in the Journal of Computational and Applied Mathematics, 2011. [DOI])
Posted in Research | Tags: Multigrid, Numerical Algorithms, NVIDIA CUDA, Papers | Write a comment
August 4th, 2011
Abstract:
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2x in the setup phase and around 6x in the cycling phase when compared to a representative CPU implementation.
(Nathan Bell, Steven Dalton and Luke Olson: “Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods”, NVIDIA Technical Report NVR-2011-002, June 2011 [PDF and Sources])
Posted in Research | Tags: Iterative Solvers, Multigrid, Numerical Algorithms, Papers, Sparse Linear Systems | Write a comment
July 29th, 2011
Abstract:
Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics and for maintaining a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution – especially with respect to the definition of adequate smoothers. This point becomes in particular pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid that has high flexibility and adapts well to the exigences of modern computing platforms.
In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
V. Heuveline, D. Lukarski, N. Trost and J.-P. Weiss. Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs. EMCL Preprint Series No. 9. 2011.
Posted in Research | Tags: Multigrid, Numerical Algorithms, Papers, Scientific Computing | Write a comment
June 26th, 2011
Abstract:
A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel. The numerical experiments are also presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods, using the parallel sparse triangular solve algorithm, can achieve on average more than 2x speedup on graphical processing units (GPUs) over their CPU implementation.
(Maxim Naumov: “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”, NVIDIA Technical Report, June 2011. [WWW])
Posted in Developer Resources, Research | Tags: Numerical Algorithms, NVIDIA CUDA, Papers, Sparse Linear Systems | Write a comment
May 4th, 2011
ofgpu is a free GPL library from Symscape that provides GPU linear solvers for OpenFOAM®. The experimental library targets NVIDIA CUDA devices on Windows, Linux, and (untested) Mac OS X. It uses the Cusp library’s Krylov solvers to produce equivalent GPU (CUDA-based) versions of the standard OpenFOAM linear solvers:
- PCG – Preconditioned conjugate gradient solver for symmetric matrices (e.g., p)
- PBiCG – Preconditioned biconjugate gradient solver for asymmetric matrices (e.g., Ux, k)
ofgpu also has support for the OpenFOAM preconditioners:
For more details see “GPU Linear Solver Library for OpenFOAM”. OpenFOAM is a registered trademark of OpenCFD and is unaffiliated with Symscape.
Posted in Developer Resources | Tags: Fluid Simulation, Linear Algebra, Numerical Algorithms, OpenFOAM, Physics Simulation | Write a comment
February 13th, 2011
Abstract:
The paper discusses a fast implementation of the conjugate gradient iterative method with E-field multilevel preconditioner applied to solving real symmetric and sparse systems obtained with vector finite element method. In order to accelerate computations, a graphics processing unit (GPU) was used and significant speed-up (2.61 fold) was achieved comparing to a central processing unit (CPU) based approach. These results indicate that performance of electromagnetic simulations can be significantly improved thereby enabling full wave optimization of microwave components in more manageable time.
(A. Dziekonski, A. Lamecki and M. Mrozowski: “GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method”, IEEE Microwave and Wireless Components Letters 21(1) pp.1-3, Jan. 2011. [DOI])
Posted in Research | Tags: Linear Algebra, Numerical Algorithms, Papers | 3 Comments