Triangular matrix inversion on Graphics Processing Unit

February 6th, 2010

Abstract:

Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.

(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)

CULAtools: GPU-accelerated LAPACK

August 23rd, 2009

EM Photonics has recently released a preview beta edition of their CULAtools, an implementation of LAPACK for CUDA-enabled GPUs. This version comprises single precision LU decomposition, QR factorization, singular value decomposition and least squares. The full library, scheduled for release at NVIDIA GTC ‘09, will contain much more functionality and in particular single- and double-precision computations. Please refer to the website culatools.com for details, licenses and downloads.

MAGMA: LAPACK for GPUs and Multicore architectures

August 6th, 2009

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current “Multicore+GPU” systems.

The MAGMA research is based on the idea that, to address the complex challenges of the emerging hybrid environments, optimal software solutions will themselves have to hybridized, combining the strengths of different algorithms within a single framework. Building on this idea, the MAGMA group aims to design linear algebra algorithms and frameworks for hybrid manycore and GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers.

MAGMA v0.1 runs on CUDA-capable GPUs and multicore CPUs, and is available now.

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units

July 7th, 2009

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogenous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. The state-of-the art GPUs exhibit very high computational capabilities at a reasonable price.

Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA C for CUDA language and API, and works on all NVIDIA GPUs with CUDA support. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrate that the SpMV library provides two to four times improvement over the equivalent solution provided by the NVIDIA’s CUDPP library.

Along with the library, there is an IBM Research technical paper by Muthu Manikandan Baskaran andRajesh Bordawekar available, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. (Muthu Manikandan Baskaran and Rajesh Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. IBM Research Technical Paper RC24704, 2008.)

Optimizing Sparse Matrix-Vector Multiplication on GPUs

April 13th, 2009

In this paper, the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model are evaluated, and optimizations are proposed to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on the affinity towards optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data locality and reuse. The authors evaluate these optimizations on two classes of NVIDIA GPUs, namely, GeForce 8800 GTX and GeForce GTX 280, and compare the performance of their approach with that of existing parallel SpMV implementations such as (1) the SpMV library of Bell and Garland, (2) the CUDPP library, and (3) an implementation using an optimized segmented scan primitive. Their approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8, and achieves up to 15% improvement over Bell and Garland’s SpMV library (Dec 8, 2008 version).

(Muthu Manikandan Baskaran; Rajesh Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704. 2008.)

Efficient Sparse Matrix-Vector Multiplication on CUDA

April 13th, 2009

Abstract from an NVIDIA Technical Report by Nathan Bell and Michael Garland:

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while o ering alternatives which accommodate greater irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

(Nathan Bell and Michael Garland. “Efficient Sparse Matrix-Vector Multiplication on CUDA“.  NVIDIA Technical Report NVR-2008-004, December 2008.)

Concurrent Number Cruncher: a GPU implementation of a general sparse linear solver

October 16th, 2008

A wide class of numerical methods needs to solve a linear system, where the matrix pattern of non-zero coefficients can be arbitrary. These problems can greatly benefit from highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU dedicated APIs with high performance computing strategies (namely block compressed row storage, register blocking and vectorization), to implement a sparse general-purpose linear solver. This implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms by up to a factor of 6.0x leading-edge CPU counterparts, making it attractive for applications which are content with single precision. (Concurrent number cruncher – A GPU implementation of a general sparse linear solver. Luc Buatois, Guillaume Caumon and Bruno Lévy. International Journal of Parallel, Emergent and Distributed Systems. To Appear.)

CUDPP 1.0a Adds Segmented Scan and Sparse Matrix-Vector Multiplication

April 20th, 2008

Version 1.0 alpha of CUDPP, the CUDA Data-Parallel Algorithms Library, has been released. This version adds the segmented scan algorithm and sparse matrix-vector multiplication to CUDPP’s repertoire. Other new features include an improved “plan”-based configuration interface, an improved scan algorithm for higher performance, support for more inclusive scans and more scan operators, an improved stream compaction interface. In addition, CUDPP 1.0a adds support for CUDA 2.0 and the Windows Vista and Mac OS X (10.5.2 and higher) operating systems. CUDPP works with NVIDIA CUDA versions 1.1 and higher.

Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware

January 18th, 2008

Abstract:

The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster.

(Tobias Brandvik and Graham Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit. January, 2008.)

Using GPUs to Improve Multigrid Solver Performance on a Cluster

November 15th, 2007

This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)

Page 1 of 3123