SMVM on GPU

July 29th, 2010

From the paper’s abstract:

A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.

From the paper’s conclusion:

We have introduced several efficient techniques to accelerate the execution of the sparse matrix vector multiplication (SMVM) on NVIDIA graphic processing units. The proposed methods increased the performance of the SMVM kernel on GT 8800 up to 18.8 times compared to the quad core CPU and 3 times compared to previous work by Bell and Garland on accelerating SMVM for GPUs.

(M. Mehri Dehnavi, D. Fernandez and D. Giannacopoulos: “Finite element sparse matrix vector multiplication on GPUs”. IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, August 2010. DOI 10.1109/TMAG.2010.2043511)

SpeedIT Toolkit 0.9.1 released

March 26th, 2010

The SpeedIT Tools library provides a set of accelerated solvers for sparse linear systems of equations. Manifold acceleration, e.g. more than an order of magnitude, is achieved with a single reasonably priced NVIDIA Graphics Processing Unit (GPU) that supports CUDA and proprietary advanced optimization techniques. The library can be used in a wide spectrum of domains arising from problems with underlying 2D and 3D geometry, such as computational fluid dynamics, electro-magnetics, thermodynamics, materials, acoustics, computer vision and graphics, robotics, semiconductor devices and structural engineering. The library can be also used for problems without defined geometry such as quantum chemistry, statistics, power networks and other graphs and chemical process simulation. All computations are performed with single or double floating point precision. Two version of SpeedIT toolkit have been released: The classic version provides a conjugate gradient solver, and the extreme edition provides optimized CG, BiCGSTAB, diagonal preconditioner, memory management, and heuristic-based analysis of input matrices.

OpenNL 3.0: CUDA sparse linear solvers

February 14th, 2010

OpenNL (Open Numerical Library) is a library for solving sparse linear systems, especially designed for the Computer Graphics community. The goal of OpenNL is to be as small as possible, while offering the subset of functionalities required by this application field. The Makefiles of OpenNL can generate a single .c and .h file that make it very easy to integrate into other projects. The distribution includes an implementation of a Least Squares Conformal Maps parameterization method.  The new version 3.0 of OpenNL includes support for CUDA (with Concurrent Number Cruncher and CUSP ELL formats).

Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

August 23rd, 2009

Abstract:

Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.

(“Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors“. Nathan Bell and Michael Garland, in “Proc. Supercomputing ’09″, August 2009.)

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units

July 7th, 2009

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogenous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. The state-of-the art GPUs exhibit very high computational capabilities at a reasonable price.

Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA C for CUDA language and API, and works on all NVIDIA GPUs with CUDA support. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrate that the SpMV library provides two to four times improvement over the equivalent solution provided by the NVIDIA’s CUDPP library.

Along with the library, there is an IBM Research technical paper by Muthu Manikandan Baskaran andRajesh Bordawekar available, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. (Muthu Manikandan Baskaran and Rajesh Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. IBM Research Technical Paper RC24704, 2008.)

Optimizing Sparse Matrix-Vector Multiplication on GPUs

April 13th, 2009

In this paper, the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model are evaluated, and optimizations are proposed to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on the affinity towards optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data locality and reuse. The authors evaluate these optimizations on two classes of NVIDIA GPUs, namely, GeForce 8800 GTX and GeForce GTX 280, and compare the performance of their approach with that of existing parallel SpMV implementations such as (1) the SpMV library of Bell and Garland, (2) the CUDPP library, and (3) an implementation using an optimized segmented scan primitive. Their approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8, and achieves up to 15% improvement over Bell and Garland’s SpMV library (Dec 8, 2008 version).

(Muthu Manikandan Baskaran; Rajesh Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704. 2008.)

Efficient Sparse Matrix-Vector Multiplication on CUDA

April 13th, 2009

Abstract from an NVIDIA Technical Report by Nathan Bell and Michael Garland:

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while o ering alternatives which accommodate greater irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

(Nathan Bell and Michael Garland. “Efficient Sparse Matrix-Vector Multiplication on CUDA“.  NVIDIA Technical Report NVR-2008-004, December 2008.)