You are here: Home » Archives for Linear Algebra
July 7th, 2009
Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogenous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. The state-of-the art GPUs exhibit very high computational capabilities at a reasonable price.
Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA C for CUDA language and API, and works on all NVIDIA GPUs with CUDA support. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrate that the SpMV library provides two to four times improvement over the equivalent solution provided by the NVIDIA’s CUDPP library.
Along with the library, there is an IBM Research technical paper by Muthu Manikandan Baskaran andRajesh Bordawekar available, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. (Muthu Manikandan Baskaran and Rajesh Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs“. IBM Research Technical Paper RC24704, 2008.)
Posted in Developer Resources, Research | Tags: Libraries, Linear Algebra, NVIDIA CUDA, Sparse Linear Systems | Write a comment
April 13th, 2009
In this paper, the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model are evaluated, and optimizations are proposed to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on the affinity towards optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data locality and reuse. The authors evaluate these optimizations on two classes of NVIDIA GPUs, namely, GeForce 8800 GTX and GeForce GTX 280, and compare the performance of their approach with that of existing parallel SpMV implementations such as (1) the SpMV library of Bell and Garland, (2) the CUDPP library, and (3) an implementation using an optimized segmented scan primitive. Their approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8, and achieves up to 15% improvement over Bell and Garland’s SpMV library (Dec 8, 2008 version).
(Muthu Manikandan Baskaran; Rajesh Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704. 2008.)
Posted in Research | Tags: Linear Algebra, Papers, Sparse Linear Systems | Write a comment
April 13th, 2009
Abstract from an NVIDIA Technical Report by Nathan Bell and Michael Garland:
The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.
In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while o ering alternatives which accommodate greater irregularity.
On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.
(Nathan Bell and Michael Garland. “Efficient Sparse Matrix-Vector Multiplication on CUDA“. NVIDIA Technical Report NVR-2008-004, December 2008.)
Posted in Research | Tags: Linear Algebra, Papers, Sparse Linear Systems | Write a comment
October 16th, 2008
A wide class of numerical methods needs to solve a linear system, where the matrix pattern of non-zero coefficients can be arbitrary. These problems can greatly benefit from highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU dedicated APIs with high performance computing strategies (namely block compressed row storage, register blocking and vectorization), to implement a sparse general-purpose linear solver. This implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms by up to a factor of 6.0x leading-edge CPU counterparts, making it attractive for applications which are content with single precision. (Concurrent number cruncher – A GPU implementation of a general sparse linear solver. Luc Buatois, Guillaume Caumon and Bruno Lévy. International Journal of Parallel, Emergent and Distributed Systems. To Appear.)
Posted in Research | Tags: Linear Algebra, Papers | Write a comment
April 20th, 2008
Version 1.0 alpha of CUDPP, the CUDA Data-Parallel Algorithms Library, has been released. This version adds the segmented scan algorithm and sparse matrix-vector multiplication to CUDPP’s repertoire. Other new features include an improved “plan”-based configuration interface, an improved scan algorithm for higher performance, support for more inclusive scans and more scan operators, an improved stream compaction interface. In addition, CUDPP 1.0a adds support for CUDA 2.0 and the Windows Vista and Mac OS X (10.5.2 and higher) operating systems. CUDPP works with NVIDIA CUDA versions 1.1 and higher.
Posted in Developer Resources | Tags: Data-Parallel, Linear Algebra, NVIDIA CUDA | Write a comment
January 18th, 2008
Abstract:
The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster.
(Tobias Brandvik and Graham Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit. January, 2008.)
Posted in Research | Tags: Linear Algebra, Papers, Physics Simulation | Write a comment
November 15th, 2007
This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
Posted in Research | Tags: Clusters, Linear Algebra, Papers, Supercomputing | Write a comment
November 4th, 2006
This white paper from RapidMind and HP compares the performance of BLAS dense linear algebra operations, the FFT, and European option pricing on the GPU against highly tuned CPU implementations on the fastest available CPUs. All of the GPU implementations were made using the RapidMind Development Platform, which allows the use of standard C++ programming to create high-performance parallel applications that run on the GPU. The full source for the samples is available in conjunction with a new beta version of the RapidMind development platform. The results will also be presented as a poster at SC06.
(http://rapidmind.net/sc06_hp_rapidmind_cpugpu_summary.php)
Posted in Business, Research | Tags: Benchmarks, BLAS, Computational Finance, FFT, Linear Algebra, RapidMind | Write a comment
October 4th, 2006
This Supercomputing 2006 paper by Govindaraju et al. presents a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). The memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. It incorporates many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and uses the 3C’s model to analyze the cache misses. Moreover, the paper presents techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of the model, the paper highlights its performance on three memory-intensive scientific applications: sorting, Fast Fourier Transform and dense matrix multiplication. In practice, their cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. The paper also compares its results with prior GPU-based and CPU-based implementations on high-end processors. In practice, they are able to achieve 2x-5x performance improvement. (A Memory Model for Scientic Algorithms on Graphics Processors)
Posted in Research | Tags: Cache, FFT, Linear Algebra, Memory Models, Papers, Sorting | Write a comment
June 6th, 2006
Rapid evolution of GPUs in performance, architecture, and programmability provides general and scientific computational potential beyond their primary purpose, graphics processing. This work presents an efficient algorithm for solving symmetric and positive definite linear systems using the GPU. Using the decomposition algorithm and other basic building blocks for linear algebra on the GPU, the paper demonstrates a GPU-powered linear program solver based on a Primal-Dual Interior-Point Method. (Cholesky Decomposition and Linear Programming on a GPU, Jin Hyuk Jung, Scholarly Paper Directed by Dianne P. O’Leary, Department of Computer Science, University of Maryland, 2006.)
Posted in Research | Tags: Linear Algebra, Papers | Write a comment