Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units

July 7th, 2009

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogeneous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. State-of-the-art GPUs exhibit very high computational capabilities at a reasonable price.

Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used in a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for NVIDIA GPUs. It is developed using the NVIDIA C for CUDA language and API, and works on all NVIDIA GPUs with CUDA support. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming behind an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally to use special storage representations. Experimental evaluation demonstrates that the SpMV library provides a two- to four-fold improvement over the equivalent solution provided by NVIDIA's CUDPP library.
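To make the compressed row storage format mentioned above concrete, here is a minimal sequential sketch in Python. It is not SpMV4GPU's interface or implementation; the function name `spmv_csr` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x for a matrix A held in compressed row storage.

    row_ptr[i] .. row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx[k] is the column of nonzero k, values[k] is its value.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    # Rows are independent of each other, which is why a GPU version
    # could hand each row to its own thread or group of threads.
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example matrix:
#   [4 0 1]
#   [0 2 0]
#   [3 0 5]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

The indirect read `x[col_idx[k]]` is the irregular memory access that makes this kernel hard to run efficiently on parallel hardware.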

Along with the library, an IBM Research technical paper by Muthu Manikandan Baskaran and Rajesh Bordawekar is available: “Optimizing Sparse Matrix-Vector Multiplication on GPUs”. (Muthu Manikandan Baskaran and Rajesh Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs”. IBM Research Technical Paper RC24704, 2008.)

CUDPP 1.1 Now Available

July 1st, 2009

Release 1.1 of the CUDA Data-Parallel Primitives Library (CUDPP) is now available for download.  The two major new features in CUDPP 1.1 are a very fast new radix sort implementation with support for sorting key-value pairs (with float or unsigned integer keys); and a new pseudorandom number generator, cudppRand. CUDPP 1.1 also replaces its former custom license with the standard BSD license. This greatly simplifies the CUDPP license details, and it also enables CUDPP to move into a public source repository such as Google Code in the near future. For more information, visit the CUDPP Website.
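For readers unfamiliar with key-value radix sorting, the following is a minimal sequential sketch of a least-significant-digit radix sort on unsigned integer keys that carries values along with their keys. It is not CUDPP's implementation or API; the function name `radix_sort_pairs` and its parameters are assumptions made for illustration.

```python
def radix_sort_pairs(keys, values, key_bits=32, radix_bits=4):
    """Sort (key, value) pairs by unsigned integer key, least significant digit first."""
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1
    pairs = list(zip(keys, values))
    for shift in range(0, key_bits, radix_bits):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * num_buckets
        for key, _ in pairs:
            counts[(key >> shift) & mask] += 1
        # 2. Exclusive prefix sum turns the counts into bucket start offsets.
        offsets = [0] * num_buckets
        running = 0
        for digit in range(num_buckets):
            offsets[digit] = running
            running += counts[digit]
        # 3. Stable scatter: pairs with equal digits keep their relative order.
        out = [None] * len(pairs)
        for key, value in pairs:
            digit = (key >> shift) & mask
            out[offsets[digit]] = (key, value)
            offsets[digit] += 1
        pairs = out
    return [k for k, _ in pairs], [v for _, v in pairs]


keys = [170, 45, 75, 90, 2, 802, 24, 66]
vals = ["a", "b", "c", "d", "e", "f", "g", "h"]
sorted_keys, sorted_vals = radix_sort_pairs(keys, vals)
```

Each pass consists of a histogram, a prefix sum, and a scatter, and each of those steps can be expressed with data-parallel primitives.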

Numerical Precision: How Much is Enough?

June 30th, 2009

A ScientificComputing.com article by Rob Farber explores the topic of numerical precision in the context of future exascale computing, asking the question “how do we know that anything we compute is correct?”  The discussion centers around processors such as GPUs which provide both single- and double-precision computation but at different throughput levels. “Taking a multi-precision approach can enhance the accuracy of a calculation and justify the use of mainly single-precision arithmetic (for performance) along with the occasional use of double-precision (64-bit) arithmetic for precision-sensitive operations,” writes Farber. (Rob Farber. “Numerical Precision: How Much is Enough?” ScientificComputing.com.  Accessed July 1, 2008.)
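The multi-precision idea in the quote can be illustrated with a small experiment: sum single-precision inputs once with a single-precision accumulator and once with a double-precision accumulator. This is a sketch, not code from the article; the helper `to_f32` emulates single-precision rounding with the standard `struct` module, and all names are assumptions.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(inputs):
    """Accumulate entirely in single precision."""
    acc = 0.0
    for v in inputs:
        acc = to_f32(acc + v)  # round the running sum after every addition
    return acc


def sum_mixed(inputs):
    """Single-precision inputs, double-precision accumulator."""
    acc = 0.0
    for v in inputs:
        acc += v
    return acc


n = 100_000
inputs = [to_f32(0.1)] * n           # data stored in single precision
reference = math.fsum(inputs)        # correctly rounded sum of those inputs

err_single = abs(sum_single(inputs) - reference)
err_mixed = abs(sum_mixed(inputs) - reference)
print("single-precision accumulator error:", err_single)
print("double-precision accumulator error:", err_mixed)
```

Only the accumulator is promoted to double precision here, so the bulk of the data stays in the cheaper format while the precision-sensitive operation gets the extra bits.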

CuPP – A framework for easy CUDA integration

June 26th, 2009

Abstract:

This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA’s GPGPU system, CUDA, into existing C++ applications. CuPP provides interfaces to reoccurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures. CuPP offers both a low level interface — mostly consisting of smart pointers and memory allocation functions for GPU memory — and a high level interface offering a C++ STL vector wrapper and the so-called type transformations. The wrapper can be used by both device and host to automatically keep data in sync. The type transformations allow developers to write their own data structures offering the same functionality as the CuPP vector, in case a vector does not conform to the need of the application. Furthermore the type transformations offer a way to have two different representations for the same data at host and device, respectively. We demonstrate the benefits of using CuPP by integrating it into an example application, the open-source steering library OpenSteer. In particular, for this application we develop a uniform grid data structure to solve the k-nearest neighbor problem that deploys the type transformations. The paper finishes with a brief outline of another CUDA application, the Einstein@Home client, which also requires data structure redesign and thus may benefit from the type transformations and future work on CuPP.

(Jens Breitbart:  CuPP – A framework for easy CUDA integration, HiPS 2009 workshop with IPDPS 2009, Rome, Italy, May 2009)

ISC 2009 CUDA/OpenCL Tutorial Slides Posted

June 25th, 2009

A tutorial on High Performance Computing with CUDA was held at the International Supercomputing Conference (ISC) in Hamburg on Monday, June 22nd, 2009.  The tutorial included an introduction to the CUDA programming model and C for CUDA, along with details on the CUDA Toolkit, Libraries, and optimization.  The tutorial also provided an introduction to OpenCL, and finished with a case study on Computational Fluid Dynamics by Dr. Graham Pullan from Cambridge University.  Slides from the tutorial are now posted here on GPGPU.org.

(Massimiliano Fatica, Timo Stich, and Graham Pullan.  High Performance Computing with CUDA.  Tutorial.  International Supercomputing Conference 2009.  Hamburg, Germany.)

Efficient parallel scan algorithms for GPUs

June 24th, 2009

This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs.   This paper describes the primitives included in the latest release of the CUDPP library.

Abstract:

Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.

(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)
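As a sequential illustration of the primitives the abstract refers to, the sketch below shows an inclusive scan, a blocked scan that builds the full result from independent per-block scans (the blocks loosely stand in for warps), and a segmented scan driven by head flags. It is not the code from the report or from CUDPP; the function names and the `block` parameter are assumptions.

```python
def inclusive_scan(data):
    """Running sum: out[i] = data[0] + ... + data[i]."""
    out = []
    running = 0
    for v in data:
        running += v
        out.append(running)
    return out


def blocked_inclusive_scan(data, block=4):
    """Build a full scan from independent per-block scans."""
    # 1. Scan each block on its own (these scans could run in parallel).
    partial = [inclusive_scan(data[i:i + block]) for i in range(0, len(data), block)]
    # 2. Scan the block totals to find the offset each block must add.
    totals = [p[-1] for p in partial]
    offsets = [0] + inclusive_scan(totals)[:-1]
    # 3. Add each block's offset to its partial results.
    return [v + off for p, off in zip(partial, offsets) for v in p]


def segmented_inclusive_scan(data, head_flags):
    """Inclusive scan that restarts wherever head_flags[i] marks a new segment."""
    out = []
    running = 0
    for v, is_head in zip(data, head_flags):
        if is_head:
            running = 0
        running += v
        out.append(running)
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]
flags = [1, 0, 0, 1, 0, 1, 0, 0]   # segments: [3, 1, 7], [0, 4], [1, 6, 3]
full = blocked_inclusive_scan(data)
segmented = segmented_inclusive_scan(data, flags)
```

The three-step structure of `blocked_inclusive_scan` is the usual way a large scan is decomposed for parallel execution: small local scans, a scan over their totals, then a uniform add.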

Libra SDK: C/C++ for both the CPU and GPU

June 24th, 2009

GPU Systems has announced the Libra SDK, a robustly equipped C/C++ developer kit for fast and easy cross CPU-GPU access suited for scientific computations. The Libra 1.1 SDK includes a C/C++ Matlab-style API, sample programs and documentation. A downloadable trial version of Libra is available from the GPU Systems website, and a Libra demo presentation is also available.

PGI and NVIDIA Team To Deliver CUDA Fortran Compiler

June 24th, 2009

Yesterday the Portland Group and NVIDIA announced plans to develop new Fortran language support for CUDA GPUs.  The pair will release the Fortran language specification for CUDA GPUs at the International Supercomputing Conference in Hamburg, Germany this week. The CUDA Fortran compiler will be added to a production release of the PGI Fortran compilers scheduled for availability in November 2009.

From the press release:

The Portland Group®, a wholly-owned subsidiary of STMicroelectronics and leading supplier of compilers for high-performance computing (HPC), today announced an agreement with NVIDIA under which the two companies plan to develop new Fortran language support for CUDA GPUs.

The NVIDIA® CUDA™ architecture allows developers to offload computationally intensive kernels to the massively parallel GPU. Through function calls and language extensions, CUDA gives developers explicit control over the mapping of general-purpose computational kernels to GPUs as well as placement and movement of data between the x64 processor and the GPU. The NVIDIA CUDA C compiler already provides this capability to C programmers. The CUDA Fortran compiler will provide this same level of control and optimization in a native Fortran environment from PGI.

New PGI 9.0 Compilers Simplify x64+GPU Programming

June 24th, 2009

Yesterday The Portland Group announced the release of version 9.0 of its Fortran and C compilers with support for GPUs and x64 multi-core CPUs.  An introduction to PGI Accelerator Fortran and C programming is available online, as is the PGI Accelerator v1.0 specification. Evaluation copies of the new PGI 9.0 compilers are available from The Portland Group web site. Registration is required.

From the press release:

The use of Graphics Processing Units (GPUs) as general purpose accelerators has been a growing trend in high-performance computing (HPC). Until now, use of GPUs from Fortran applications has been extremely limited. Developers targeting GPU accelerators have had to program in C at a detailed level using sequences of function calls to manage movement of data between the x64 host and GPU, and to offload computations from the host to the GPU. The PGI Accelerator Fortran and C compilers automatically analyze whole program structure and data, split portions of an application between a multi-core x64 CPU and a GPU as specified by user directives, and define and generate a mapping of loops to automatically use the parallel cores, hardware threading capabilities and SIMD vector capabilities of modern GPUs.


GPGPU Paper Wins Best Paper Award at HPCS’09

June 23rd, 2009

The paper Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster by Rached Abdelkhalek, Henri Calandra, Olivier Coulaud, Jean Roman and Guillaume Latu has earned the Best Paper Award at High Performance Computing and Simulation 2009, held June 21-24 in Leipzig, Germany.

This paper was presented in the Workshop on Architecture-Aware Simulation and Computing, organized by Michael Bader and Josef Weidendorfer (Technische Universität München). Three other GPGPU papers were also part of this workshop.

