## Workshop on Architecture-aware Simulation and Computing

April 15th, 2009

The 2009 workshop on Architecture-aware Simulation and Computing, held in conjunction with the 2009 International Conference on High Performance Computing & Simulation (HPCS 2009), will include a couple of talks on GPU computing. Please see the workshop website for more information. Registration information and the full conference program will be available soon.

## Minisymposium and Tutorial on GPU Computing at PPAM 2009

April 15th, 2009

The paper deadline for the Minisymposium on GPU Computing at the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM 2009) has been extended to April 30. The minisymposium is organized by Jose R. Herrero, Enrique S. Quintana-Orti and Robert Strzodka, and will take place September 13-16, 2009, in Wroclaw, Poland.

PPAM is also happy to announce a full day tutorial on GPU Computing, organized by Robert Strzodka and Dominik Göddeke. The program and list of speakers will be available soon.

## eResearch South Australia Workshop: High Performance GPU Computing with NVIDIA CUDA

April 14th, 2009

This workshop, hosted by eResearch SA and to be presented by Mark Harris (NVIDIA) with Dragan Dimitrovici (Xenon Systems), aims to provide a detailed introduction to GPU computing with CUDA and NVIDIA GPUs such as the Tesla series of high-performance computing processors.

The workshop will be held from 9:00-13:00 on Tuesday 28th April, in the Henry Ayers Room, Ayers House, 288 North Terrace, Adelaide (opposite the Royal Adelaide Hospital).

CUDA is NVIDIA’s revolutionary parallel computing architecture for GPUs. The available software tools include a C compiler for developers to build applications, as well as useful libraries for high-performance computing (BLAS, FFT, etc). Several widely-used scientific applications have been ported to run on GPUs using CUDA. This half-day workshop will provide an introduction to the CUDA architecture, programming model, and the programming environment of C for CUDA, as well as an overview of the Tesla GPU architecture, a live programming demo, and strategies for optimizing CUDA applications for the GPU. The workshop will also include a brief presentation of some of the current NVIDIA hardware offerings for GPU computing using CUDA.

The workshop is free, but space is limited. For complete details and registration, visit the workshop web page or download the brochure.

## Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware

April 13th, 2009

Abstract from the paper:

We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.

(Owen Harrison, John Waldron. Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware. AfricaCrypt 2009, June 21-25, 2009, Gammarth, Tunisia. To Appear.)
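The core primitive the paper accelerates is modular exponentiation on large integers. As a point of reference (not the paper's GPU implementation, which uses radix and residue-number-system representations), the standard square-and-multiply algorithm can be sketched as:

```python
def modexp(base, exponent, modulus):
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                     # multiply step for each set bit
            result = (result * base) % modulus
        base = (base * base) % modulus       # square step
        exponent >>= 1
    return result

# Agrees with Python's built-in three-argument pow:
assert modexp(7, 560, 561) == pow(7, 560, 561)
```

For a 1024-bit RSA exponent this loop runs roughly a thousand squarings, each on multi-word integers; the paper's contribution is mapping that multi-word arithmetic efficiently onto GPU threads.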

## Optimizing Sparse Matrix-Vector Multiplication on GPUs

April 13th, 2009

In this paper, the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model are evaluated, and optimizations are proposed to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on the affinity towards optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data locality and reuse. The authors evaluate these optimizations on two classes of NVIDIA GPUs, namely, GeForce 8800 GTX and GeForce GTX 280, and compare the performance of their approach with that of existing parallel SpMV implementations such as (1) the SpMV library of Bell and Garland, (2) the CUDPP library, and (3) an implementation using an optimized segmented scan primitive. Their approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8, and achieves up to 15% improvement over Bell and Garland’s SpMV library (Dec 8, 2008 version).

(Muthu Manikandan Baskaran and Rajesh Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704, 2008.)
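The kernel being optimized is the standard CSR (compressed sparse row) matrix-vector product. A minimal serial reference, for readers unfamiliar with the operation (the paper's contribution is mapping its inner loop onto GPU threads and memory, not this algorithm itself):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Reference (serial) CSR sparse matrix-vector product y = A @ x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # On a GPU, rows (or groups of rows) map to threads or warps;
        # the thread-mapping and memory optimizations target this loop.
        for j in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[j] * x[col_idx[j]]
    return y

# 2x2 example: A = [[10, 0], [3, 9]]
values, col_idx, row_ptr = [10.0, 3.0, 9.0], [0, 0, 1], [0, 1, 3]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0]))  # [10.0, 21.0]
```

The irregular, input-dependent access to `x[col_idx[j]]` is exactly why SpMV is memory-bound and why the paper's thread mapping and off-chip access optimizations matter.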

## Monte Carlo simulations on Graphics Processing Units

April 13th, 2009

Abstract:

Implementation of basic local Monte-Carlo algorithms on ATI Graphics Processing Units (GPU) is investigated. The Ising model and pure SU(2) gluodynamics simulations are realized with the Compute Abstraction Layer (CAL) of the ATI Stream environment using the Metropolis and the heat-bath algorithms, respectively. We present an analysis of both the CAL programming model and the efficiency of the corresponding simulation algorithms on GPU. In particular, a significant performance speed-up of these algorithms in comparison with serial execution is observed.

(Vadim Demchik, Alexei Strelchenko. Monte Carlo simulations on Graphics Processing Units. arXiv:0903.3053 [hep-lat].)
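For context, the Metropolis update used for the Ising model is a simple local rule, which is what makes it amenable to GPU parallelization. A minimal serial sketch for a 1D Ising chain (the paper works with ATI CAL and higher-dimensional lattices; this only illustrates the accept/reject rule):

```python
import math
import random

def metropolis_sweep(spins, beta, rng=random):
    """One Metropolis sweep of a 1D Ising chain with periodic boundaries."""
    n = len(spins)
    for _ in range(n):
        i = rng.randrange(n)
        # Energy change from flipping spin i (coupling J = 1, no field).
        dE = 2 * spins[i] * (spins[(i - 1) % n] + spins[(i + 1) % n])
        # Accept the flip if it lowers energy, else with Boltzmann probability.
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            spins[i] = -spins[i]
    return spins

spins = [1] * 32
metropolis_sweep(spins, beta=0.5)
```

Because each update touches only nearest neighbors, independent lattice sites (e.g. a checkerboard decomposition) can be updated concurrently, which is the locality the GPU implementation exploits.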

## Molecular dynamics on NVIDIA GPUs with speed-ups up to two orders of magnitude

April 13th, 2009

ACEMD is a production-class bio-molecular dynamics (MD) simulation program designed specifically for GPUs which is able to achieve supercomputing-scale performance of 40 nanoseconds/day for all-atom protein systems with over 23,000 atoms. With GPU technology it has become possible to run a microsecond-long trajectory for an all-atom molecular system in explicit water on a single workstation computer equipped with just 3 GPUs. This performance would have required over 100 CPU cores. Visit the project website for details.

(M. J. Harvey, G. Giupponi, G. De Fabritiis, ACEMD: Accelerating bio-molecular dynamics in the microsecond time-scale. Link to preprint.)

## Path to Petascale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters

April 13th, 2009

The workshop “Path to PetaScale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters” was held at the National Center for Supercomputing Applications (NCSA), University of Illinois Urbana-Champaign, on April 2-3, 2009. This workshop, sponsored by NSF and NCSA, helped computational scientists in the geosciences, computational chemistry, and astronomy and astrophysics communities take full advantage of emerging high-performance computing accelerators such as GPUs and Cell processors. The workshop consisted of joint technology sessions during the first day and domain-specific sessions on the second day. Slides from the presentations are now online.

## Second SHARCNET Symposium on GPU and Cell Computing

April 13th, 2009

*University of Waterloo, Waterloo, Ontario, Canada, May 20th, 2009*

This one-day symposium will explore the use of GPUs and Cell processors for accelerating scientific and high performance computing. The symposium program includes invited keynote presentations on large-scale fluid dynamics simulations using the Roadrunner supercomputer and acceleration of biomolecular modeling applications with GPU computing, as well as vendor research presentations from IBM, NVIDIA and RapidMind. Researchers working with these architectures are invited to contribute presentations and posters.

For further information and to register please visit the event website.

## Efficient Sparse Matrix-Vector Multiplication on CUDA

April 13th, 2009

Abstract from an NVIDIA Technical Report by Nathan Bell and Michael Garland:

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

(Nathan Bell and Michael Garland. “Efficient Sparse Matrix-Vector Multiplication on CUDA”. NVIDIA Technical Report NVR-2008-004, December 2008.)
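One of the compact storage formats commonly used for the well-structured matrices the abstract mentions is ELLPACK (ELL), which pads every row to the same length so values can be stored as a dense block and read with regular, coalesced access on the GPU. A small conversion sketch (illustrative only; the report covers several formats and hybrids):

```python
def csr_to_ell(values, col_idx, row_ptr, pad_col=0):
    """Convert a CSR matrix to ELLPACK: pad every row to the max row length.

    ELL stores an (n_rows x K) dense block; when traversed column-major on a
    GPU, threads assigned to consecutive rows read consecutive addresses."""
    n_rows = len(row_ptr) - 1
    k = max(row_ptr[r + 1] - row_ptr[r] for r in range(n_rows))
    ell_vals = [[0.0] * k for _ in range(n_rows)]
    ell_cols = [[pad_col] * k for _ in range(n_rows)]
    for r in range(n_rows):
        for slot, j in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_vals[r][slot] = values[j]
            ell_cols[r][slot] = col_idx[j]
    return ell_vals, ell_cols

# 2x2 example: A = [[10, 0], [3, 9]] -> every row padded to 2 entries
vals, cols = csr_to_ell([10.0, 3.0, 9.0], [0, 0, 1], [0, 1, 3])
print(vals)  # [[10.0, 0.0], [3.0, 9.0]]
```

The padding is wasted work when row lengths are very uneven, which is why irregular matrices favor CSR-style kernels or hybrid schemes instead.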