This paper by Anderson et al. at Caltech describes a method for accelerating Quantum Monte Carlo (QMC) on GPUs. QMC is among the most accurate (and expensive) methods in the quantum chemistry zoo. The work primarily investigates tricks that speed up the matrix multiplications at the core of the algorithm. Because QMC is a statistical algorithm, the authors studied the performance gains available when multiplying many matrices simultaneously. The paper also explores the Kahan summation formula to improve the accuracy of GPU matrix multiplication. (Quantum Monte Carlo on Graphical Processing Units. Amos G. Anderson, William A. Goddard III, Peter Schröder. *Computer Physics Communications*)
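
To illustrate the Kahan summation idea mentioned in the Quantum Monte Carlo item, here is a minimal sketch in plain Python. It is not the authors' GPU code: single precision is emulated by rounding through `struct`, and the helper names (`f32`, `naive_sum_f32`, `kahan_sum_f32`) are hypothetical. The same compensation step could be applied to the running dot products inside a matrix multiplication.

```python
import math
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Accumulate in emulated single precision with no error compensation."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def kahan_sum_f32(values):
    """Kahan (compensated) summation in emulated single precision."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)          # apply the correction to the next term
        t = f32(total + y)              # low-order bits of y may be lost here
        comp = f32(f32(t - total) - y)  # recover what was lost
        total = t
    return total


values = [0.1] * 100_000
reference = math.fsum(f32(v) for v in values)  # accurate double-precision sum

print("naive error:", abs(naive_sum_f32(values) - reference))
print("kahan error:", abs(kahan_sum_f32(values) - reference))
```

The compensated version costs a few extra additions per term, which is often an acceptable trade on hardware where single-precision arithmetic is cheap and double precision is unavailable.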

## Quantum Monte Carlo on GPUs

September 10th, 2007

## gDEBugger LINUX – Public Beta Available!

September 4th, 2007

gDEBugger is an OpenGL debugger and profiler. It provides the application behavior information a developer needs to find bugs and to optimize application performance. gDEBugger Linux brings all of gDEBugger’s debugging and profiling abilities to the Linux OpenGL developers’ world. gDEBugger Linux is now available as a final beta version. This version includes all of gDEBugger’s features and supports the Linux i386 and x86_64 architectures. The official version of gDEBugger Linux will be released shortly after Graphic Remedy receives feedback from the field and fixes any reported issues. (http://www.gremedy.com/gDEBuggerLinux.php)

## Graphic processors to speed-up simulations for the design of high performance solar receptors

September 4th, 2007

This paper by Collange et al. at Université de Perpignan, France, describes a prototype to be integrated into simulation codes that estimate temperature, velocity and pressure to design next-generation solar receptors. Such codes delegate to GPUs the computation of heat transfer due to radiation. The authors use Monte-Carlo line-by-line ray-tracing through finite volumes, which amounts to data-parallel arithmetic transformations on large data structures. Performance on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) shows speedups of more than 400 times over CPU implementations, while leaving most CPU computing resources available. As there were some questions pending about the accuracy of the operators implemented in GPUs, the authors start this report with a survey and some contributed tests on the various floating point units available on GPUs. (Graphic processors to speed-up simulations for the design of high performance solar receptors. S. Collange, M. Daumas, D. Defour. *Proceedings of the IEEE 18th International Conference on Application-specific Systems, Architectures and Processors*.)

## CUDA Tutorial at Supercomputing 2007

August 22nd, 2007

On Sunday, November 11, 2007, at SC07 in Reno, NVIDIA will host a full-day tutorial on CUDA. In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming and the execution and memory models at its heart, motivate the use of CUDA with many brief examples from different HPC domains, and discuss fundamental algorithmic building blocks in CUDA. The afternoon session will cover advanced issues such as optimization and “tips & tricks”, and include real-world case studies from domain scientists using CUDA (VMD and NAMD Molecular Dynamics and Oil and Gas).

Follow this link for more information: http://sc07.supercomputing.org/schedule/event_detail.php?evid=11034.

## Workshop on General Purpose Processing Using GPUs

August 19th, 2007

Northeastern University

Boston, MA USA

October 4, 2007

Overview: The goal of this workshop is to provide a forum for general-purpose GPU programming environments and platforms, as well as to discuss applications that have been able to harness the horsepower provided by these platforms. This year’s workshop is particularly interested in imaging applications. Papers are being sought on many aspects of GPUs, including (but not limited to):

- GPU applications
- GPU software and operating systems
- GPU programming environments
- GPU power/efficiency
- GPU architectures
- GPU benchmarking/measurements

Paper Submissions: Authors should submit an 8-page paper in IEEE double-column style to gpgpu@ece.neu.edu.

Industry Participation: The workshop encourages participation by GPU manufacturers, software vendors, or companies which develop or market products used by the GPU community. Any company interested in participating in the workshop should contact the workshop organizer at gpgpu@ece.neu.edu.

Important Dates:

Paper submission: August 28, 2007

Author notification: September 7, 2007

Final paper: September 14, 2007

Copies of final papers will be made available at the workshop. In addition, selected papers will be invited to be part of a special issue of an ACM or IEEE journal or magazine.

For more information, see the GPGPU 2007 web page.

## Graphics Hardware 2007 Papers

August 16th, 2007

On 4-5 August 2007, San Diego hosted the annual Graphics Hardware conference. GPGPU figured prominently in three papers:

- As transistors get smaller, their transient failure rates increase. Future architectures must adapt to address the resulting reliability problems. Jeremy Sheaffer presented a paper demonstrating a hardware-based redundancy approach to ensure reliability on GPGPU applications. (“A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors”. Jeremy Sheaffer, University of Virginia; David Luebke, NVIDIA Research; Kevin Skadron, University of Virginia.)
- Magnus Strengert presented a generic, minimally intrusive GLSL debugger that operates transparently to the application. Shader debugging is performed at the per-draw-call level, and the debugger allows single-stepping and inspection of arbitrary variable content. Linux code is available and Windows code is expected by the end of the year. (“A Hardware-Aware Debugger for the OpenGL Shading Language”. Magnus Strengert, Thomas Klein, and Thomas Ertl, University of Stuttgart.)
- One critical need for GPGPU developers is a library of general-purpose building blocks for GPU computation. Shubhabrata Sengupta presented a paper describing a GPU implementation of the “scan primitives” and their use in novel GPU implementations of quicksort, efficient sparse matrix-vector multiplication, and tridiagonal matrix systems. This paper won the Best Paper award and the authors are preparing an open-source release. (“Scan Primitives for GPU Computing”. Shubhabrata Sengupta, UC Davis; Mark Harris, NVIDIA Corporation; Yao Zhang, UC Davis; John D. Owens, UC Davis.)

All Graphics Hardware 2007 papers are available in the ACM digital library. In addition, the GH07 program page contains slides for all talks as well as two keynote talks (Chas. Boyd of the Microsoft DirectX team: “Mass Market Applications of Data-Parallel Computing” and Michael Jones, chief technologist of Google Earth: “GPUs for the true mass market”) and vendor talks from AMD and NVIDIA about their latest processors (AMD Radeon HD 2900 and NVIDIA’s Tesla).
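
For readers unfamiliar with the “scan primitives” named in the third paper, the sketch below shows an exclusive prefix sum using the classic up-sweep/down-sweep (Blelloch-style) pattern, followed by stream compaction built on top of it. It is a sequential Python emulation for illustration only, not the paper's GPU implementation; the function names are hypothetical. On a GPU, each inner `for` loop would run as one parallel step.

```python
def exclusive_scan(values):
    """Exclusive prefix sum via up-sweep and down-sweep passes."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:          # pad to the next power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in place, doubling the stride each pass.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefix sums back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]


def compact(values, keep):
    """Stream compaction: scan the keep-flags to get each output position."""
    flags = [1 if keep(v) else 0 for v in values]
    positions = exclusive_scan(flags)
    out = [None] * sum(flags)
    for v, flag, pos in zip(values, flags, positions):
        if flag:
            out[pos] = v     # scatter; every kept element has a unique slot
    return out


print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
print(compact([5, -2, 8, -1, 0, 3], lambda v: v > 0))
```

Compaction is one example of why scan is a useful building block: the scan turns a data-dependent output layout into positions that every element can compute independently.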

## Two-electron Integral Evaluation on the Graphics Processor Unit

August 16th, 2007

Abstract: We propose the algorithm to evaluate the Coulomb potential in the ab initio density functional calculation on the graphics processor unit (GPU). The numerical accuracy required for the algorithm is investigated in detail. It is shown that GPU, which supports only the single-precision floating number natively, can take part in the major computational tasks. Because of the limited size of the working memory, the Gauss-Rys quadrature to evaluate the electron repulsion integrals (ERIs) is investigated in detail. The error analysis of the quadrature is performed. New interpolation formula of the roots and weights is presented, which is suitable for the processor of the single-instruction multiple-data type. It is proposed to calculate only small ERIs on GPU. ERIs can be classified efficiently with the upper-bound formula. The algorithm is implemented on NVIDIA GeForce 8800 GTX and the Gaussian 03 program suite. It is applied to the test molecules Taxol and Valinomycin. The total energies calculated are essentially the same as the reference ones. The preliminary results show the considerable speedup over the commodity microprocessor. (Two-electron integral evaluation on the graphics processor unit. Koji Yasuda. Journal of Computational Chemistry. July 5, 2007.)

## Accelerating molecular modeling applications with graphics processors

August 11th, 2007

In this paper, an overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases. We demonstrate the use of GPUs for the calculation of long-range electrostatics and nonbonded forces for molecular dynamics simulations. The application of GPU acceleration to biomolecular simulation is also demonstrated through the use of GPU-accelerated Coulomb-based ion placement and calculation of time-averaged potentials from molecular dynamics trajectories. A novel approximation to Coulomb potential calculation, the multilevel summation method, is introduced and compared to direct Coulomb summation. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed. (Accelerating molecular modeling applications with graphics processors. John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. *Journal of Computational Chemistry* (in press).)
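
As a rough illustration of the direct Coulomb summation that the paper uses as its baseline, the sketch below evaluates the potential at every point of a regular lattice by summing over all point charges. It is a plain Python sketch under stated assumptions, not the authors' code: the function name and argument layout are hypothetical, the Coulomb constant is taken as 1, and a lattice point that coincides exactly with a charge simply skips that charge.

```python
import math


def direct_coulomb_grid(atoms, origin, spacing, shape):
    """Potential on a regular lattice by direct summation.

    atoms   : list of (x, y, z, charge) tuples
    origin  : (x, y, z) of lattice point [0][0][0]
    spacing : distance between neighbouring lattice points
    shape   : (nx, ny, nz) number of lattice points per axis
    """
    nx, ny, nz = shape
    ox, oy, oz = origin
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for ix in range(nx):
        x = ox + ix * spacing
        for iy in range(ny):
            y = oy + iy * spacing
            for iz in range(nz):
                z = oz + iz * spacing
                total = 0.0
                for ax, ay, az, q in atoms:
                    r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                    if r > 0.0:          # assumption: skip the singular term
                        total += q / r
                grid[ix][iy][iz] = total
    return grid


atoms = [(0.0, 0.0, 0.0, 1.0), (4.0, 0.0, 0.0, -1.0)]
grid = direct_coulomb_grid(atoms, origin=(0.0, 0.0, 0.0), spacing=1.0, shape=(5, 1, 1))
print([grid[ix][0][0] for ix in range(5)])
```

The cost grows as the number of lattice points times the number of charges, but every lattice point is independent of the others, which is what makes this calculation a natural fit for data-parallel hardware.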

## High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA

July 27th, 2007

Abstract: “We present the results of gravitational direct N-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in “Compute Unified Device Architecture” (CUDA) using the GPU to speed-up the calculations. We tested the implementation on three different N-body codes: two direct N-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motions for all codes is performed on the host CPU. We find that for N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the N-body system was conserved to better than one in 10^6 on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For N > 10^5 the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.” (Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA. Accepted for publication in New Astronomy.)
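
The abstract's notions of a softened direct force evaluation and energy conservation can be sketched as follows. This is an illustrative Python example, not the authors' CUDA code: for brevity it pairs direct summation with a 2nd order leapfrog (kick-drift-kick) step rather than the Hermite scheme, assumes Plummer-style softening with `G = 1`, and uses hypothetical function names. The force loop is the part the paper offloads to the GPU.

```python
import math


def accelerations(pos, mass, eps, G=1.0):
    """Direct O(N^2) gravitational accelerations with softening length eps."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, mass, dt, eps):
    """One kick-drift-kick leapfrog step, updating pos and vel in place."""
    acc = accelerations(pos, mass, eps)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # half kick
            pos[i][k] += dt * vel[i][k]         # drift
    acc = accelerations(pos, mass, eps)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # second half kick


def total_energy(pos, vel, mass, eps, G=1.0):
    """Kinetic plus softened potential energy of the system."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r2 = sum((pos[j][k] - pos[i][k]) ** 2 for k in range(3)) + eps * eps
            potential -= G * mass[i] * mass[j] / math.sqrt(r2)
    return kinetic + potential


# Two equal masses on a near-circular orbit about their centre of mass.
mass = [0.5, 0.5]
pos = [[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
vel = [[0.0, -0.5, 0.0], [0.0, 0.5, 0.0]]
eps = 0.01

e0 = total_energy(pos, vel, mass, eps)
for _ in range(1000):
    leapfrog_step(pos, vel, mass, dt=0.01, eps=eps)
e1 = total_energy(pos, vel, mass, eps)
print("relative energy drift:", abs((e1 - e0) / e0))
```

Tracking the relative energy drift in this way is the same kind of check the abstract reports when it compares energy conservation on the GPU against GRAPE-6Af.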

## Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests

July 27th, 2007

Abstract: “In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, with each node equipped with 2 modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with shared time-step algorithm in this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a globular cluster of 1024K particles. In comparing with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves a more than twice higher measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.” (Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh. Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests. Submitted to New Astronomy, 20 July, 2007.)
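
The figures quoted in the GraCCA abstract can be related with a little arithmetic. The short Python sketch below uses only the numbers stated above; the variable names are illustrative. Note that the fraction of theoretical peak computed here is a different quantity from the 90% parallel efficiency, which the abstract reports separately and does not define.

```python
nodes = 16
gpus_per_node = 2
peak_tflops = 16.2      # theoretical performance quoted in the abstract
measured_tflops = 7.1   # measured performance quoted in the abstract

gpu_count = nodes * gpus_per_node
peak_per_gpu = peak_tflops / gpu_count
fraction_of_peak = measured_tflops / peak_tflops

print(f"GPUs in the cluster: {gpu_count}")
print(f"theoretical peak per GPU: {peak_per_gpu:.3f} TFLOPS")
print(f"measured fraction of peak: {fraction_of_peak:.1%}")
```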