Graphics Hardware 2007 Papers

August 16th, 2007

On 4-5 August 2007, San Diego hosted the annual Graphics Hardware conference. GPGPU figured prominently in three papers:

  • As transistors get smaller, their transient failure rates increase. Future architectures must adapt to address the resulting reliability problems. Jeremy Sheaffer presented a paper demonstrating a hardware-based redundancy approach to ensure reliability for GPGPU applications. (“A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors”. Jeremy Sheaffer, University of Virginia; David Luebke, NVIDIA Research; Kevin Skadron, University of Virginia.)
  • Magnus Strengert presented a generic, minimally intrusive GLSL debugger that operates transparently to the application. Shader debugging is performed at the per-draw-call level, and the debugger allows single-stepping and inspection of arbitrary variable content. Linux code is available, and Windows code is expected by the end of the year. (“A Hardware-Aware Debugger for the OpenGL Shading Language”. Magnus Strengert, Thomas Klein, and Thomas Ertl, University of Stuttgart.)
  • One critical need for GPGPU developers is a library of general-purpose building blocks for GPU computation. Shubhabrata Sengupta presented a paper describing a GPU implementation of the “scan primitives” and their use in novel GPU implementations of quicksort, efficient sparse matrix-vector multiplication, and tridiagonal matrix systems. This paper won the Best Paper award and the authors are preparing an open-source release. (“Scan Primitives for GPU Computing”. Shubhabrata Sengupta, UC Davis; Mark Harris, NVIDIA Corporation; Yao Zhang, UC Davis; John D. Owens, UC Davis.)
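
As a rough illustration of what a scan primitive computes, the Python sketch below runs an exclusive prefix sum using the two-phase up-sweep/down-sweep formulation that is commonly used on parallel hardware. It is a sequential CPU sketch, not the paper's GPU implementation; the function name `exclusive_scan` and the power-of-two padding are assumptions made for this example.

```python
def exclusive_scan(values):
    """Exclusive prefix sum via an up-sweep (reduce) and a down-sweep phase."""
    n = len(values)
    if n == 0:
        return []

    # Pad to a power of two so the implicit balanced tree is complete.
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in place, from the leaves toward the root.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2

    return data[:n]


if __name__ == "__main__":
    print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

Within each stride, the inner-loop iterations touch disjoint elements, which is what would let a parallel version run them concurrently.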

All Graphics Hardware 2007 papers are available in the ACM digital library. In addition, the GH07 program page contains slides for all talks as well as two keynote talks (Chas. Boyd of the Microsoft DirectX team: “Mass Market Applications of Data-Parallel Computing” and Michael Jones, chief technologist of Google Earth: “GPUs for the true mass market”) and vendor talks from AMD and NVIDIA about their latest processors (AMD Radeon HD 2900 and NVIDIA’s Tesla).

Two-electron Integral Evaluation on the Graphics Processor Unit

August 16th, 2007

Abstract: We propose the algorithm to evaluate the Coulomb potential in the ab initio density functional calculation on the graphics processor unit (GPU). The numerical accuracy required for the algorithm is investigated in detail. It is shown that GPU, which supports only the single-precision floating number natively, can take part in the major computational tasks. Because of the limited size of the working memory, the Gauss-Rys quadrature to evaluate the electron repulsion integrals (ERIs) is investigated in detail. The error analysis of the quadrature is performed. New interpolation formula of the roots and weights is presented, which is suitable for the processor of the single-instruction multiple-data type. It is proposed to calculate only small ERIs on GPU. ERIs can be classified efficiently with the upper-bound formula. The algorithm is implemented on NVIDIA GeForce 8800 GTX and the Gaussian 03 program suite. It is applied to the test molecules Taxol and Valinomycin. The total energies calculated are essentially the same as the reference ones. The preliminary results show the considerable speedup over the commodity microprocessor. (Two-electron integral evaluation on the graphics processor unit. Koji Yasuda. Journal of Computational Chemistry. July 5, 2007.)

Accelerating molecular modeling applications with graphics processors

August 11th, 2007

In this paper, an overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases. We demonstrate the use of GPUs for the calculation of long-range electrostatics and nonbonded forces for molecular dynamics simulations. The application of GPU acceleration to biomolecular simulation is also demonstrated through the use of GPU-accelerated Coulomb-based ion placement and calculation of time-averaged potentials from molecular dynamics trajectories. A novel approximation to Coulomb potential calculation, the multilevel summation method, is introduced and compared to direct Coulomb summation. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed. (Accelerating molecular modeling applications with graphics processors, John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Journal of Computational Chemistry (In press))
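
For readers unfamiliar with direct Coulomb summation, the following Python sketch shows the basic computation: the potential at each grid point is the sum of q / r over all atoms. This is a plain CPU illustration under simplifying assumptions (unit constants, no cutoff, no periodic boundaries); the function name and data layout are hypothetical and are not taken from the paper.

```python
import math


def direct_coulomb_potential(atoms, grid_points):
    """Return the potential at each grid point as the sum of q / r over atoms.

    atoms:       iterable of (x, y, z, charge) tuples
    grid_points: iterable of (x, y, z) tuples
    Physical constants are omitted (assumed to be 1) for clarity.
    """
    potentials = []
    for gx, gy, gz in grid_points:
        total = 0.0
        for x, y, z, q in atoms:
            r = math.sqrt((gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2)
            if r > 0.0:  # skip a grid point that coincides with an atom
                total += q / r
        potentials.append(total)
    return potentials


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, -1.0)]
    print(direct_coulomb_potential(atoms, [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]))
```

The cost grows with the number of atoms times the number of grid points, and each grid point is computed independently of the others, which is why this kind of calculation is a natural candidate for data-parallel hardware.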

High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA

July 27th, 2007

Abstract: “We present the results of gravitational direct N-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in “Compute Unified Device Architecture” (CUDA) using the GPU to speed-up the calculations. We tested the implementation on three different N-body codes: two direct N-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motions for all codes is performed on the host CPU. We find that for N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the N-body system was conserved to better than one in 10^6 on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For N > 10^5 the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.” (Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA. Accepted for publication in New Astronomy.)
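
To make the role of softening in the force evaluation concrete, here is a small Python sketch of a direct-summation acceleration calculation. It assumes Plummer-style softening (adding the square of a softening length to the squared distance); the abstract does not say which softening form the authors use, and the function name and arguments are hypothetical. This is a CPU illustration, not the CUDA code described in the paper.

```python
import math


def softened_accelerations(positions, masses, softening, G=1.0):
    """Direct-summation gravitational accelerations with Plummer-style softening.

    positions: list of [x, y, z] coordinates
    masses:    list of masses, same length as positions
    softening: softening length (assumed form; 0.0 gives unsoftened gravity)
    """
    n = len(positions)
    eps2 = softening * softening
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Separation vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


if __name__ == "__main__":
    pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    print(softened_accelerations(pos, [1.0, 2.0], softening=0.0))
```

With a non-zero softening length, the denominator never reaches zero, so close encounters no longer produce arbitrarily large forces.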

Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests

July 27th, 2007

Abstract: “In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, with each node equipped with 2 modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with shared time-step algorithm in this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a globular cluster of 1024K particles. In comparing with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves a more than twice higher measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.” (Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh. Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests. Submitted to New Astronomy, 20 July, 2007.)

Call for Participation: AstroGPU 2007

July 26th, 2007

A new workshop, AstroGPU 2007: General Purpose Computation on GPUs in Astronomy and Astrophysics, will be held November 9-10, 2007 at the Institute for Advanced Study in Princeton, NJ. The goal of the workshop is to explore and discuss the applicability of GPUs to astrophysical problems. It will bring together astrophysicists, colleagues from other areas of science where GPGPU techniques have been successfully applied, and industry representatives who will demonstrate GPU hardware, programming tools, and GPGPU techniques in tutorial sessions. The workshop is geared towards astrophysicists wishing to learn GPGPU (specifically, CUDA) techniques and port their code to GPUs. For more information, see

Call For Participation for I3D 2008

July 19th, 2007

I3D 2008 (aka the Symposium on Interactive 3D Graphics and Games) will be held the weekend before GDC this year, February 15-17, in nearby Redwood City, CA. The Call For Participation is now up at the website: October 22 is this year’s paper deadline. This is a small conference, 100 attendees or so, that offers a good opportunity to meet other people working on GPU-related techniques. I3D 2007 included a number of GPGPU-related papers on interactive ray tracing, mesh simplification, and histogram generation; see Ke-Sen Huang’s summary page. (CFP I3D 2008 page)

JVSP Special Issue on Multicore Enabled Multimedia Applications & Architectures

July 17th, 2007

The trend toward multicore processors brings a paradigm shift in application development. Traditionally, increasing clock frequency was one of the main ways for conventional processors to achieve higher performance, and application developers could improve the performance of their applications simply by waiting for faster processor platforms. Today, increasing clock frequency has reached a point of diminishing returns, and even negative returns if power is taken into account. Multicore processors, also known as chip multiprocessors (CMPs), promise a power-efficient way to increase performance and are becoming more prevalent in vendors’ solutions, for example, IBM Cell Broadband Engine processors, Intel Core 2 Duo processors, and so on. However, the application or algorithm development process must change significantly in order to fully exploit the potential of multicore processors. This special issue of the Journal of VLSI Signal Processing Systems will discuss related challenges, issues, case studies, and solutions, focusing especially on multimedia-related applications, architectures, and programming environments, for example, understanding the complexity of developing a new application or porting an existing application onto a multicore processor. (Call for papers)

A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware

July 14th, 2007

This paper by Banterle and Giacobazzi at Università degli Studi di Verona presents an efficient implementation of the Octagon Abstract Domain (OAD) on graphics hardware. OAD is a relational numerical abstract domain which approximates invariants as conjunctions of constraints of the form +/- x +/- y <= c, where x and y are program variables and c is a constant which can be an integer, rational or real. OAD has been used with success in the aerospace industry for analyzing C programs such as the flight control software for the Airbus A340 fly-by-wire system. (A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware. Francesco Banterle and Roberto Giacobazzi. Proceedings of the 14th International Static Analysis Symposium (SAS). 2007)
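
To give a feel for how constraints of the form +/- x +/- y <= c can be stored and combined, the Python sketch below uses a difference-bound-matrix encoding in which each variable v gets two nodes, one for +v and one for -v, and then derives implied bounds with a Floyd-Warshall shortest-path pass. The helper names are hypothetical and this is not the paper's GPU implementation. A complete octagon closure (usually called strong closure) also needs an extra tightening step, which is deliberately omitted here.

```python
import math

INF = math.inf


def make_dbm(num_vars):
    """Create a 2n x 2n matrix; entry [i][j] is an upper bound on node_i - node_j."""
    size = 2 * num_vars
    return [[0 if i == j else INF for j in range(size)] for i in range(size)]


def node(sign, var):
    """Node 2*var stands for +var, node 2*var + 1 stands for -var."""
    return 2 * var if sign > 0 else 2 * var + 1


def add_constraint(dbm, sign_x, x, sign_y, y, c):
    """Record sign_x*x + sign_y*y <= c, with signs in {+1, -1}."""
    # (sign_x*x) - (-(sign_y*y)) <= c, plus the equivalent mirrored form.
    a, b = node(sign_x, x), node(-sign_y, y)
    dbm[a][b] = min(dbm[a][b], c)
    a, b = node(sign_y, y), node(-sign_x, x)
    dbm[a][b] = min(dbm[a][b], c)


def shortest_path_closure(dbm):
    """Tighten every bound using node_i - node_j = (node_i - node_k) + (node_k - node_j)."""
    size = len(dbm)
    for k in range(size):
        for i in range(size):
            for j in range(size):
                via_k = dbm[i][k] + dbm[k][j]
                if via_k < dbm[i][j]:
                    dbm[i][j] = via_k
    return dbm


if __name__ == "__main__":
    X, Y, Z = 0, 1, 2
    dbm = make_dbm(3)
    add_constraint(dbm, +1, X, -1, Y, 1)   # x - y <= 1
    add_constraint(dbm, +1, Y, -1, Z, 2)   # y - z <= 2
    shortest_path_closure(dbm)
    print(dbm[node(+1, X)][node(+1, Z)])   # derived bound on x - z
```

In the usage example, the two input constraints imply x - z <= 3, and the closure pass makes that bound explicit in the matrix.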

Lattice QCD as a video game (GPGPU for quantum field theory)

July 14th, 2007

This paper outlines how GPGPU techniques can be used for Monte Carlo simulations of quantum field theories such as QCD. Relative to SSE-optimized code on a Pentium 4, the speedup is around a factor of 4-10, depending on the GPU model. Sample code is also given. (Lattice QCD as a video game)