In this paper, an overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases. We demonstrate the use of GPUs for the calculation of long-range electrostatics and nonbonded forces for molecular dynamics simulations. The application of GPU acceleration to biomolecular simulation is also demonstrated through the use of GPU-accelerated Coulomb-based ion placement and calculation of time-averaged potentials from molecular dynamics trajectories. A novel approximation to Coulomb potential calculation, the multilevel summation method, is introduced and compared to direct Coulomb summation. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed. (Accelerating molecular modeling applications with graphics processors, John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Journal of Computational Chemistry (In press))
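For readers unfamiliar with direct Coulomb summation, the baseline the multilevel summation method is compared against, the idea can be sketched in a few lines of Python. This is only an illustrative CPU sketch, not the authors' GPU implementation; the function name `direct_coulomb_potential`, the tuple layouts, and the unit convention (Coulomb constant of 1) are assumptions made for the example.

```python
import math

def direct_coulomb_potential(atoms, grid_points, coulomb_constant=1.0):
    """Sum q_i / r_ij over all atoms for every grid point (O(N*M) work).

    atoms:       list of (x, y, z, charge) tuples  (hypothetical layout)
    grid_points: list of (x, y, z) tuples
    """
    potentials = []
    for gx, gy, gz in grid_points:
        total = 0.0
        for ax, ay, az, q in atoms:
            r = math.sqrt((gx - ax) ** 2 + (gy - ay) ** 2 + (gz - az) ** 2)
            # Assumption: skip an atom that sits exactly on the grid point
            # to avoid dividing by zero.
            if r > 0.0:
                total += q / r
        potentials.append(coulomb_constant * total)
    return potentials

atoms = [(0.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, -1.0)]
grid = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
potential_map = direct_coulomb_potential(atoms, grid)
```

Every grid point is independent of the others, which is what makes this kind of calculation a natural fit for one-thread-per-grid-point GPU execution.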
High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA
July 27th, 2007
Abstract: “We present the results of gravitational direct N-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in “Compute Unified Device Architecture” (CUDA) using the GPU to speed-up the calculations. We tested the implementation on three different N-body codes: two direct N-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motions for all codes is performed on the host CPU. We find that for N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the N-body system was conserved to better than one in 10^6 on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For N > 10^5 the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.” (Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA. Accepted for publication in New Astronomy.)
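The softened force evaluation mentioned in the abstract can be illustrated with a small direct-summation sketch. It is written in plain Python for readability and is not the paper's CUDA kernel; the function name `softened_accelerations`, the default softening length, and the choice of G = 1 are assumptions for the example.

```python
def softened_accelerations(positions, masses, softening=0.01, G=1.0):
    """Direct O(N^2) gravitational accelerations with Plummer softening.

    a_i = G * sum_{j != i} m_j * (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    """
    n = len(positions)
    eps2 = softening * softening
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r3 = r2 ** -1.5
            acc[i][0] += G * masses[j] * dx * inv_r3
            acc[i][1] += G * masses[j] * dy * inv_r3
            acc[i][2] += G * masses[j] * dz * inv_r3
    return acc

# Two unit masses one length unit apart, with softening switched off.
two_body = softened_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                                  [1.0, 1.0], softening=0.0)
```

With a nonzero softening length the denominator never vanishes, so close encounters need no special handling inside the inner loop.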
Abstract: “In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, with each node equipped with 2 modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with shared time-step algorithm in this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a globular cluster of 1024K particles. In comparing with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves a more than twice higher measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.” (Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh. Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests. Submitted to New Astronomy, 20 July, 2007.)
A new workshop called AstroGPU 2007: General Purpose Computation on GPUs in Astronomy and Astrophysics will be held November 9-10th, 2007 at the Institute for Advanced Study in Princeton, NJ. The goal of this workshop is to explore and discuss the applicability of GPUs to astrophysical problems. It will bring together astrophysicists, colleagues from other areas of science where GPGPU techniques have been successfully applied, and representatives from the industry who will demonstrate in tutorial sessions the GPU hardware, programming tools, and GPGPU techniques. This workshop is geared towards astrophysicists wishing to learn GPGPU (specifically, CUDA) techniques and port their code to GPUs. For more information, see http://www.astrogpu.org.
I3D 2008 (aka the Symposium on Interactive 3D Graphics and Games) will be held the weekend before GDC, February 15-17, 2008, in nearby Redwood City, CA. The Call For Participation is now up at the website: October 22 is this year’s paper deadline. This is a small conference, 100 attendees or so, that offers a good opportunity to meet other people working on GPU-related techniques. I3D 2007 included a number of GPGPU-related papers on interactive ray tracing, mesh simplification, and histogram generation; see Ke-Sen Huang’s summary page. (CFP I3D 2008 page)
The trend toward multicore processors brings a paradigm shift in application development. Traditionally, increasing clock frequency was one of the main ways for conventional processors to achieve higher performance, and application developers could improve the performance of their applications simply by waiting for faster processor platforms. Today, increasing clock frequency has reached a point of diminishing returns, and even negative returns if power is taken into account. Multicore processors, also known as chip multiprocessors (CMPs), promise a power-efficient way to increase performance and are becoming more prevalent in vendors’ solutions, for example, IBM Cell Broadband Engine processors, Intel Core 2 Duo processors, and so on. However, the application or algorithm development process must change significantly in order to fully exploit the potential of multicore processors. This special issue of the Journal of VLSI Signal Processing Systems will discuss related challenges, issues, case studies, and solutions, focusing especially on multimedia-related applications, architectures, and programming environments, for example, understanding the complexity of developing a new application or porting an existing application onto a multicore processor. (Call for papers)
This paper by Banterle and Giacobazzi at Università degli Studi di Verona presents an efficient implementation of the Octagon Abstract Domain (OAD) on graphics hardware. OAD is a relational numerical abstract domain which approximates invariants as conjunctions of constraints of the form +/- x +/- y <= c, where x and y are program variables and c is a constant which can be an integer, rational or real. OAD has been used with success in the aerospace industry for analyzing C programs such as the flight control software for the Airbus A340 fly-by-wire system. (A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware. Francesco Banterle and Roberto Giacobazzi. Proceedings of the 14th International Static Analysis Symposium (SAS). 2007)
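To make the constraint form +/- x +/- y <= c concrete, here is a minimal Python sketch that checks whether a concrete assignment of program variables satisfies a conjunction of octagonal constraints. It only illustrates the shape of the constraints, not the paper's GPU implementation or its closure algorithm; the tuple encoding `(sign_x, x, sign_y, y, c)` and the function name `satisfies` are assumptions made for this example.

```python
def satisfies(constraints, env):
    """Return True if every constraint sign_x*x + sign_y*y <= c holds in env.

    constraints: list of (sign_x, x_name, sign_y, y_name, c), signs are +1 or -1
    env:         dict mapping variable names to numeric values
    """
    return all(sx * env[x] + sy * env[y] <= c
               for sx, x, sy, y, c in constraints)

# x + y <= 10,  x - y <= 2,  and x >= 0 encoded as -x - x <= 0.
octagon = [
    (+1, "x", +1, "y", 10),
    (+1, "x", -1, "y", 2),
    (-1, "x", -1, "x", 0),
]

inside = satisfies(octagon, {"x": 3, "y": 4})
outside = satisfies(octagon, {"x": 8, "y": 4})
```

Using the same variable twice, as in the third constraint, is a common way to express a single-variable bound within the two-variable form.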
This paper outlines how GPGPU techniques can be used for Monte Carlo simulations of quantum field theories such as QCD. Relative to SSE-optimized code on a Pentium 4, the speedup is a factor of roughly 4-10, depending on the GPU model. Sample code is also given. (Lattice QCD as a video game)
According to an article on Extremetech.com, French company GPU-Tech has announced Ecolib, a series of C++ libraries for GPGPU which target both ATI and NVIDIA GPUs. A PDF describing the API is available. Their download page includes demo software with code samples and workstation CPU/GPU benchmarking tools.
This technical report by N. Cuntz, R. Strzodka and A. Kolb describes a particle level set (PLS) system for fast and accurate surface tracking on the GPU. The technique demonstrates the coupling of grid and particle information by using vertex/fragment buffer objects, shaders and blending functionality in an innovative way. Improvements over the original PLS technique include a sub-voxel interface representation and a more accurate level set correction using more precise particle radii. As a concrete application the authors demonstrate that their fast and accurate PLS is well suited to the visualization of dynamic flows. An accurate evolution of time surfaces and representation of path volumes offer a more reliable basis for data interpretation. (Real-Time Particle Level Sets with Application to Flow Visualization. Technical report, 2007)