Scalable Fast Multipole Methods for Heterogeneous CPU-GPU Architectures

August 17th, 2011


We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture, with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version in which the CPU part is parallelized using OpenMP and the GPU part via CUDA. New parallel algorithms for creating FMM data structures are presented, together with load-balancing strategies for the single-node and distributed multiple-node versions. Our 8-GPU performance is comparable to that of the 256-GPU FMM run that won the 2009 Gordon Bell prize.

(Qi Hu, Nail A. Gumerov and Ramani Duraiswami: “Scalable fast multipole methods on distributed heterogeneous architectures”, accepted for SC’11.)
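
The overlap the abstract describes lends itself to a compact illustration. Below is a minimal sketch, not the authors' code, of how the two independent passes of an FMM step can run concurrently: the direct particle-particle (P2P) summation is launched asynchronously on the GPU while the multipole translation pass runs on all CPU cores via OpenMP. Everything here (p2p_kernel, the translate_cell callback, the data layout) is an illustrative assumption; a real FMM also restricts P2P to neighbor lists rather than the plain all-pairs loop shown.

    #include <omp.h>
    #include <cuda_runtime.h>

    // Direct near-field sum, shown as a plain all-pairs loop for brevity.
    __global__ void p2p_kernel(const float4 *body, float4 *accel, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 a = make_float3(0.f, 0.f, 0.f);
        for (int k = 0; k < n; ++k) {
            float3 d = make_float3(body[k].x - body[i].x,
                                   body[k].y - body[i].y,
                                   body[k].z - body[i].z);
            float r2 = d.x*d.x + d.y*d.y + d.z*d.z + 1e-9f;  // softened
            float inv_r3 = rsqrtf(r2) / r2;
            a.x += body[k].w * inv_r3 * d.x;   // .w holds the mass
            a.y += body[k].w * inv_r3 * d.y;
            a.z += body[k].w * inv_r3 * d.z;
        }
        accel[i] = make_float4(a.x, a.y, a.z, 0.f);
    }

    // Overlap: launch the GPU pass asynchronously, run the CPU-side
    // translations (M2M/M2L/L2L) under OpenMP, then synchronize.
    void fmm_step(const float4 *d_body, float4 *d_accel, int n,
                  int num_cells, void (*translate_cell)(int))
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        int block = 256, grid = (n + block - 1) / block;
        p2p_kernel<<<grid, block, 0, s>>>(d_body, d_accel, n);  // async on GPU

        #pragma omp parallel for schedule(dynamic)              // on CPU cores
        for (int c = 0; c < num_cells; ++c)
            translate_cell(c);

        cudaStreamSynchronize(s);   // join before combining near and far fields
        cudaStreamDestroy(s);
    }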

Swarm-NG: integration of an ensemble of N-body systems

July 29th, 2010

The Swarm-NG package helps scientists and engineers harness the power of GPUs. In the early releases, Swarm-NG will focus on the integration of an ensemble of N-body systems evolving under Newtonian gravity. Swarm-NG does not replicate existing libraries that calculate forces for large-N systems on GPUs, but rather focuses on integrating an ensemble of many systems where N is small. This is of particular interest for astronomers who study the chaotic evolution of planetary systems. In the long term, we hope Swarm-NG will allow for the efficient parallel integration of user-defined systems of ordinary differential equations.
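
Swarm-NG's own API differs, but the ensemble idea maps naturally onto CUDA's execution hierarchy and is worth a sketch: assign one thread block per planetary system and one thread per body, so thousands of small-N systems integrate in parallel. Everything below (NBODY, the kick kernel, the pos/vel layout, the softening constant) is an assumption for illustration, not Swarm-NG code.

    #include <cuda_runtime.h>

    #define NBODY 4   // bodies per system (small N)

    // Launched as ensemble_kick<<<nsys, NBODY>>>(...): block = system,
    // thread = body, so the whole ensemble advances in one kernel call.
    __global__ void ensemble_kick(const float4 *pos, float3 *vel,
                                  float dt, int nsys)
    {
        int sys = blockIdx.x;      // one block per planetary system
        int i   = threadIdx.x;     // one thread per body
        if (sys >= nsys) return;

        __shared__ float4 p[NBODY];  // stage the whole system in shared memory
        p[i] = pos[sys * NBODY + i];
        __syncthreads();

        float3 a = make_float3(0.f, 0.f, 0.f);
        for (int j = 0; j < NBODY; ++j) {   // all-pairs force within the system
            if (j == i) continue;
            float3 r = make_float3(p[j].x - p[i].x,
                                   p[j].y - p[i].y,
                                   p[j].z - p[i].z);
            float r2 = r.x*r.x + r.y*r.y + r.z*r.z + 1e-9f;  // softened
            float inv_r3 = rsqrtf(r2) / r2;
            a.x += p[j].w * inv_r3 * r.x;    // p[j].w holds the mass
            a.y += p[j].w * inv_r3 * r.y;
            a.z += p[j].w * inv_r3 * r.z;
        }
        vel[sys * NBODY + i].x += dt * a.x;  // kick step of a leapfrog scheme
        vel[sys * NBODY + i].y += dt * a.y;
        vel[sys * NBODY + i].z += dt * a.z;
    }

The payoff of this layout is occupancy: a single 4-body system cannot fill a GPU, but tens of thousands of blocks, one per system, can.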

SAPPORO: A way to turn your graphics cards into a GRAPE-6

March 11th, 2009


In this paper, the authors present a library, named Sapporo, which closely emulates the GRAPE-6 API. The library is written in CUDA and implements the most common functions used in N-body codes that support GRAPE-6; as a result, such codes can use Sapporo without modification to their source code. The library also supports the use of multiple GPUs per host. The authors carried out a series of systematic tests of the library's performance, accuracy, and ability to handle a realistic N-body problem. They found that the performance of the library with a single G80/G92 GPU is a factor of two higher than that of GRAPE-6A(BLX) PCI(X) cards, and that the sustained performance with two GeForce 9800GX2 cards is on par with a 32-chip GRAPE-6 system (about 800 GFlop/s). The accuracy of the library is comparable to that of GRAPE-6 hardware, and its ability to correctly solve a realistic N-body problem makes it an alternative to GRAPE-6 special-purpose hardware.

(Evghenii Gaburov, Stefan Harfst and Simon Portegies Zwart, “SAPPORO: A way to turn your graphics cards into a GRAPE-6”, submitted to New Astronomy.)
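
The GRAPE-6 interface that Sapporo emulates revolves around one operation: given a set of j-particles, return the acceleration and its time derivative (the jerk) for a set of i-particles, which is exactly what 4th-order Hermite integrators consume. A minimal sketch of that core computation follows; the kernel name, signature, and data layout are assumptions for illustration, not Sapporo's source.

    #include <cuda_runtime.h>

    // For each i-particle, accumulate gravitational acceleration and jerk
    // from all j-particles. posm_*.w holds the mass; eps2 is the softening.
    __global__ void acc_jerk_kernel(int ni, int nj,
                                    const float4 *posm_j, const float4 *vel_j,
                                    const float4 *posm_i, const float4 *vel_i,
                                    float4 *acc, float4 *jrk, float eps2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ni) return;

        float4 pi = posm_i[i], vi = vel_i[i];
        float3 a = make_float3(0.f, 0.f, 0.f);
        float3 j = make_float3(0.f, 0.f, 0.f);

        for (int k = 0; k < nj; ++k) {
            float4 pk = posm_j[k];
            float3 dr = make_float3(pk.x - pi.x, pk.y - pi.y, pk.z - pi.z);
            float3 dv = make_float3(vel_j[k].x - vi.x,
                                    vel_j[k].y - vi.y,
                                    vel_j[k].z - vi.z);
            float r2    = dr.x*dr.x + dr.y*dr.y + dr.z*dr.z + eps2;
            float rinv  = rsqrtf(r2);
            float rinv3 = rinv * rinv * rinv;
            float rv    = 3.f * (dr.x*dv.x + dr.y*dv.y + dr.z*dv.z) / r2;
            a.x += pk.w * rinv3 * dr.x;
            a.y += pk.w * rinv3 * dr.y;
            a.z += pk.w * rinv3 * dr.z;
            j.x += pk.w * rinv3 * (dv.x - rv * dr.x);  // jerk: d(acc)/dt
            j.y += pk.w * rinv3 * (dv.y - rv * dr.y);
            j.z += pk.w * rinv3 * (dv.z - rv * dr.z);
        }
        acc[i] = make_float4(a.x, a.y, a.z, 0.f);
        jrk[i] = make_float4(j.x, j.y, j.z, 0.f);
    }

Wrapping a kernel of this shape behind the GRAPE-6 call sequence is what lets unmodified GRAPE codes run on GPUs.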

Toward efficient GPU-accelerated N-body simulations

January 18th, 2008

Abstract: “N-body algorithms are applicable to a number of common problems in computational physics including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N^2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but a series of improvements that support implementation of this solver on highly data-parallel graphics processing units (GPUs). The greatly reduced computation times suggest that this problem is ideally suited for the current and next generations of single and cluster CPU-GPU architectures. We believe that this is an ideal method for practical computation of large-scale turbulent flows on future supercomputing hardware using parallel vortex particle methods.” (Mark J. Stock and Adrin Gharakhani, “Toward efficient GPU-accelerated N-body simulations,” in 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608, January 2008, Reno, Nevada.)
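
The device that takes a treecode below O(N^2) is the multipole acceptance criterion: a cell far enough from the evaluation point contributes through a few multipole terms instead of a body-by-body sum; otherwise it is opened and its children are visited. The host-side sketch below is a generic illustration under assumptions of ours (monopole-only cells, a simple s/d < theta criterion, the Cell layout), not the paper's far more elaborate GPU solver.

    #include <math.h>
    #include <cuda_runtime.h>   // for the float3 type

    struct Cell {
        float3 center;  float size;  float mass;  // monopole only, for brevity
        int first_child, num_children;            // num_children == 0: leaf
    };

    // Acceleration at point x from the tree rooted at cell c. A leaf is
    // treated as a point mass here, another simplification.
    float3 treecode_accel(const Cell *cells, int c, float3 x, float theta)
    {
        const Cell &cell = cells[c];
        float3 d = {cell.center.x - x.x, cell.center.y - x.y,
                    cell.center.z - x.z};
        float r2 = d.x*d.x + d.y*d.y + d.z*d.z + 1e-9f;

        // Accept if (size / distance) < theta, i.e. size^2 < theta^2 * r^2.
        if (cell.size * cell.size < theta * theta * r2
            || cell.num_children == 0) {
            float inv_r3 = 1.0f / (sqrtf(r2) * r2);
            return {cell.mass * inv_r3 * d.x,
                    cell.mass * inv_r3 * d.y,
                    cell.mass * inv_r3 * d.z};
        }
        float3 a = {0.f, 0.f, 0.f};                // too close: open the cell
        for (int k = 0; k < cell.num_children; ++k) {
            float3 ac = treecode_accel(cells, cell.first_child + k, x, theta);
            a.x += ac.x;  a.y += ac.y;  a.z += ac.z;
        }
        return a;
    }

Smaller theta means more opened cells and higher accuracy at higher cost; GPU implementations typically replace this recursion with an explicit stack or precomputed interaction lists.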

Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests

July 27th, 2007

Abstract: “In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, each equipped with two modern graphics cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with a shared time-step algorithm on this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% in simulating a globular cluster of 1024K particles. Compared with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves more than twice the measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.” (Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh, “Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests”, submitted to New Astronomy, 20 July 2007.)
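
The shared time-step scheme mentioned in the abstract is the simplest scheduling choice for direct N-body on GPUs: every particle advances with the same global dt, so each step is exactly one all-pairs force pass. The host-side sketch below shows the idea under assumptions of ours (the dt criterion, a plain kick-drift update, the gpu_forces callback); it is not the GraCCA code.

    #include <math.h>
    #include <cuda_runtime.h>   // for the float3 type

    // Advance all n particles to t_end with one shared dt per step, chosen
    // from the most demanding particle so the whole system stays accurate.
    void shared_timestep_loop(int n, float eta, float t_end,
                              void (*gpu_forces)(int, const float3 *, float3 *),
                              float3 *pos, float3 *vel, float3 *acc)
    {
        float t = 0.f;
        while (t < t_end) {
            gpu_forces(n, pos, acc);        // one all-pairs pass on the GPUs

            float amax = 0.f;               // shared dt from the max |a|
            for (int i = 0; i < n; ++i) {
                float a = sqrtf(acc[i].x*acc[i].x + acc[i].y*acc[i].y
                                + acc[i].z*acc[i].z);
                if (a > amax) amax = a;
            }
            float dt = eta / (amax + 1e-12f);   // simplistic criterion

            for (int i = 0; i < n; ++i) {   // kick-drift, same dt for all
                vel[i].x += dt * acc[i].x;
                vel[i].y += dt * acc[i].y;
                vel[i].z += dt * acc[i].z;
                pos[i].x += dt * vel[i].x;
                pos[i].y += dt * vel[i].y;
                pos[i].z += dt * vel[i].z;
            }
            t += dt;
        }
    }

The trade-off versus individual or block time-steps is that the tightest binary in a globular cluster dictates dt for everyone, but in exchange every GPU force pass is fully loaded, which is what the 90% parallel efficiency reflects.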

High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA

July 27th, 2007

Abstract: “We present the results of gravitational direct N-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in “Compute Unified Device Architecture” (CUDA) using the GPU to speed up the calculations. We tested the implementation on three different N-body codes: two direct N-body integration codes, using the 4th-order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd-order leapfrog integration scheme. The integration of the equations of motion for all codes is performed on the host CPU. We find that for N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the N-body system was conserved to better than one in 10^6 on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For N > 10^5 the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.” (Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units — II: An implementation in CUDA. Accepted for publication in New Astronomy.)
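
The 4th-order Hermite predictor-corrector has a standard textbook form (Makino & Aarseth 1992) worth writing out, since it explains the division of labor in the abstract: the CPU predicts, the GPU evaluates acceleration and jerk at the predicted states, and the CPU corrects. The sketch below follows that textbook scheme, not the paper's code; the block time-step bookkeeping (particles binned into power-of-two step sizes) is omitted.

    struct Particle { double x[3], v[3], a[3], j[3]; double dt; };

    // Predictor: extrapolate position and velocity to t + dt using the
    // current acceleration and jerk, before the GPU force pass.
    void hermite_predict(const Particle &p, double xp[3], double vp[3])
    {
        double dt = p.dt;
        for (int k = 0; k < 3; ++k) {
            xp[k] = p.x[k] + dt*p.v[k] + dt*dt/2*p.a[k] + dt*dt*dt/6*p.j[k];
            vp[k] = p.v[k] + dt*p.a[k] + dt*dt/2*p.j[k];
        }
    }

    // Corrector: uses the GPU-evaluated acceleration a1 and jerk j1 at
    // t + dt; new velocity first, then position (4th-order accurate).
    void hermite_correct(Particle &p, const double a1[3], const double j1[3])
    {
        double dt = p.dt;
        for (int k = 0; k < 3; ++k) {
            double v0 = p.v[k];
            p.v[k] = v0 + dt/2*(p.a[k] + a1[k]) + dt*dt/12*(p.j[k] - j1[k]);
            p.x[k] += dt/2*(v0 + p.v[k]) + dt*dt/12*(p.a[k] - a1[k]);
            p.a[k] = a1[k];                 // keep for the next prediction
            p.j[k] = j1[k];
        }
    }

This split also explains the abstract's softening caveat: the Hermite corrector is sensitive to force accuracy, so the single-precision GPU pipeline of the era needed softening to compete with GRAPE's fixed-point accumulation at very small time steps.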