## PyCOOL: Python Cosmological Object-Oriented Lattice code

January 25th, 2012

PyCOOL (Cosmological Object-Oriented Lattice code) is a fast GPU-accelerated program that solves the evolution of interacting scalar fields in an expanding universe with symplectic algorithms. The program is designed to hit a sweet spot of speed, accuracy and user friendliness. This is achieved by using the Python language with the PyCUDA interface, which makes the program very easy to adapt to different scalar field models. The program is publicly available under the GNU General Public License. See the PyCOOL website for more information.

## Swarm-NG: integration of an ensemble of N-body systems

July 29th, 2010

The Swarm-NG package helps scientists and engineers harness the power of GPUs. In the early releases, Swarm-NG will focus on the integration of an ensemble of N-body systems evolving under Newtonian gravity. Swarm-NG does not replicate existing libraries that calculate forces for large-N systems on GPUs, but rather focuses on integrating an ensemble of many systems where N is small. This is of particular interest for astronomers who study the chaotic evolution of planetary systems. In the long term, we hope Swarm-NG will allow for the efficient parallel integration of user-defined systems of ordinary differential equations.

## QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters

July 29th, 2010

Abstract:

We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU accelerated.

(Alexander Moore and Alice C. Quillen: “QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters”. preprint on arXiv, available code)

## A double parallel, symplectic N-body code running on Graphic Processing Units

April 26th, 2010

This paper by R. Capuzzo-Dolcetta, A. Mastrobuono-Battisti and D. Maschietti presents and discusses the characteristics and performance, in terms of both computational speed and precision, of a code that numerically integrates the equations of motion of N particles interacting via Newtonian gravitation and moving in a smooth external galactic field. The force on each particle is evaluated by direct summation of the contributions of all the other particles in the system, which avoids truncation error. The time integration is done with second-order and sixth-order symplectic schemes. The code, called “NBSymple”, uses NVIDIA CUDA to perform the all-pairs force evaluation on an NVIDIA TESLA C1060 GPU, while the O(N) computations are distributed across CPUs using the OpenMP API. The code implements both single-precision and double-precision floating-point arithmetic. Single precision is faster on the C1060 GPU but limits the accuracy of the simulation in some critical situations. The authors find a good compromise in using a software reconstruction of double precision for those variables that are most critical for the overall precision of the code. The code is available for download. (Link to preprint.)
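The combination described above, direct all-pairs force summation with a second-order symplectic scheme, can be illustrated with a minimal CPU-only sketch. This is not NBSymple's code: it assumes units with G = 1, leaves out the external galactic field, the sixth-order scheme and the GPU/OpenMP split, and the function names (`accelerations`, `leapfrog_step`, `total_energy`) are hypothetical. The second-order scheme shown is the common kick-drift-kick leapfrog.

```python
import math


def accelerations(pos, mass, G=1.0):
    """Direct all-pairs summation, O(N^2), with no force truncation."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # Equal and opposite contributions for the pair (i, j).
                acc[i][k] += G * mass[j] * d[k] * inv_r3
                acc[j][k] -= G * mass[i] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick step (second-order symplectic), in place."""
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
            pos[i][k] += dt * vel[i][k]        # full drift
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick


def total_energy(pos, vel, mass, G=1.0):
    """Kinetic plus pairwise potential energy."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            potential -= G * mass[i] * mass[j] / math.dist(pos[i], pos[j])
    return kinetic + potential


# Illustrative test case: two equal masses on a circular orbit of separation 1.
mass = [0.5, 0.5]
pos = [[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
vel = [[0.0, -0.5, 0.0], [0.0, 0.5, 0.0]]

e_start = total_energy(pos, vel, mass)
for _ in range(1000):
    leapfrog_step(pos, vel, mass, dt=0.01)
e_end = total_energy(pos, vel, mass)

print("relative energy drift:", abs((e_end - e_start) / e_start))
```

The energy drift stays small and bounded rather than growing steadily, which is the property that makes symplectic schemes attractive for long integrations. In a GPU code of the kind described, the double loop in `accelerations` is the part that would be offloaded, since it dominates the cost.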

## Direct N-body Kernels for Multicore Platforms

January 24th, 2010

From the abstract:

We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.

Nitin Arora, Aashay Shringarpure, and Richard Vuduc. “Direct n-body kernels for multicore platforms.” In Proc. Int’l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009 (direct link to PDF).

## GPU Simulations of Gravitational Many-body Problem and GPU Octrees

January 20th, 2010

This undergraduate thesis and poster by Kazuki Fujiwara and Naohito Nakasato from the University of Aizu approach a common problem in astrophysics, the many-body problem, using both brute-force summation and hierarchical data structures to solve it on ATI GPUs. Abstracts:

**Fast Simulations of Gravitational Many-body Problem on RV770 GPU
Kazuki Fujiwara, Naohito Nakasato (University of Aizu)
Abstract:**

The gravitational many-body problem is a problem concerning the movement of bodies that interact through gravity. However, solving the gravitational many-body problem with a CPU takes a lot of time due to its O(N^2) computational complexity. In this paper, we show how to speed up the gravitational many-body problem by using a GPU. After extensive optimizations, the peak performance obtained so far is about 1 Tflops.

**Oct-tree Method on GPU
N. Nakasato
Abstract:**

The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N^2) with a brute-force method to O(N log N) with the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87×10^6 particles took 5.79 hours and executed 1.2×10^13 force evaluations in total. We obtained a sustained computing speed of 21.8 Gflops and a cost per Gflops of $41.6/Gflops, which is two and a half times better than the previous record in 2006.
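The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given above. The log base and the unit prefactor in the N log N estimate are assumptions made for illustration (a real tree code has a prefactor that depends on its opening criterion), and all variable names are hypothetical.

```python
import math

# Figures taken from the abstract.
n_particles = 2.87e6       # N
wall_hours = 5.79          # run time
force_evals = 1.2e13       # total force evaluations
sustained_gflops = 21.8    # sustained speed
system_cost_usd = 900.0    # "roughly $900"

# Scaling comparison, assuming log base 2 and a prefactor of 1.
work_brute = n_particles ** 2
work_tree = n_particles * math.log2(n_particles)
reduction = work_brute / work_tree

# Average rate of force evaluations and implied flops per evaluation.
evals_per_second = force_evals / (wall_hours * 3600.0)
flops_per_eval = sustained_gflops * 1e9 / evals_per_second

# Price/performance from the rounded system cost.
cost_per_gflops = system_cost_usd / sustained_gflops

print(f"N^2 / (N log2 N)       ~ {reduction:.3g}")
print(f"force evaluations per s ~ {evals_per_second:.3g}")
print(f"flops per evaluation    ~ {flops_per_eval:.1f}")
print(f"USD per Gflops          ~ {cost_per_gflops:.1f}")
```

With a cost of exactly $900 the price/performance comes out near $41.3/Gflops, slightly below the quoted $41.6, which is consistent with the system cost being "roughly" rather than exactly $900. The implied figure of roughly 38 floating-point operations per force evaluation matches the count per pairwise interaction that is commonly assumed in N-body benchmarking.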

## CUDAEASY – a GPU Accelerated Cosmological Lattice Program

December 8th, 2009

Abstract:

This paper presents, to the author’s knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA’s Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available under the GNU General Public License.

The program is freely available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/

(Jani Sainio. “CUDAEASY – a GPU Accelerated Cosmological Lattice Program”. submitted to Computer Physics Communications (under review). November 2009.)

## Path to Petascale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters

June 4th, 2009

The goal of this workshop, held at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, was to help computational scientists in the geosciences, computational chemistry, and astronomy and astrophysics communities take full advantage of emerging high-performance computing resources based on computational accelerators, such as clusters with GPUs and Cell processors.

Slides are now available online and cover a wide range of topics including

- GPU and Cell programming tutorials
- GPU and Cell technology
- Accelerator programming, clusters, frameworks and building blocks such as sparse matrix-vector products, tree-based algorithms and in particular accelerator integration into large-scale established code bases
- Case studies and posters from geosciences, computational chemistry and astronomy/astrophysics such as the simulation of earthquakes, molecular dynamics, solar radiation, tsunamis, weather predictions, climate modeling and n-body systems as well as Monte-Carlo, Euler, Navier-Stokes and Lattice-Boltzmann type of simulations

(National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign: Path to Petascale workshop presentations, organized by Wen-mei Hwu, Volodymyr Kindratenko, Robert Wilhelmson, Todd Martínez and Robert Brunner)

## Path to Petascale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters

April 13th, 2009

The workshop “Path to PetaScale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters” was held at the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, on April 2-3, 2009. This workshop, sponsored by NSF and NCSA, helped computational scientists in the geosciences, computational chemistry, and astronomy and astrophysics communities take full advantage of emerging high-performance computing accelerators such as GPUs and Cell processors. The workshop consisted of joint technology sessions on the first day and domain-specific sessions on the second day. Slides from the presentations are now online.

## Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests

July 27th, 2007

Abstract: “In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, with each node equipped with 2 modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with shared time-step algorithm in this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a globular cluster of 1024K particles. In comparing with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves a more than twice higher measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.” (Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh. Graphic-Card Cluster for Astrophysics (GraCCA) — Performance Tests. submitted to New Astronomy, 20 July, 2007.)
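As a quick reading aid for the numbers in this abstract, the following sketch derives the per-card peak and the fraction of theoretical peak that the measured speed represents. It assumes that "1024K" means 1024 × 1024 particles, and the variable names are hypothetical. Note that the fraction of peak computed here is a different quantity from the 90% parallel efficiency quoted in the abstract, which describes how well the speed scales across cards.

```python
# Figures taken from the abstract.
nodes = 16
gpus_per_node = 2
peak_tflops = 16.2       # theoretical performance of the cluster
measured_tflops = 7.1    # measured direct N-body performance

gpu_count = nodes * gpus_per_node
peak_per_gpu_tflops = peak_tflops / gpu_count
fraction_of_peak = measured_tflops / peak_tflops

# Assumption: "1024K" is read as 1024 * 1024 particles.
n_particles = 1024 * 1024
# Pairwise interactions per shared time step in a direct-summation code.
pairs_per_step = n_particles * (n_particles - 1)

print(f"GPUs in cluster        : {gpu_count}")
print(f"peak per GPU (TFLOPS)  : {peak_per_gpu_tflops:.3f}")
print(f"fraction of peak       : {fraction_of_peak:.1%}")
print(f"interactions per step  : {pairs_per_step:.3e}")
```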