High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

June 23rd, 2010

Abstract:

We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing very optimized implementation in C language and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.
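The mesh-coloring idea mentioned in the abstract can be illustrated with a minimal sketch. It is plain sequential Python, not the authors' CUDA code; the function names `color_elements` and `assemble_by_color` and the greedy strategy are assumptions made for illustration. Elements that share a node never receive the same color, so all elements of one color could add their contributions to the global array concurrently without write conflicts.

```python
def color_elements(elements):
    """Greedy coloring: two elements that share a node never get the same color.

    `elements` is a list of node-index tuples, one tuple per element.
    """
    node_colors = {}  # node id -> colors already used by elements touching it
    colors = []
    for nodes in elements:
        used = set()
        for n in nodes:
            used |= node_colors.get(n, set())
        c = 0
        while c in used:  # smallest color not used by any neighbor
            c += 1
        colors.append(c)
        for n in nodes:
            node_colors.setdefault(n, set()).add(c)
    return colors


def assemble_by_color(elements, colors, local_values, num_nodes):
    """Sum element-local values into a global per-node array, color by color."""
    global_sum = [0.0] * num_nodes
    for c in range(max(colors) + 1):
        # Elements of one color touch disjoint nodes, so this inner loop
        # is the part that could run in parallel without atomic updates.
        for e, nodes in enumerate(elements):
            if colors[e] != c:
                continue
            for k, n in enumerate(nodes):
                global_sum[n] += local_values[e][k]
    return global_sum


# Hypothetical 1D chain of four two-node elements.
elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
colors = color_elements(elements)
nodal_sum = assemble_by_color(elements, colors, [[1.0, 1.0]] * 4, 5)
```

On a GPU, each color would typically map to its own kernel launch. The cost is one launch per color, in exchange for conflict-free summation.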

(Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke and David Michéa: “High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster”, accepted for publication in: Journal of Computational Physics, Jun. 2010. PDF preprint. DOI link.)

Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid

March 3rd, 2010

Abstract:

We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.
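For readers unfamiliar with cyclic reduction, the following is a minimal sequential sketch of the algorithm, not the paper's GPU implementation. The function name `cyclic_reduction` and the recursive formulation are assumptions chosen for brevity. The sketch does no pivoting, so it assumes a well-behaved (for example diagonally dominant) system. Each level eliminates the even-indexed unknowns, solves the half-size system for the odd-indexed ones, and then back-substitutes.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].

    Expects a[0] == 0 and c[-1] == 0. No pivoting is performed.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: fold the even-indexed neighbor equations into each
    # odd-indexed equation. Every iteration is independent of the others,
    # which is what makes this step attractive for parallel hardware.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        new_a = -alpha * a[i - 1]
        new_b = b[i] - alpha * c[i - 1]
        new_c = 0.0
        new_d = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            new_b -= gamma * a[i + 1]
            new_c = -gamma * c[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # The odd-indexed unknowns satisfy a tridiagonal system of half the size.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: each even-indexed unknown follows from its neighbors.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Hypothetical example: 1D Poisson matrix (-1, 2, -1) with 7 unknowns.
n = 7
x = cyclic_reduction([0.0] + [-1.0] * (n - 1), [2.0] * n,
                     [-1.0] * (n - 1) + [0.0], [0.0] * (n - 1) + [8.0])
```

The recursion depth is logarithmic in the number of unknowns, and the work inside each level has no loop-carried dependencies. The usual sequential Thomas algorithm does not have that property.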

(Dominik Göddeke and Robert Strzodka: “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid”, accepted in: IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, Mar. 2010. Link.)

CheCUDA: A Checkpoint/restart Tool for CUDA Applications

November 25th, 2009

In this paper, Takizawa et al. have presented a tool named CheCUDA that is designed to checkpoint CUDA applications. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks basic CUDA driver API calls in order to record the GPU status changes on the main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restart, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications required especially on heterogeneous GPU cluster systems. This paper also shows the timing overhead for checkpointing.

(Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi, CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, to appear in Proceedings of the Tenth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) 2009, Workshop on Ultra Performance and Dependable Acceleration Systems).

OpenCurrent v1.0 released: CUDA-accelerated PDE solver

September 28th, 2009

OpenCurrent is an open source C++ library for solving Partial Differential Equations (PDEs) over regular grids using the CUDA platform from NVIDIA. It breaks down a PDE into 3 basic objects, “Grids”, “Solvers,” and “Equations.” “Grid” data structures efficiently implement regular 1D, 2D, and 3D arrays in both double and single precision. Grids support operations like computing linear combinations, managing host-device memory transfers, interpolating values at non-grid points, and performing array-wide reductions. “Solvers” use these data structures to calculate terms arising from discretizations of PDEs, such as finite-difference based advection and diffusion schemes, and a multigrid solver for Poisson equations. These computational building blocks can be assembled into complete “Equation” objects that solve time-dependent PDEs. One such Equation solver is an incompressible Navier-Stokes solver that uses a second-order Boussinesq model. This equation solver is fully validated, and has been used to study Rayleigh-Benard convection under a variety of different regimes. Benchmarks show it to perform about 8 times faster than an equivalent Fortran code running on an 8-core Xeon.

nHD – A Full Godunov Euler Equations Solver with CUDA/MPI

September 22nd, 2009

nHD is a multi-GPU 2nd order full Godunov three-dimensional uniform-mesh Euler equations solver for calorically ideal, compressible gas. nHD uses CUDA C with MPI and runs on a cluster of multi-GPU machines to accelerate computational hydrodynamics calculations.

The full Godunov method solves the hydrodynamic equations by discretizing the fluid and calculating the nonlinear evolution of the discretized distribution, using the analytic solutions of Riemann problems. It can therefore resolve arbitrarily severe shockwaves with minimal artificial dissipation and oscillation, which makes it well suited to simulations of compressible fluids in which shockwaves and vacuums arise naturally from the fluid motion.
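The core idea, advancing cell averages with fluxes taken from exact Riemann solutions at the cell interfaces, can be sketched on a much simpler problem. The example below is an illustration only and is not taken from nHD: it assumes the scalar inviscid Burgers equation u_t + (u^2/2)_x = 0 in place of the Euler equations, first-order accuracy, and periodic boundaries. The names `godunov_flux` and `godunov_step` are hypothetical.

```python
import math


def godunov_flux(ul, ur):
    """Exact Riemann flux for Burgers' equation at an interface.

    `ul` and `ur` are the states to the left and right of the interface.
    """
    if ul > ur:
        # Shock: it moves with speed (ul + ur) / 2, so take the upwind state.
        speed = 0.5 * (ul + ur)
        return 0.5 * ul * ul if speed > 0.0 else 0.5 * ur * ur
    # Rarefaction fan.
    if ul > 0.0:
        return 0.5 * ul * ul
    if ur < 0.0:
        return 0.5 * ur * ur
    return 0.0  # the fan straddles the interface, where u = 0


def godunov_step(u, dt, dx):
    """Advance the cell averages `u` by one time step (periodic domain)."""
    n = len(u)
    # flux[i] is the flux through the interface between cell i-1 and cell i;
    # u[-1] wraps around, which gives the periodic boundary.
    flux = [godunov_flux(u[i - 1], u[i]) for i in range(n)]
    return [u[i] - dt / dx * (flux[(i + 1) % n] - flux[i]) for i in range(n)]


# Hypothetical setup: a sine profile that steepens into a shock.
n = 50
dx = 1.0 / n
u = [1.0 + 0.5 * math.sin(2.0 * math.pi * (i + 0.5) * dx) for i in range(n)]
dt = 0.5 * dx / max(abs(v) for v in u)  # CFL number 0.5
for _ in range(40):
    u = godunov_step(u, dt, dx)
```

Because every interface flux appears once with each sign, the update conserves the sum of the cell averages up to rounding error. For the Euler equations the same structure applies, but the Riemann problem at each interface involves three wave families and is considerably more expensive to solve.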

nHD is open source under a BSD-style license. The code is available at http://code.google.com/p/astro-attic/wiki/NHDIntroduction, and comments are welcome.

SPEEDUP and PPAM Conference Tutorials Available

September 16th, 2009

Slides from two full-day conference tutorials are now available:

Both tutorials present basics and advanced topics of scientific computing on GPUs, including ready-to-use GPU libraries, GPU architecture, case studies and many hands-on examples.

VMD 1.8.7 release supports CUDA on MacOS X, Linux, Windows

August 31st, 2009

VMD is a molecular visualization program for building, displaying, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. One of the key advancements included in VMD 1.8.7 is support for GPU-accelerated visualization and analysis, based on CUDA. VMD uses CUDA to accelerate several of its most computationally demanding algorithms, with additional modules planned for GPU acceleration in upcoming releases. Typical GPU acceleration factors for the algorithms in VMD are: electrostatics 22x to 44x, implicit ligand sampling 20x to 30x, molecular orbital calculation 100x to 120x.

CECAM Workshop: Algorithmic Re-Engineering for Modern Non-Conventional Processing Units

July 16th, 2009

This 3-day workshop, to be held September 30, 2009 to October 2, 2009 in Lugano, Switzerland, will explore the use of GPUs, Cell BE processors, FPGAs and special-purpose hardware for large-scale scientific computing.

Similar to the 1990s, when the revolution in mainstream scientific software development, viz. going from structured programming to object-oriented programming, was the greatest change in the past 3 decades, we are at the beginning of a totally new revolution in terms of algorithmic engineering.

We are now at a hardware/software technology inflection point due to large-scale parallelism, including parallel operations on the contents of a single register, pipelining, memory pre-fetch, single-core simultaneous multithreading (“hyper-threading”) and superscalar instruction issue. Some new processor options have emerged, such as the Cell BE processor and GPUs, which are extremely aggressive in their use of parallelism while retaining general-purpose programmability. Other processors, like FPGAs and special-purpose hardware, are also based on chip parallelism and are emerging because they can be specialized very efficiently for unique tasks.

The main objective will be to demonstrate how some of the most challenging problems in computational sciences have already been ported to modern non-conventional computing platforms, presented by speakers coming from a wide computational community (physicists, chemists, engineers, computer scientists, biologists) active in the fields of algorithm re-engineering for the new architectures.

Workshop: Massively-Parallel Computational Biology on GPUs

March 31st, 2009

This workshop is organized in conjunction with INFORMATIK 2009, the 39th annual meeting of the Gesellschaft für Informatik e.V. (GI). The one-day event will take place in Lübeck, Germany, during INFORMATIK 2009 (September 28th – October 2nd, 2009). The workshop will include tutorials, refereed sessions, invited talks, and an open discussion session on future developments. Submissions are encouraged in all areas of Massively-Parallel Computational Biology on GPUs (Graphics Processing Units), including but not limited to:

  • Parallel and massively-parallel Programming and Algorithms
  • Algorithmic Aspects of Computational Biology
  • Applications and Implementations on GPUs

The submission deadline is April 26, 2009. For more information visit the BioGPU 2009 Website.

SIAM CSE’09: Scientific Computing on Emerging Many-Core Architectures

March 17th, 2009

Slides are now available for the minisymposium “Scientific Computing on Emerging Many-Core architectures”, held in conjunction with the SIAM Conference on Computational Science and Engineering 2009 (SIAM CSE’09, Miami, Florida). The minisymposium, organised by Mike Giles, Dominik Göddeke and Stefan Turek, focused on opportunities and challenges for scientific computing on novel many-core architectures, in particular IBM’s Cell processor and GPUs from NVIDIA, AMD and Intel. The talks covered a range of application areas, including the development of libraries and other tools to simplify the programming of many-core processors. (Minisymposium: Scientific Computing on Emerging Many-Core architectures)
