Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications

December 8th, 2009

Abstract:

GPUs have recently evolved into very fast parallel coprocessors capable of executing general-purpose computations extremely efficiently. At the same time, the evolution of multicore CPUs has continued, and today’s CPUs have 4-8 cores. These two trends, however, have followed independent paths in the sense that we are aware of very few works that consider both devices cooperating to solve general computations. In this paper we investigate the coordinated use of CPU and GPU to improve the efficiency of applications even further than using either device independently. We use the Anthill runtime environment, a data-flow oriented framework in which applications are decomposed into a set of event-driven filters, where for each event, the runtime system can use either GPU or CPU for its processing. For evaluation, we use a histopathology application that uses image analysis techniques to classify tumor images for neuroblastoma prognosis. Our experimental environment includes dual and octa-core machines augmented with GPUs, and we evaluate our approach’s performance for standalone and distributed executions. Our experiments show that a pure GPU optimization of the application achieved a factor of 15 to 49 times improvement over the single-core CPU version, depending on the versions of the CPUs and GPUs. We also show that the execution time can be further reduced by a factor of about 2 by using our runtime system, which effectively choreographs the execution to run cooperatively both on the GPU and on a single CPU core. We improve on that by adding more cores, all of which were previously neglected or used ineffectively. In addition, the evaluation on a distributed environment has shown near linear scalability to multiple hosts.

(George Teodoro, Rafael Sachetto, Olcay Sertel, Metin Gurcan, Wagner Meira Jr., Umit Catalyurek, and Renato Ferreira. Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications. IEEE Cluster 2009. New Orleans, LA, USA.)
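The idea in the abstract above, that a runtime may hand each event to either the GPU or the CPU, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is not Anthill's API or its actual scheduling policy, the earliest-finish-time rule is swapped in for illustration, and the function name `schedule`, the device names, and the per-event costs are all hypothetical.

```python
def schedule(events, devices):
    """Greedy earliest-finish-time assignment (illustrative, not Anthill's policy).

    events:  list of dicts mapping device name -> processing time
             (hypothetical time units chosen for this example)
    devices: device names allowed to process events
    Returns (makespan, assignment), where assignment[i] is the device
    chosen for events[i].
    """
    free_at = {d: 0.0 for d in devices}  # time at which each device becomes idle
    assignment = []
    for cost in events:
        # Pick the device that would finish this event soonest;
        # ties go to the device listed first.
        best = min(devices, key=lambda d: free_at[d] + cost[d])
        free_at[best] += cost[best]
        assignment.append(best)
    return max(free_at.values()), assignment


# Hypothetical workload: 30 events where the GPU is much faster,
# followed by 10 events where the CPU is nearly competitive.
EVENTS = [{"gpu": 1.0, "cpu": 15.0}] * 30 + [{"gpu": 1.5, "cpu": 2.0}] * 10

if __name__ == "__main__":
    gpu_only, _ = schedule(EVENTS, ["gpu"])
    both, chosen = schedule(EVENTS, ["gpu", "cpu"])
    print("GPU only makespan:", gpu_only)
    print("GPU + CPU makespan:", both, "CPU events:", chosen.count("cpu"))
```

With this rule an event is sent to the CPU only when that finishes sooner than queuing it on the GPU, so adding the CPU cannot lengthen the simulated schedule. A naive "first idle device takes the next event" rule does not have that property when one device is much slower.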

NVIDIA Tesla GPUs to Communicate Faster Over Mellanox InfiniBand Networks

November 25th, 2009

From a press release:

New Software Solution Reduces Dependency on CPUs

PORTLAND, Ore. - SC09 - Nov. 18, 2009 - NVIDIA Corporation (Nasdaq: NVDA) and Mellanox Technologies Ltd. today introduced new software that will increase cluster application performance by as much as 30% by reducing the latency that occurs when communicating over Mellanox InfiniBand to servers equipped with NVIDIA Tesla™ GPUs.

The system architecture of a GPU-CPU server requires the CPU to initiate and manage memory transfers between the GPU and the InfiniBand network. The new software solution will enable Tesla GPUs to transfer data to pinned system memory that a Mellanox InfiniBand solution is able to read and transmit over the network. The result is increased overall system performance and efficiency.

“NVIDIA Tesla GPUs deliver large increases in performance across each node in a cluster, but in our production runs on TSUBAME 1 we have found that network communication becomes a bottleneck when using multiple GPUs,” said Prof. Satoshi Matsuoka from Tokyo Institute of Technology. “Reducing the dependency on the CPU by using InfiniBand will deliver a major boost in performance in high performance GPU clusters, thanks to the work of NVIDIA and Mellanox, and will further enhance the architectural advances we will make in TSUBAME2.0.”

nHD – A Full Godunov Euler Equations Solver with CUDA/MPI

September 22nd, 2009

nHD is a multi-GPU 2nd order full Godunov three-dimensional uniform-mesh Euler equations solver for calorically ideal, compressible gas. nHD uses CUDA C with MPI and runs on a cluster of multi-GPU machines to accelerate computational hydrodynamics calculations.

The full Godunov method solves the hydrodynamic equations by discretizing the fluid and calculating the nonlinear evolution of the discretized distribution, using the analytic solutions of Riemann problems. As a result, it can resolve arbitrarily severe shockwaves with minimal artificial dissipation and oscillation, which makes it indispensable for simulations of compressible fluids where shockwaves and vacuums arise naturally from fluid motion.

nHD is open source under a BSD-style license. The code is available, and comments are welcome, at http://code.google.com/p/astro-attic/wiki/NHDIntroduction.
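As a rough illustration of the Godunov idea described above (solve a Riemann problem at every cell interface and use its solution to compute the flux), here is a minimal Python sketch. It is not nHD code: it swaps the 3D Euler equations for the scalar 1D inviscid Burgers equation, is first order rather than second order, and runs on the CPU. The function names, grid size, periodic boundaries, and CFL number are all assumptions made for the example.

```python
def godunov_flux(ul, ur):
    """Flux at an interface from the exact Riemann solution of
    Burgers' equation u_t + (u^2 / 2)_x = 0 with left/right states ul, ur."""
    def f(u):
        return 0.5 * u * u

    if ul > ur:
        # Shock wave moving at the Rankine-Hugoniot speed (ul + ur) / 2.
        speed = 0.5 * (ul + ur)
        return f(ul) if speed > 0 else f(ur)
    # Rarefaction fan (ul <= ur).
    if ul > 0:
        return f(ul)   # whole fan moves right
    if ur < 0:
        return f(ur)   # whole fan moves left
    return 0.0         # fan straddles the interface: sonic point, u = 0


def godunov_step(u, dt, dx):
    """Advance cell averages by one time step with periodic boundaries."""
    n = len(u)
    # flux[i] is the flux through the interface between cell i and cell i + 1.
    flux = [godunov_flux(u[i], u[(i + 1) % n]) for i in range(n)]
    # flux[i - 1] wraps to flux[-1] for i == 0, which is the periodic neighbor.
    return [u[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n)]


def square_pulse(n):
    """Initial condition assumed for the example: u = 1 on the second quarter."""
    return [1.0 if n // 4 <= i < n // 2 else 0.0 for i in range(n)]


def simulate(u0, steps, cfl=0.5, length=1.0):
    """Run a fixed number of steps; the time step comes from the CFL condition
    using the largest initial speed (assumes u0 is not identically zero)."""
    dx = length / len(u0)
    dt = cfl * dx / max(abs(v) for v in u0)
    u = list(u0)
    for _ in range(steps):
        u = godunov_step(u, dt, dx)
    return u


if __name__ == "__main__":
    u0 = square_pulse(100)
    u = simulate(u0, steps=100)
    print("mass before:", sum(u0), "mass after:", sum(u))
```

Because every interface flux is added to one cell and subtracted from its neighbor, the total of the cell averages is conserved up to rounding. That conservation is what lets the scheme place shocks correctly without added artificial viscosity.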

Path to Petascale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters

June 4th, 2009

The goal of this workshop, held at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, was to help computational scientists in the geosciences, computational chemistry, and astronomy and astrophysics communities take full advantage of emerging high-performance computing resources based on computational accelerators, such as clusters with GPUs and Cell processors.

Slides are now available online and cover a wide range of topics, including:

  • GPU and Cell programming tutorials
  • GPU and Cell technology
  • Accelerator programming, clusters, frameworks and building blocks such as sparse matrix-vector products, tree-based algorithms and in particular accelerator integration into large-scale established code bases
  • Case studies and posters from geosciences, computational chemistry and astronomy/astrophysics such as the simulation of earthquakes, molecular dynamics, solar radiation, tsunamis, weather predictions, climate modeling and n-body systems as well as Monte-Carlo, Euler, Navier-Stokes and Lattice-Boltzmann type of simulations

(National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign: Path to Petascale workshop presentations, organized by Wen-mei Hwu, Volodymyr Kindratenko, Robert Wilhelmson, Todd Martínez and Robert Brunner)

Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters

November 18th, 2008

Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster. (Adapting a message-driven parallel application to GPU-accelerated clusters. James C. Phillips, John E. Stone, and Klaus Schulten. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing.)

Exploring weak scalability for FEM calculations on a GPU-enhanced cluster

November 15th, 2007

The first part of this paper by Göddeke et al. surveys co-processor approaches for commodity-based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. Thus, even the addition of low-end, out-of-date GPUs leads to improvements in both performance- and power-related metrics. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)

Using GPUs to Improve Multigrid Solver Performance on a Cluster

November 15th, 2007

This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. We demonstrate the viability of our approach by using commodity graphics processors (GPUs), chosen for their excellent price/performance ratio, as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration, we compare different ways of increasing the performance of a conventional, commodity-based cluster: increasing the number of nodes, replacing nodes with a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
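The mixed precision, iterative refinement technique mentioned in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the low-precision inner solver here is Gaussian elimination with every result rounded to single precision (standing in for the GPU multigrid solver), single precision is emulated with the standard `struct` module, and the function names and the 3x3 test system are made up for the example.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting, rounding
    every intermediate result to single precision (assumes a is nonsingular)."""
    n = len(b)
    # Augmented matrix [a | b], stored in emulated single precision.
    m = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            factor = f32(m[i][k] / m[k][k])
            for j in range(k, n + 1):
                m[i][j] = f32(m[i][j] - f32(factor * m[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(m[i][j] * x[j]))
        x[i] = f32(s / m[i][i])
    return x


def residual(a, x, b):
    """Residual b - a x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, iterations=5):
    """Iterative refinement: low-precision solves, high-precision residual and update."""
    x = solve_low_precision(a, b)
    for _ in range(iterations):
        r = residual(a, x, b)            # accurate residual (double precision)
        d = solve_low_precision(a, r)    # cheap correction (single precision)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    a = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    for name, x in (("single only", solve_low_precision(a, b)), ("refined", refine(a, b))):
        print(name, max(abs(r) for r in residual(a, x, b)))
```

The design point is that only the residual and the solution update need high precision. The expensive solve runs in low precision, and for a well-conditioned system each refinement step gains roughly as many digits as the low-precision solver delivers.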

GPU Cluster for High Performance Computing

August 19th, 2004

This paper by Fan et al. at Stony Brook University presents the use of a cluster of commodity GPUs for high performance scientific computing. As an example application, they have developed a parallel flow simulation using the lattice Boltzmann model (LBM) on a GPU cluster and have simulated the dispersion of airborne contaminants in the Times Square area of New York City. Using 30 GPU nodes, their simulation can compute a 480 x 400 x 80 LBM in 0.31 seconds per step, which is 4.6 times faster than their previous CPU cluster implementation. Besides the LBM, the paper also discusses other potential applications of the GPU cluster, such as cellular automata, PDE solvers, and FEM. (Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover, GPU Cluster for High Performance Computing, To Appear in Proceedings of the ACM/IEEE SuperComputing 2004 (SC’04), November, 2004)
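To put the figures quoted above in perspective, the following short Python calculation derives throughput numbers from them. Only the lattice size, the time per step, the node count, and the 4.6x factor come from the text. Reading "4.6 times faster" as a ratio of time per step is an assumption, and the derived values are back-of-the-envelope estimates rather than results reported by the paper.

```python
# Figures quoted in the summary above.
nx, ny, nz = 480, 400, 80      # LBM lattice dimensions
seconds_per_step = 0.31        # GPU cluster time per simulation step
gpu_nodes = 30                 # number of GPU nodes used
speedup = 4.6                  # GPU cluster vs. previous CPU cluster

# Derived quantities (estimates, not reported results).
cells = nx * ny * nz                               # 15,360,000 lattice cells
updates_per_second = cells / seconds_per_step      # roughly 49.5 million cells/s
updates_per_node = updates_per_second / gpu_nodes  # roughly 1.65 million cells/s per node
# Assumes "4.6 times faster" means the CPU cluster needs 4.6x the time per step.
cpu_seconds_per_step = seconds_per_step * speedup  # roughly 1.43 s per step

if __name__ == "__main__":
    print(f"lattice cells:          {cells:,}")
    print(f"cell updates/s (total): {updates_per_second:,.0f}")
    print(f"cell updates/s (node):  {updates_per_node:,.0f}")
    print(f"implied CPU s/step:     {cpu_seconds_per_step:.3f}")
```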
