Message Passing on GPUs and Data-Parallel Architectures

March 11th, 2009

Abstract:

This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the “DCGN” API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully dynamic communication.

(Jeff A. Stuart and John D. Owens, Message Passing on Data-Parallel Architectures, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium)
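
The sleep-based polling idea mentioned in the abstract above can be illustrated with a small host-side sketch. This is not the DCGN API and does not run on a GPU: the class name `PolledMailbox`, its methods, and the `poll_interval` and `timeout` parameters are all hypothetical, chosen only to show a receiver that checks a per-rank queue, sleeps briefly when it is empty, and checks again.

```python
import threading
import time
from collections import deque


class PolledMailbox:
    """Hypothetical per-rank mailbox (illustration only, not the DCGN API)."""

    def __init__(self, num_ranks):
        # One FIFO queue of pending messages per destination rank.
        self._queues = [deque() for _ in range(num_ranks)]
        self._lock = threading.Lock()

    def send(self, dst, payload):
        # Store the message; the receiver will find it on its next poll.
        with self._lock:
            self._queues[dst].append(payload)

    def recv(self, rank, poll_interval=0.001, timeout=1.0):
        # Poll the queue, sleeping between checks instead of busy-waiting.
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                if self._queues[rank]:
                    return self._queues[rank].popleft()
            if time.monotonic() >= deadline:
                raise TimeoutError(f"rank {rank}: no message within {timeout}s")
            time.sleep(poll_interval)


def demo():
    """Rank 1 starts waiting before rank 0 sends, so it must poll at least once."""
    box = PolledMailbox(num_ranks=2)
    received = []

    def receiver():
        received.append(box.recv(rank=1))

    worker = threading.Thread(target=receiver)
    worker.start()
    time.sleep(0.01)  # let the receiver begin polling first
    box.send(dst=1, payload="hello from rank 0")
    worker.join()
    return received
```

The trade-off in a scheme like this is the polling interval: a short sleep lowers message latency but wakes the poller more often, while a long sleep does the opposite.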

First GPU-Based Heterogeneous Cluster Joins the Top 500

November 19th, 2008

This is a GPGPU event a long time in the making. Since the advent of general-purpose APIs and compilers for GPUs, it has been predicted that GPUs would one day be used to help boost the performance of supercomputers. With the latest release of the Top500 Supercomputer list, that prediction has become a reality.

More details from an NVIDIA press release:

NVIDIA Tesla Powers 29th Most Powerful Supercomputer in the World

Tesla S1070

SC08—AUSTIN, TX—NOVEMBER 17, 2008—The Tokyo Institute of Technology (Tokyo Tech) today announced a collaboration with NVIDIA to use NVIDIA® Tesla™ GPUs to boost the computational horsepower of its TSUBAME supercomputer. Through the addition of 170 Tesla S1070 1U systems, the TSUBAME supercomputer now delivers nearly 170 TFLOPS of theoretical peak performance, as well as 77.48 TFLOPS of measured Linpack performance, placing it, again, amongst the top ranks in the world’s Top 500 Supercomputers.

“Tokyo Tech is constantly investigating future computing platforms and it had become clear to us that to make the next major leap in performance, TSUBAME had to adopt GPU computing technologies,” said Satoshi Matsuoka, division director of the Global Scientific Information and Computing Center at Tokyo Tech. “In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude – a tremendous competitive boost for our scientists and engineers in reducing their time to solution.”

Speaking to the ease of implementation, Matsuoka continued,

“The entire upgrade was carried out in 1 week, and the TSUBAME supercomputer remained live throughout. This is an unprecedented feat in top-level supercomputing.”


NVIDIA Announces Availability of Tesla™ Personal Supercomputer

November 18th, 2008

From a press release:

NVIDIA Tesla Makes Personal SuperComputing A Reality

Tesla GPUs Enable Cluster Class Performance On The Desktop at 1/10th The Power

SC08—AUSTIN, TX—NOVEMBER 18 2008— Today, scientific research is carried out on supercomputing clusters, a shared resource that consumes hundreds of kilowatts of power and costs millions of dollars to build and maintain. As a result, researchers must fight for time on these resources, slowing their work and delaying results. NVIDIA and its worldwide partners today announced the availability of the GPU-based Tesla™ Personal Supercomputer, which delivers the equivalent computing power of a cluster, at 1/100th of the price and in a form factor of a standard desktop workstation.


Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters

November 18th, 2008

Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster. (Adapting a message-driven parallel application to GPU-accelerated clusters. James C. Phillips, John E. Stone, and Klaus Schulten. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Research web site)

NCSA to add 62 teraflops of compute power with new heterogeneous system

October 16th, 2008

The following is excerpted from an NVIDIA press release.

Installation has begun on a new computational resource at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. Lincoln will deliver peak performance of 62.3 teraflops and is designed to push the envelope in the use of heterogeneous processors for scientific computing. The system is expected to be online in October, bringing NCSA’s total computational resources to nearly 170 teraflops.

Lincoln will consist of 192 compute nodes (Dell PowerEdge 1950 III dual-socket nodes with quad-core Intel Harpertown 2.33GHz processors and 16GB of memory) and 96 NVIDIA Tesla S1070 accelerator units. Each Tesla unit provides 500 gigaflops of double-precision performance and 16GB of memory. Lincoln’s InfiniBand interconnect fabric will be linked to the interconnect fabric of Abe, the 89-teraflop cluster that is currently NCSA’s largest resource. This will enable certain applications to run across the entire complex, providing a peak “Abe Lincoln” performance of 152 teraflops.

(Press Release)

Exploring weak scalability for FEM calculations on a GPU-enhanced cluster

November 15th, 2007

The first part of this paper by Göddeke et al. surveys co-processor approaches for commodity based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity based cluster we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth that is the decisive performance factor in this scenario. Thus, even the addition of low-end, out of date GPUs leads to improvements in both performance- and power-related metrics. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)

Using GPUs to Improve Multigrid Solver Performance on a Cluster

November 15th, 2007

This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
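
The mixed precision, iterative refinement technique named in the abstract above can be sketched on a tiny dense system. This is a minimal illustration under stated assumptions, not the authors' solver: a small Gaussian elimination with every result rounded to single precision stands in for the single-precision GPU multigrid solver, and the function names (`f32`, `solve_low_precision`, `residual`, `refine`) are hypothetical. The residual and the solution update are kept in double precision, which is what lets the low-precision solver reach double-precision accuracy.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding each result to float32.

    Stand-in for a single-precision accelerator solver (an assumption made
    for illustration only).
    """
    n = len(b)
    # Augmented matrix [A | b], stored in single precision.
    M = [[f32(v) for v in row] + [f32(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = f32(M[r][col] / M[col][col])
            for c in range(col, n + 1):
                M[r][c] = f32(M[r][c] - f32(factor * M[col][c]))
    # Back substitution, also in single precision.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n]
        for c in range(r + 1, n):
            s = f32(s - f32(M[r][c] * x[c]))
        x[r] = f32(s / M[r][r])
    return x


def residual(A, x, b):
    """Residual r = b - A x, computed in double precision."""
    n = len(x)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(len(b))]


def refine(A, b, iterations=5):
    """Iterative refinement: low-precision solves, double-precision corrections."""
    x = solve_low_precision(A, b)
    for _ in range(iterations):
        r = residual(A, x, b)          # high precision
        d = solve_low_precision(A, r)  # low precision correction
        x = [xi + di for xi, di in zip(x, d)]  # high precision update
    return x
```

For a well-conditioned system, each refinement step shrinks the error by roughly the accuracy of the low-precision solve, so a handful of iterations is usually enough; ill-conditioned systems may converge slowly or not at all.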

CUDA Tutorial at Supercomputing 2007

August 22nd, 2007

On Sunday, November 11, 2007, at SC07 in Reno, NVIDIA will host a full-day tutorial on CUDA. In this tutorial, NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming and the execution and memory models at its heart, motivate the use of CUDA with many brief examples from different HPC domains, and discuss fundamental algorithmic building blocks in CUDA. The afternoon session will discuss advanced issues such as optimization and “tips & tricks”, and include real-world case studies from domain scientists using CUDA (VMD and NAMD molecular dynamics, and oil and gas).
Follow this link for more information: http://sc07.supercomputing.org/schedule/event_detail.php?evid=11034.

Supercomputing ’06 GPGPU Workshop Proceedings Posted

February 2nd, 2007

The proceedings of the workshop “General-Purpose GPU Computing: Practice And Experience” held at SuperComputing 2006 are now posted. The proceedings include PDFs of the workshop presentations and posters. (http://www.gpgpu.org/sc2006/workshop/)

GPGPU gets Wired: "Supercomputing’s Next Revolution"

November 10th, 2006

Wired magazine has published an article about GPGPU by Paul Tulloch called “Supercomputing’s Next Revolution”. The article discusses recent results from the Stanford Folding@Home project and the UNC Gamma Group, whose most recent results will be presented next week at Supercomputing 2006 in Tampa, Florida.
