A Multi-GPU Programming Library for Real-Time Applications

January 11th, 2013


We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.

(Sebastian Schaetz and Martin Uecker: “A Multi-GPU Programming Library for Real-Time Applications”, Algorithms and Architectures for Parallel Processing (2012): 114-128.)
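
To put the speed-ups quoted in the abstract above into perspective, they can be converted to parallel efficiency (speed-up divided by the number of GPUs). The following is a minimal sketch, not part of MGPU; the function name `parallel_efficiency` and the `reported` table are illustrative assumptions, and the only inputs are the two figures from the abstract.

```python
def parallel_efficiency(speedup: float, n_gpus: int) -> float:
    """Return speed-up per device; 1.0 would mean perfect linear scaling."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return speedup / n_gpus


# Speed-ups quoted in the abstract, keyed by GPU count.
reported = {2: 1.7, 4: 2.1}

# Efficiency for each reported configuration.
efficiencies = {n: parallel_efficiency(s, n) for n, s in reported.items()}

for n, eff in efficiencies.items():
    print(f"{n} GPUs: speed-up {reported[n]:.1f}, efficiency {eff:.1%}")
```

By this measure the 2-GPU run reaches about 85% efficiency and the 4-GPU run about 52.5%, so the scaling described in the abstract is clearly sub-linear beyond two devices.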

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows existing GPU-accelerated applications to run on remote GPUs within a cluster (by sharing GPUs, aggregating them, or both) with negligible overhead. The new version is available free of charge at www.rCUDA.net, along with examples, manuals and additional information.

Webinar: Scaling Soft Matter Physics to a Thousand GPUs and Beyond

September 22nd, 2012

The “Ludwig” lattice Boltzmann fluid dynamics application is a versatile code capable of simulating the hydrodynamics of complex fluids (e.g. mixtures, surfactants, liquid crystals, particle suspensions) to allow cutting-edge research into condensed matter physics. On October 3, Dr. Alan Gray from the University of Edinburgh will present a webinar on his team’s experiences in scaling the application on the Cray XK6 hybrid supercomputer. The presentation will cover:

  • A review of excellent scaling up to O(1000) GPUs
  • Steps taken to maximize performance on each GPU
  • Designing the communication to allow efficient usage of many GPUs in parallel, including the overlapping of several stages using CUDA stream functionality
  • Advanced functionality, including how to include colloidal particles in the simulation while minimizing data transfer overheads

Register at http://www.gputechconf.com/page/gtc-express-webinar.html.

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework, which will include many new features as well as improved performance. This new version, in development for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta testing program has been started. See the complete text at http://www.rcuda.net/index.php/news/19-new-revolutionary-version-of-rcuda-to-be-launched.html.

CUDA 4.0 Release Aims to Make Parallel Programming Easier

March 1st, 2011

Today NVIDIA announced the upcoming 4.0 release of CUDA.  While most of the major CUDA releases accompanied a new GPU architecture, 4.0 is a software-only release, but that doesn’t mean there aren’t a lot of new features.  With this release, NVIDIA is aiming to lower the barrier to entry to parallel programming on GPUs, with new features including easier multi-GPU programming, a unified virtual memory address space, the powerful Thrust C++ template library, and automatic performance analysis in the Visual Profiler tool.  Full details follow in the quoted press release below.


OpenFOAM SpeedIT plugin 1.1 released

November 27th, 2010

The OpenFOAM SpeedIT plugin version 1.1 has been released under the GPL License. The most important new features are:

  • Multi-GPU support
  • Tested on Fermi architecture (GTX460 and Tesla C2050)
  • Automated submission of the domain to the GPU cards (using decomposePar from OpenFOAM)
  • Optimized submission of computational tasks to the best GPU card in the system for any number of computational threads
  • The plugin picks the most powerful GPU card in single-threaded cases

The OpenFOAM SpeedIT plugin is available at http://speedit.vratis.com.

rCUDA™ 2.0 released

November 27th, 2010

A new major release of rCUDA™ (Remote CUDA), the Open Source package that allows CUDA calls to be executed on remote GPUs, is now available. The major improvements included in the new version are:

  • Updated API to 3.1
  • Server now uses Runtime API when possible (CUDA >= 3.1 required)
  • Introduced support for the most common CUBLAS routines
  • Fixed some bugs
  • Added AF_UNIX sockets support to enhance performance on local executions
  • Added some load balancing capabilities to the server
  • General performance improvements
  • Officially added Fermi support

Further information is available from the rCUDA™ webpages http://www.gap.upv.es/rCUDA and http://www.hpca.uji.es/rCUDA.
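
The AF_UNIX item in the list above refers to Unix domain sockets, which let a client and a server on the same machine exchange data without going through the network stack. The following is a minimal sketch of that mechanism using only the Python standard library on a Unix-like system. It is not rCUDA code: the helper names `recv_exact` and `local_round_trip` and the payload are hypothetical, and the "server" simply echoes what it receives.

```python
import socket


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket, looping over short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def local_round_trip(payload: bytes) -> bytes:
    """Send payload over a connected AF_UNIX socket pair and echo it back.

    Both ends live in one thread, so the payload must be small enough to
    fit in the kernel socket buffer; a real client and server would run
    in separate processes.
    """
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with client, server:
        client.sendall(payload)                     # client -> server
        request = recv_exact(server, len(payload))  # server reads the request
        server.sendall(request)                     # server echoes it back
        return recv_exact(client, len(payload))     # client reads the reply


if __name__ == "__main__":
    # Hypothetical request body, used only for illustration.
    print(local_round_trip(b"hypothetical-request"))
```

For a long-lived service, the server would instead bind an AF_UNIX socket to a filesystem path and accept connections from separate client processes.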