PARALUTION Release 1.0

April 13th, 2015

PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi.

The 1.0 version of the PARALUTION library supports multi-node and multi-GPU configurations via MPI. All iterative solvers support global operations (i.e. distributed matrices and vectors), and all preconditioners can be used in a block-Jacobi fashion locally on each node/GPU. In addition, the software provides a global (fully distributed) Pair-Wise AMG solver.
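
To illustrate what "block-Jacobi fashion" means, here is a minimal sketch in plain Python. It is not PARALUTION code and does not use its API; the function names `solve_dense` and `block_jacobi_apply`, the small test matrix, and the split into two blocks are all assumptions made for illustration. The idea is that the preconditioner keeps only the diagonal blocks of the matrix, so each block can be solved independently on its own node or GPU with no communication.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def block_jacobi_apply(A, r, block_sizes):
    """Return z = M^{-1} r, where M keeps only the diagonal blocks of A."""
    z = []
    start = 0
    for size in block_sizes:
        stop = start + size
        block = [row[start:stop] for row in A[start:stop]]
        # Each block solve is independent of the others, so in a
        # distributed setting it could run locally on one node/GPU.
        z.extend(solve_dense(block, r[start:stop]))
        start = stop
    return z


# Hypothetical 4x4 system split into two 2x2 blocks (two "devices").
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
r = [1.0, 2.0, 3.0, 4.0]
z = block_jacobi_apply(A, r, [2, 2])
print(z)
```

The coupling entries between the two blocks (here `A[1][2]` and `A[2][1]`) are simply ignored by the preconditioner; the outer iterative solver, which does operate on the full distributed matrix, accounts for them.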

Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing

April 29th, 2014

Abstract:

Spectral unmixing is an important task in remotely sensed hyperspectral data exploitation. The linear mixture model has been widely used to unmix hyperspectral images by identifying a set of pure spectral signatures, called endmembers, and estimating their respective abundances in each pixel of the scene. Several algorithms have been proposed in the recent literature to automatically identify endmembers, even if the original hyperspectral scene does not contain any pure signatures. A popular strategy for endmember identification in highly mixed hyperspectral scenes has been the minimum volume simplex analysis (MVSA), known to be a computationally very expensive algorithm. This algorithm calculates the minimum volume enclosing simplex, as opposed to other algorithms that perform maximum simplex volume analysis (MSVA). The high computational complexity of MVSA, together with its very high memory requirements, has limited its adoption in the hyperspectral imaging community. In this paper we develop several optimizations to the MVSA algorithm. The main computational task of MVSA is the solution of a quadratic optimization problem with equality and inequality constraints, with the inequality constraints being in the order of the number of pixels multiplied by the number of endmembers. As a result, storing and computing the inequality constraint matrix is highly inefficient. The first optimization presented in this paper uses algebra operations in order to reduce the memory requirements of the algorithm. In the second optimization, we use graphics processing units (GPUs) to effectively solve (in parallel) the quadratic optimization problem involved in the computation of MVSA. In the third optimization, we extend the single GPU implementation to a multi-GPU one, developing a hybrid strategy that distributes the computation while taking advantage of GPU accelerators at each node. 
The presented optimizations are tested in different analysis scenarios (using both synthetic and real hyperspectral data) and shown to provide state-of-the-art results from the viewpoint of unmixing accuracy and computational performance. The speedup achieved using the full GPU cluster compared to the CPU implementation is tenfold on a real hyperspectral image.

(A. Agathos, J. Li, D. Petcu and A. Plaza: “Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing”. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, accepted for publication, 2014. [PDF])
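
For readers unfamiliar with the terminology, the linear mixture model and the minimum-volume problem mentioned in the abstract are commonly written as below. The notation is an assumption chosen for this note, not taken from the paper: p is the number of endmembers, N the number of pixels, M the endmember matrix, and Y the p-by-N data matrix after dimensionality reduction. The paper's exact formulation may differ.

```latex
% Linear mixture model for pixel i, with abundance vector \alpha_i
\[
  \mathbf{y}_i = \mathbf{M}\,\boldsymbol{\alpha}_i + \mathbf{n}_i,
  \qquad \boldsymbol{\alpha}_i \ge \mathbf{0},
  \qquad \mathbf{1}_p^{T}\boldsymbol{\alpha}_i = 1,
  \qquad i = 1,\dots,N .
\]

% Minimum-volume simplex fitting, written in terms of Q = M^{-1}
\[
  \max_{\mathbf{Q}} \; \log\lvert\det \mathbf{Q}\rvert
  \quad \text{subject to} \quad
  \mathbf{Q}\mathbf{Y} \succeq \mathbf{0},
  \qquad \mathbf{1}_p^{T}\mathbf{Q}\mathbf{Y} = \mathbf{1}_N^{T} .
\]
```

Maximizing log|det Q| is equivalent to minimizing the volume of the simplex spanned by the columns of M. The elementwise constraint QY ⪰ 0 contributes p × N scalar inequalities, which is the "number of pixels multiplied by the number of endmembers" that the abstract identifies as the memory bottleneck.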

A Multi-GPU Programming Library for Real-Time Applications

January 11th, 2013


We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.

(Sebastian Schaetz and Martin Uecker: “A Multi-GPU Programming Library for Real-Time Applications”, Algorithms and Architectures for Parallel Processing (2012): 114-128. [DOI] [ARXIV])
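
The speed-up figures quoted in the abstract can be put in perspective by converting them to parallel efficiency, i.e. speed-up divided by the number of GPUs. The short sketch below does only that arithmetic; the helper name `parallel_efficiency` is made up for this note, and the two data points are the ones reported above.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved with num_gpus devices."""
    return speedup / num_gpus


# Speed-ups reported in the abstract: about 1.7 on 2 GPUs, 2.1 on 4 GPUs.
reported = {2: 1.7, 4: 2.1}

for num_gpus, speedup in reported.items():
    eff = parallel_efficiency(speedup, num_gpus)
    print(f"{num_gpus} GPUs: speed-up {speedup}, efficiency {eff:.3f}")
```

This gives an efficiency of 0.85 on 2 GPUs and roughly 0.525 on 4 GPUs, so scaling flattens noticeably beyond two devices for this reconstruction algorithm.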

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows existing GPU-accelerated applications to run on remote GPUs within a cluster (by sharing and/or aggregating GPUs) with negligible overhead. The new version is available free of charge, along with examples, manuals and additional information.

Webinar: Scaling Soft Matter Physics to a Thousand GPUs and Beyond

September 22nd, 2012

The “Ludwig” lattice Boltzmann fluid dynamics application is a versatile code capable of simulating the hydrodynamics of complex fluids (e.g. mixtures, surfactants, liquid crystals, particle suspensions) to allow cutting-edge research into condensed matter physics. On October 3, Dr. Alan Gray from the University of Edinburgh presents a webinar on his team’s experiences in scaling the application on the Cray XK6 hybrid supercomputer. The presentation will cover:

  • A review of excellent scaling up to O(1000) GPUs
  • Steps taken to maximize performance on each GPU
  • Designing the communication to allow efficient usage of many GPUs in parallel, including the overlapping of several stages using CUDA stream functionality
  • Advanced functionality, including how to include colloidal particles in the simulation while minimizing data transfer overheads

Register at

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework, which will include many new features as well as improved performance. This new version, in the works for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta testing program has been started. See the complete text at

CUDA 4.0 Release Aims to Make Parallel Programming Easier

March 1st, 2011

Today NVIDIA announced the upcoming 4.0 release of CUDA. While most of the major CUDA releases accompanied a new GPU architecture, 4.0 is a software-only release, but that doesn’t mean there aren’t a lot of new features. With this release, NVIDIA is aiming to lower the barrier to entry to parallel programming on GPUs, with new features including easier multi-GPU programming, a unified virtual memory address space, the powerful Thrust C++ template library, and automatic performance analysis in the Visual Profiler tool. Full details follow in the quoted press release below.


OpenFOAM SpeedIT plugin 1.1 released

November 27th, 2010

The OpenFOAM SpeedIT plugin version 1.1 has been released under the GPL License. The most important new features are:

  • Multi-GPU support
  • Tested on Fermi architecture (GTX460 and Tesla C2050)
  • Automated submission of the domain to the GPU cards (using decomposePar from OpenFOAM)
  • Optimized submission of computational tasks to the best GPU card in the system for any number of computational threads
  • Plugin picks the most powerful GPU card for single-thread cases

The OpenFOAM SpeedIT plugin is available at

rCUDA™ 2.0 released

November 27th, 2010

A new major release of rCUDA™ (Remote CUDA), the Open Source package that allows performing CUDA calls to remote GPUs, has been released. The major improvements included in the new version are:

  • Updated API to 3.1
  • Server now uses Runtime API when possible (CUDA >= 3.1 required)
  • Introduced support for the most common CUBLAS routines
  • Fixed some bugs
  • Added AF_UNIX sockets support to enhance performance on local executions
  • Added some load balancing capabilities to the server
  • General performance improvements
  • Officially added Fermi support

Further information is available from the rCUDA™ webpages and