## CUDA Course Sept 2-5, 2014, Houston

July 22nd, 2014

Offered in partnership with NVIDIA, this four-day CUDA training course, held in Houston, is designed for programmers in the oil and gas industry who want to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms such as filtering and FFTs will be used and profiled in the examples. The case study on day 4 focuses on the efficient implementation of a finite difference algorithm that is highly applicable to reverse time migration. However, a background in oil and gas is not necessary. For more information and to view a copy of the course outline, please visit: http://acceleware.com/training/987

## rCUDA 4.2 version available

June 19th, 2014

A new version of the rCUDA middleware has been released (version 4.2). In addition to fixing some minor bugs, the new release provides support for:

- CUDA 6.0 Runtime API
- New stream management
- cuSPARSE libraries

The rCUDA middleware lets you seamlessly use, within your cluster, GPUs that are installed in computing nodes other than the one executing the CUDA application, without having to modify your program. Please visit www.rcuda.net for more details about the rCUDA technology.

## On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications

June 8th, 2014

Abstract:

Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.

(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, *“On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”*. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])

## Improving Cache Locality for GPU-based Volume Rendering

June 8th, 2014

Abstract:

We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has a hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory references according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst-case rendering performance compared with the implementations provided in the CUDA and OpenCL software development kits.

(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: *“Improving Cache Locality for GPU-based Volume Rendering”*. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])
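To make the link between thread block shape and warp access stride concrete, here is a minimal Python sketch. It is not the authors' method: it assumes a plain row-major 2D slice (real texture memory uses a different, hardware-specific layout), and the function name `warp_footprint` and the parameter `row_pitch` are hypothetical. It only shows how the block width decides how many rows a 32-thread warp touches.

```python
WARP_SIZE = 32  # threads per warp, as stated in the abstract


def warp_footprint(block_w, block_h, row_pitch, warp_id=0):
    """Return (distinct_rows, max_stride) for the elements one warp reads.

    Assumptions (illustrative only): threads are linearized x-fastest inside
    the block, as CUDA does, and thread (x, y) reads element
    y * row_pitch + x of a row-major slice with block_w <= row_pitch.
    """
    if (block_w * block_h) % WARP_SIZE != 0:
        raise ValueError("block size must be a multiple of the warp size")
    first = warp_id * WARP_SIZE
    addrs = []
    for t in range(first, first + WARP_SIZE):
        x, y = t % block_w, t // block_w
        addrs.append(y * row_pitch + x)
    rows = {a // row_pitch for a in addrs}
    # Largest jump between the addresses of consecutive threads in the warp.
    max_stride = max(b - a for a, b in zip(addrs, addrs[1:]))
    return len(rows), max_stride


if __name__ == "__main__":
    for shape in [(32, 8), (8, 32), (1, 32)]:
        print(shape, warp_footprint(*shape, row_pitch=512))
```

Under these assumptions a 32-wide block keeps a warp on a single row with unit stride, while narrower blocks spread the same warp over several rows, which is the kind of effect a viewpoint-dependent choice of block width and height tries to control.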

## PARALUTION 0.7.0 released

May 27th, 2014

PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:

- Windows support – full Windows support for all backends (CUDA, OpenCL, OpenMP)
- Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
- Direct (dense) solvers (for very small problems)
- (Restricted) Additive Schwarz preconditioners
- MATLAB/Octave plug-in

To avoid OpenMP overhead for small problems, the library computes in serial if the size of the matrix/vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified to simplify cross-platform compilation.
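The size-based serial fallback can be illustrated with a small Python sketch. This is not PARALUTION code: the function `dot`, the constant `SERIAL_THRESHOLD` and its value are hypothetical, and Python threads stand in for OpenMP only to show the dispatch logic (they do not give a real speedup for this kind of loop).

```python
from concurrent.futures import ThreadPoolExecutor

SERIAL_THRESHOLD = 10_000  # hypothetical cutoff, not PARALUTION's actual value


def dot(x, y, workers=4, threshold=SERIAL_THRESHOLD):
    """Dot product that skips the parallel machinery for small inputs.

    Returns (value, path) where path records which branch was taken.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("vectors must have the same length")
    if n < threshold or workers < 2:
        # Small problem: the cost of starting workers would dominate.
        return sum(a * b for a, b in zip(x, y)), "serial"

    chunk = (n + workers - 1) // workers  # ceiling division

    def partial(start):
        stop = min(start + chunk, n)
        return sum(x[i] * y[i] for i in range(start, stop))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial, range(0, n, chunk)))
    return sum(parts), "parallel"


if __name__ == "__main__":
    print(dot([1, 2, 3], [4, 5, 6]))
    big = list(range(20_000))
    print(dot(big, big)[1])
```

The design point is that the threshold check costs a single comparison, so large problems pay almost nothing for it while small ones avoid the parallel start-up cost entirely.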

## GPUPROF 0.3 Released

May 15th, 2014

A new version of the GPU-profiler for the CUDA software stack is available at www.lab4241.com. The GPU-profiler delivers ‘inside’-kernel performance information mapped to the C++ source code in a simple, intuitive way, similar to well-known CPU profilers such as Quantify or Valgrind. The new version, GPUPROF 0.3 (beta), includes improved stability, refined memory tracing, temporal memory analysis, and CUDA driver API call tracing.

## Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing

April 29th, 2014

Abstract:

Spectral unmixing is an important task in remotely sensed hyperspectral data exploitation. The linear mixture model has been widely used to unmix hyperspectral images by identifying a set of pure spectral signatures, called endmembers, and estimating their respective abundances in each pixel of the scene. Several algorithms have been proposed in the recent literature to automatically identify endmembers, even if the original hyperspectral scene does not contain any pure signatures. A popular strategy for endmember identification in highly mixed hyperspectral scenes has been the minimum volume simplex analysis (MVSA), known to be a computationally very expensive algorithm. This algorithm calculates the minimum volume enclosing simplex, as opposed to other algorithms that perform maximum simplex volume analysis (MSVA). The high computational complexity of MVSA, together with its very high memory requirements, has limited its adoption in the hyperspectral imaging community. In this paper we develop several optimizations to the MVSA algorithm. The main computational task of MVSA is the solution of a quadratic optimization problem with equality and inequality constraints, with the inequality constraints being in the order of the number of pixels multiplied by the number of endmembers. As a result, storing and computing the inequality constraint matrix is highly inefficient. The first optimization presented in this paper uses algebra operations in order to reduce the memory requirements of the algorithm. In the second optimization, we use graphics processing units (GPUs) to effectively solve (in parallel) the quadratic optimization problem involved in the computation of MVSA. In the third optimization, we extend the single GPU implementation to a multi-GPU one, developing a hybrid strategy that distributes the computation while taking advantage of GPU accelerators at each node. 
The presented optimizations are tested in different analysis scenarios (using both synthetic and real hyperspectral data) and shown to provide state-of-the-art results from the viewpoint of unmixing accuracy and computational performance. The speedup achieved using the full GPU cluster compared to the CPU implementation is tenfold on a real hyperspectral image.

(A. Agathos, J. Li, D. Petcu and A. Plaza: *“Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing”*. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, accepted for publication, 2014. [PDF])
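For readers unfamiliar with the linear mixture model mentioned in the abstract, its standard textbook form is sketched below. The symbols are our own notation and are not taken from the paper: $\mathbf{y}_i$ is the observed spectrum of pixel $i$, the columns of $\mathbf{M}$ are the $p$ endmember signatures, $\mathbf{a}_i$ holds the abundances, and $\mathbf{n}_i$ is noise.

$$
\mathbf{y}_i = \mathbf{M}\,\mathbf{a}_i + \mathbf{n}_i,
\qquad a_{j,i} \ge 0 \;\; (j = 1,\dots,p),
\qquad \sum_{j=1}^{p} a_{j,i} = 1 .
$$

With $n$ pixels, the non-negativity conditions alone contribute $n \cdot p$ inequalities, which matches the abstract's remark that the number of inequality constraints is on the order of the number of pixels multiplied by the number of endmembers. As a purely hypothetical illustration, a $350 \times 350$ image with $p = 10$ endmembers would give $122{,}500 \times 10 = 1{,}225{,}000$ inequalities.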

## New rCUDA 4.1 version available

March 26th, 2014

A new version of the rCUDA middleware has been released (version 4.1). In addition to fixing some bugs related to asynchronous memory transfers, the new release provides support for:

- CUDA 5.5 Runtime API
- Mellanox Connect-IB network adapters
- Dynamic Parallelism
- cuFFT and cuBLAS libraries

The rCUDA middleware lets you seamlessly use, within your cluster, GPUs that are installed in computing nodes other than the one executing the CUDA application, without having to modify or recompile your program. Please visit www.rcuda.net for more details about the rCUDA technology.

## High-Performance Image Synthesis for Radio Interferometry

March 26th, 2014

Abstract:

A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore, there are distortions that have to be corrected. In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A multi-threaded CPU-based component prepares data for the GPU, ensuring maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system work in parallel. The imaging tool is designed such that disk I/O and work on the CPU and GPUs are done concurrently. Test cases show that the imaging tool performs nearly 100× faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.

(Daniel Muscat: *“High-Performance Image Synthesis for Radio Interferometry”*. Preprint, 2014. [arXiv])

## cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems

March 26th, 2014

Abstract:

Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently of the others, a massive parallelization of tau-leaping can lead to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, by fully exploiting Nvidia’s Fermi GPU architecture. We show how a considerable computational speedup is achieved on the GPU by partitioning the execution of tau-leaping into multiple separate phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even point directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae.

(Nobile M.S., Cazzaniga P., Besozzi D., Pescini D., Mauri G.: *“cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems”*. PLoS ONE 9(3): e91963. [DOI])
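As a rough illustration of why tau-leaping parallelizes so well, the Python sketch below runs many independent tau-leaping simulations of a single decay reaction A → ∅. It is not cuTauLeaping: the reaction, the fixed leap size `tau` (practical implementations typically choose it adaptively), and all function names are assumptions made for this example.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson sample with Knuth's method (fine for small means)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def tau_leap_decay(x0, c, tau, steps, rng):
    """Simulate A -> (nothing) with rate constant c using fixed-size leaps."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        propensity = c * x                    # a(x) for a first-order reaction
        fired = poisson(propensity * tau, rng)
        x -= min(fired, x)                    # never go below zero molecules
        trajectory.append(x)
    return trajectory


def run_ensemble(n_runs, x0, c, tau, steps, seed=0):
    """Run independent simulations; each one could map to its own GPU thread."""
    finals = []
    for i in range(n_runs):
        rng = random.Random(seed + i)         # separate stream per simulation
        finals.append(tau_leap_decay(x0, c, tau, steps, rng)[-1])
    return finals


if __name__ == "__main__":
    finals = run_ensemble(n_runs=200, x0=100, c=0.5, tau=0.1, steps=20)
    print(sum(finals) / len(finals), 100 * math.exp(-0.5 * 0.1 * 20))
```

Each call to `tau_leap_decay` touches only its own state and random stream, so the loop in `run_ensemble` has no dependencies between iterations. That independence is what the abstract relies on when it assigns simulations to parallel GPU threads.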