We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.
(Sebastian Schaetz and Martin Uecker: “A Multi-GPU Programming Library for Real-Time Applications”, Algorithms and Architectures for Parallel Processing (2012): 114-128. [DOI] [ARXIV])
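To put the reported speed-ups in perspective, the short sketch below computes parallel efficiency (speed-up divided by the number of GPUs). The function name `parallel_efficiency` is illustrative and not part of MGPU; the only inputs are the two figures quoted in the abstract.

```python
def parallel_efficiency(speedup, num_gpus):
    """Speed-up divided by device count; 1.0 would mean perfect linear scaling."""
    return speedup / num_gpus


# Speed-ups quoted in the abstract above (GPU count -> speed-up).
reported = {2: 1.7, 4: 2.1}

for gpus, speedup in reported.items():
    eff = parallel_efficiency(speedup, gpus)
    print(f"{gpus} GPUs: speed-up {speedup}, efficiency {eff:.3f}")
```

By this measure the two-GPU configuration scales considerably better than the four-GPU one, which is typical when communication or serial work starts to dominate.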
This article proposes to address, in a tutorial style, the benefits of using Open Computing Language (OpenCL) as a quick way to allow programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we will show how multiplatform kernels can be developed straightforwardly using OpenCL to perform computationally intensive low-density parity-check (LDPC) decoding, targeting them to run on a large set of worldwide disseminated multicore architectures, such as x86 general-purpose multicore central processing units (CPUs) and graphics processing units (GPUs). Moreover, devices with different architectures can be orchestrated to cooperatively execute these signal processing applications programmed in OpenCL. Experimental evaluation of the parallel kernels programmed with the OpenCL framework shows that high performance can be achieved for distinct parallel computing architectures with low programming effort.
The complete source code developed and instructions for compiling and executing the program are available at http://www.co.it.pt/ldpcopencl for signal processing programmers who wish to engage with more advanced features supported by OpenCL.
(G. Falcao, V. Silva, L. Sousa and J. Andrade: “Portable LDPC Decoding on Multicores Using OpenCL [Applications Corner]”, IEEE Signal Processing Magazine 29:4(81-109), July 2012. [DOI])
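As a rough illustration of why parity-check decoding parallelizes well, the sketch below evaluates every parity check of a code independently, which is the kind of per-check work a kernel could assign to one work-item. It is not taken from the article's OpenCL source: the toy matrix `CHECKS` (a small Hamming-style code rather than a real LDPC code) and the function name `syndrome` are assumptions made for the example, and a full decoder would also iterate message updates between check and bit nodes.

```python
# Hypothetical toy parity-check matrix, stored sparsely:
# each entry lists the bit positions that take part in one check.
CHECKS = [
    [0, 2, 4, 6],
    [1, 2, 5, 6],
    [3, 4, 5, 6],
]


def syndrome(bits, checks=CHECKS):
    """Return one parity value per check; all zeros means every check is satisfied."""
    # Each check reads only its own bits, so the checks can run in parallel.
    return [sum(bits[i] for i in check) % 2 for check in checks]


codeword = [1, 1, 1, 0, 0, 0, 0]   # satisfies all three checks
corrupted = list(codeword)
corrupted[2] ^= 1                   # flip one bit to simulate a channel error

print(syndrome(codeword))
print(syndrome(corrupted))
```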
Abstract: “The widespread usage of the Discrete Wavelet Transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank (FBS) and Lifting (LS), and have always concluded that Lifting is the most efficient option. However, there is no such study on streaming processors such as modern Graphic Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current generation GPUs. In our experiments, the actual FBS gains range between 10% and 140%, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future generation GPUs.” (Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting. Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, and Francisco Tirado. IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 299-310, March 2008.)
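For readers unfamiliar with the two schemes, here is a minimal one-level sketch using the Haar wavelet, chosen only because it is the simplest case; the paper itself covers 2D transforms and longer filters. The function names are illustrative. Both routines produce the same coefficients: the filter bank computes every output directly from the input, while lifting uses fewer arithmetic operations but runs as a chain of dependent steps (predict, then update, then scale).

```python
import math


def haar_filter_bank(x):
    """One DWT level as filtering plus downsampling (x must have even length)."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i + 1] - x[i]) for i in range(0, len(x), 2)]
    return low, high


def haar_lifting(x):
    """The same level as lifting steps: split, predict, update, scale."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]          # predict odd from even
    approx = [e + d / 2 for e, d in zip(even, detail)]   # update even with detail
    r = math.sqrt(2.0)
    return [a * r for a in approx], [d / r for d in detail]


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 1.0, 3.0]
print(haar_filter_bank(signal))
print(haar_lifting(signal))
```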
This paper by Govindaraju et al. describes a high-performance FFT algorithm on GPUs. The algorithm is highly tuned for GPUs using memory optimizations. It further improves performance using pipelining strategies. In practice, it is able to achieve 4x higher computational performance on a $500 NVIDIA GPU than optimized single precision FFT algorithms on high-end CPUs costing $1500. (“Efficient memory model for scientific algorithms on graphics processors”, Naga Govindaraju, Scott Larsen, Jim Gray and Dinesh Manocha, UNC Tech. Report 2006)
Alexey Smirnov and Tzi-cker Chiueh from Stony Brook University have published a technical report describing an implementation of a FIR filter on a GPU. The results of the performance evaluation using a GeForce 6600 video card and a Pentium 4-HT 3.2 GHz-based PC indicate that the GPU implementation is better than the SSE-optimized CPU implementation for certain input parameters. (FIR on GPU project. Report: An Implementation of a FIR Filter on a GPU (warning: postscript). Technical Report, Experimental Computer Systems Lab, Stony Brook University, 2005.)
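For reference, a direct-form FIR filter computes each output sample as a weighted sum of the most recent inputs, y[n] = sum over k of h[k] * x[n-k]. The sketch below is a plain CPU version written for clarity, not the report's GPU implementation; the function name `fir_filter` and the zero-padding of samples before the start of the signal are assumptions. Because each output depends only on the inputs, the outer loop is the part that can be spread across parallel processors.

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], treating x[n] as 0 for n < 0."""
    y = []
    for n in range(len(x)):          # every output sample is independent
        acc = 0.0
        for k in range(len(h)):      # one multiply-add per filter tap
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


# Two-tap moving average applied to a short ramp.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
```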
The latest versions of Cycling ’74’s MAX/MSP/Jitter software packages provide a visual programming environment for new media with applications in GPU-based stream processing, real-time video processing, volume visualization, and generic n-dimensional data analysis and signal processing. Jitter supports cascaded GLSL/Cg/ARB/NV shader programs with a streamlined render-to-texture interface, allowing fast prototyping of complex shader effects to be processed in a generic data flow network. (Jitter v1.5 Upgrade Info. Cycling ’74.)
From the abstract: In recent years, the development of programmable graphics pipelines has placed the power of parallel computation in the hands of consumers. Systems developers are now paying attention to the general purpose computational ability of these graphics processor units, or GPUs, and are using them in novel ways. This paper examines using pixel shaders for executing audio algorithms. We compare GPU performance to CPU performance, discuss problems encountered, and suggest new directions for supporting the needs of the audio community. Source code is also available. (“Audio and the Graphics Processing Unit”, by Sean Whalen)
This paper by Jansen et al. describes how to utilize current commodity graphics hardware to perform Fourier volume rendering directly on the GPU. The paper presents a novel implementation of the Fast Fourier Transform: This Split-Stream-FFT maps the recursive structure of the FFT to the GPU in an efficient way. Additionally, high-quality resampling within the frequency domain is discussed. The implementation enables visualization of large volumetric data sets at interactive frame rates on a mid-range computer system. (Fourier Volume Rendering on the GPU Using a Split-Stream FFT)
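The recursive structure mentioned above can be seen in a textbook radix-2 decimation-in-time FFT, sketched below. This is the generic Cooley-Tukey recursion, not the Split-Stream variant from the paper, and it assumes the input length is a power of two. Each level splits the input into even and odd samples, transforms both halves, and combines them with twiddle factors.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])              # transform of even-indexed samples
    odd = fft(x[1::2])               # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


# A unit impulse has a flat spectrum.
print(fft([1, 0, 0, 0]))
```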
This website presents a fast GPU algorithm to perform the discrete wavelet transform featuring flexible boundary extension schemes, flexible wavelet kernels, Cg shader implementation, and high precision. The algorithm was developed by the Graphics Team at The Chinese University of Hong Kong. The beauty of the method is that both forward and inverse wavelet transforms are unified using position-dependent filtering and convolution and an indirect addressing technique. The software is open source and free for any commercial or academic use, and is currently available both as an unofficial GPU extension to the Jasper JPEG2000 software and as a standalone DWTGPU C++ class with a demo program. (Jianqing Wang, Tien-Tsin Wong, Pheng-Ann Heng and Chi-Sing Leung. The Discrete Wavelet Transform on a GPU.)