CfP: Numerical Computations with GPUs

September 22nd, 2013

Numerical Computations with GPUs, to be published by Springer, will contain a collection of articles on core numerical methods adapted for Graphics Processing Units (GPUs). Classical numerical methods (solution of linear equations, FFT, etc.) are at the core of many scientific and engineering computations. In recent years substantial efforts were undertaken to adapt these methods for recently emerged GPU-based systems. The book is envisioned as a consolidation of such work into a single volume covering widely used methods and techniques. Each chapter will provide mathematical background, parallel algorithm, and implementation details leading to reusable, adaptable, and scalable code fragments. Each chapter will be accompanied with a basic CUDA or OpenCL source code that can be used by the readers as a starting point for adaptation in their applications. The book will serve as a GPU implementation manual for many numerical algorithms providing valuable insights into parallelization strategies for GPUs as well as ready-to-use code fragments with a broad appeal to both developers and researchers interested in GPU computing.

Authors interested in contributing to this volume are asked to submit a short proposal via EasyChair ( by October 15, 2013. Authors of the accepted/invited chapters are expected to write and submit to the editor completed chapters by January 31, 2014. For more details see full solicitation ( or contact the Editor at

Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone

September 22nd, 2013


Feature detection and extraction are essential in computer vision applications such as image matching and object recognition. The Scale-Invariant Feature Transform (SIFT) algorithm is one of the most robust approaches to detect and extract distinctive invariant features from images. However, high computational complexity makes it difficult to apply the SIFT algorithm to mobile applications. Recent developments in mobile processors have enabled heterogeneous computing on mobile devices, such as smartphones and tablets. In this paper, we present an OpenCL-based implementation of the SIFT algorithm on a smartphone, taking advantage of the mobile GPU. We carefully analyze the SIFT workloads and identify the parallelism. We implemented major steps of the SIFT algorithm using both serial C++ code and OpenCL kernels targeting mobile processors, to compare the performance of different workflows. Based on the profiling results, we partition the SIFT algorithm between the CPU and GPU in a way that best exploits the parallelism and minimizes the buffer transferring time to achieve better performance. The experimental results show that we are able to achieve 8.5 FPS for keypoints detection and 19 FPS for descriptor generation without reducing the number and the quality of the keypoints. Moreover, the heterogeneous implementation can reduce energy consumption by 41% compared to an optimized CPU-only implementation.

(Guohui Wang, Blaine Rister, and Joseph R. Cavallaro: “Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013, [PDF])

High Throughput Low Latency LDPC Decoding on GPU for SDR Systems

September 22nd, 2013


In this paper, we present a high throughput and low latency LDPC (low-density parity-check) decoder implementation on GPUs (graphics processing units). The existing GPU-based LDPC decoder implementations suffer from low throughput and long latency, which prevent them from being used in practical SDR (software-defined radio) systems. To overcome this problem, we present optimization techniques for a parallel LDPC decoder including algorithm optimization, fully coalesced memory access, asynchronous data transfer and multi-stream concurrent kernel execution for modern GPU architectures. Experimental results demonstrate that the proposed LDPC decoder achieves 316Mbps (at 10 iterations) peak throughput on a single GPU. The decoding latency, which is much lower than that of the state of the art, varies from 0.207ms to 1.266ms for different throughput requirements from 62.5Mbps to 304.16Mbps. When using four GPUs concurrently, we achieve an aggregate peak throughput of 1.25Gbps (at 10 iterations).

(Guohui Wang, Michael Wu, Bei Yin, and Joseph R. Cavallaro: “High Throughput Low Latency LDPC Decoding on GPU for SDR Systems”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013. [PDF])

Fast JPEG codec from Fastvideo

September 22nd, 2013

Fastvideo have released their JPEG codec for NVIDIA GPUs. Peak performance of the codec reaches 6 GBytes per second and higher for images loadedfrom host RAM. For instance, a full-color 4K image with resolution 3840 x 2160 can be compressed by 10 times in merely 6 milliseconds on NVIDIA GeForce GTX Titan. More information:

PPAM 2013 CUDA Course Notes

September 9th, 2013

All course material from the full-day CUDA tutorial at PPAM 2013 are now available at The tutorial was held on Sunday, Sep. 8 2013 in Warsaw, Poland.

CfP: 22nd High Performance Computing Symposium

September 9th, 2013

The 22nd High Performance Computing Symposium (April 13-16, 2014 in Tampa, FL, USA) is devoted to the impact of high performance computing and communications on computer simulations. Topics of interest include:

  • GPU for general purpose computations (GPGPU)
  • Hybrid system modeling and simulation
  • Tools and environments for coupling parallel codes
  • Parallel algorithms and architectures
  • High performance software tools

Submission deadline for full papers: October 25, 2013. More information can be found at

New CUDA Technical Blog: Constant Cache vs. Read-only Cache

September 9th, 2013

This blog takes a closer look at constant cache and read-only cache. It highlights the differences between the two memory types and what circumstances they perform best in. Read the whole story here.

A unified sparse matrix data format for modern processors with wide SIMD units

September 4th, 2013


Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK, and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (“catch-all”) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.

(M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: “A unified sparse matrix data format for modern processors with wide SIMD units.” Submitted, July 2013 [preprint])

Acceleware offers upcoming CUDA and C++ AMP training events

September 2nd, 2013

Acceleware, partnering with NVIDIA, if offering a four-day CUDA training course designed for programmers in the oil and gas industry who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms in oil and gas such as filtering and FFTs will be used and profiled in the examples. The case study on day 4 will focus on efficient implementation of 3D convolution, which is highly applicable to reverse time migration. A background in oil and gas is not necessary. Register here.

Acceleware training program manager Kelly Goss is hosting a 4 day C++ AMP training course in Boston, MA September 10 -13. The course will cover an overview of GPU computing, C++ 11 Lambda syntax, data-parallel architectures and the C++ AMP programming model. Every day consists of lectures and hands-on exercises (laptops for student’s use provided for the duration of the course).  Click here for full details and registration.

Towards Performance-Portable, Scalable and Convenient Linear Algebra

August 16th, 2013


The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.
We address both, the issue of programmability and portable performance, in this work: On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator for device-specific OpenCL kernels at runtime, which is supplemented by an auto-tuning framework for portable performance as well as with work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the respective vendor tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.

(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar’) 2013 [PDF].)

Page 10 of 109« First...89101112...203040...Last »