cupSODA: a CUDA-Powered Simulator of Mass-action Kinetics

October 19th, 2013


The computational investigation of a biological system often requires the execution of a large number of simulations to analyze its dynamics, and to derive useful knowledge on its behavior under physiological and perturbed conditions. This analysis usually incurs very high computational costs when simulations are run on central processing units (CPUs), therefore demanding a shift to high-performance processors. In this work we present a simulator of biological systems, called cupSODA, which exploits the higher memory bandwidth and computational capability of graphics processing units (GPUs). This software executes parallel simulations of the dynamics of biological systems, by first deriving a set of ordinary differential equations from reaction-based mechanistic models defined according to mass-action kinetics, and then exploiting the numerical integration algorithm LSODA. We show that cupSODA can achieve a 112× speedup on GPUs with respect to equivalent executions of LSODA on CPUs.

(Nobile M.S., Besozzi D., Cazzaniga P., Mauri G., Pescini D.: “cupSODA: a CUDA-Powered Simulator of Mass-action Kinetics”, in 12th International Conference on Parallel Computing Technologies (PaCT), Lecture Notes in Computer Science, volume 7979, pp. 344-357, 2013. [DOI])
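
As a rough illustration of the first step described in the abstract (deriving ordinary differential equations from a reaction-based model under mass-action kinetics), the following Python sketch builds the right-hand side for a small hypothetical network. The reaction list, the rate constants and the function name are assumptions made for this example and are not taken from cupSODA; the numerical integration step (LSODA) is omitted.

```python
# Hypothetical reaction network (not from cupSODA).
# Each reaction is (rate constant, reactants, products), where reactants and
# products map a species index to its stoichiometric coefficient.
# Here: S0 + S1 -> S2 (k = 0.5) and S2 -> S0 + S1 (k = 0.1).
REACTIONS = [
    (0.5, {0: 1, 1: 1}, {2: 1}),
    (0.1, {2: 1}, {0: 1, 1: 1}),
]


def mass_action_rhs(x, reactions=REACTIONS):
    """Return dx/dt for concentrations x under mass-action kinetics."""
    dxdt = [0.0] * len(x)
    for k, reactants, products in reactions:
        # Reaction rate: k times the product of the reactant concentrations,
        # each raised to its stoichiometric coefficient.
        rate = k
        for species, coeff in reactants.items():
            rate *= x[species] ** coeff
        # Reactants are consumed and products are created at that rate.
        for species, coeff in reactants.items():
            dxdt[species] -= coeff * rate
        for species, coeff in products.items():
            dxdt[species] += coeff * rate
    return dxdt
```

A function of this shape is what an ODE solver would call repeatedly; running many simulations then amounts to evaluating it from different initial concentrations or rate constants, which is the kind of workload the abstract describes running in parallel.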

Performance benchmarks on CPU/GPU/Xeon Phi

October 19th, 2013

PARALUTION is a library for sparse iterative methods that can run on various parallel devices, including multi-core CPUs and GPUs. In the new 0.4.0 version, the library also provides a backend for Xeon Phi (MIC). With this new version, various performance benchmarks based on vector-vector routines, sparse matrix-vector multiplication and the CG method have been released for different backends (OpenMP, CUDA, OpenCL) on NVIDIA GPUs, AMD GPUs, CPUs and Xeon Phi. More information:

Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization

October 7th, 2013


The use of GPUs to accelerate general-purpose scientific and engineering applications is mainstream today, but their adoption in current high-performance computing clusters is impaired primarily by acquisition costs and power consumption. Therefore, the benefits of sharing a reduced number of GPUs among all the nodes of a cluster can be remarkable for many applications. This approach, usually referred to as remote GPU virtualization, aims at reducing the number of GPUs present in a cluster, while increasing their utilization rate. The performance of the interconnection network is key to achieving reasonable performance results by means of remote GPU virtualization. To this end, several networking technologies with throughput comparable to that of PCI Express have appeared recently. In this paper we analyze the influence of InfiniBand FDR on the performance of remote GPU virtualization, comparing its impact on a variety of GPU-accelerated applications with other networking technologies, such as InfiniBand QDR and Gigabit Ethernet. Given the severe limitations of freely available remote GPU virtualization solutions, the rCUDA framework is used as the case study for this analysis. Results show that the new FDR interconnect, featuring higher bandwidth than its predecessors, allows the reduction of the overhead of using GPUs remotely, thus making this approach even more appealing.

(Carlos Reano, Rafael Mayo, Enrique S. Quintana-Ortí, Federico Silla, José Duato and Antonio J. Pena: “Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization”. Proceedings of the IEEE Cluster 2013 Conference, Indianapolis, USA, September 2013. [PDF])

A High Performance Copy-Move Image Forgery Detection Scheme on GPU

October 7th, 2013


This paper presents an accelerated version of a copy-move image forgery detection scheme on graphics processing units (GPUs). With the replacement of analog cameras by their digital counterparts and the availability of powerful image processing software packages, authentication of digital images has gained importance in the recent past. This paper focuses on improving the performance of a copy-move forgery detection scheme based on radix sort by porting it to GPUs. This scheme has enhanced performance and is much more efficient than other methods, without degradation of detection results. The CPU version of the radix-sort-based detection scheme was developed in Matlab, and its critical sections were coded in C using Matlab’s MEX interface to get the maximum performance. The GPU version was developed using the Jacket GPU engine for Matlab and performs over twelve times faster than its optimized CPU variant. The contribution this paper makes to blind image forensics is the use of integral images for computing feature vectors of overlapping blocks in the block-matching technique and the acceleration of the entire copy-move forgery detection scheme on GPUs, not previously found in the literature.

(Jaideep Singh and Balasubramanian Raman, “A High Performance Copy-Move Image Forgery Detection Scheme on GPU”, Advances in Intelligent and Soft Computing Volume 131, 2012, pp 239-246, Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011). [DOI])
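
The abstract mentions integral images for computing feature vectors of overlapping blocks. A minimal sketch of the general idea (a summed-area table that yields any block sum in four lookups) is given below. It is not the authors' implementation: the function names are hypothetical, and using the plain block sum as the per-block feature is an assumption made only for illustration.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above and to the left of (y, x), inclusive.
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def block_sum(sat, top, left, size):
    """Sum of the size x size block whose top-left pixel is (top, left)."""
    bottom, right = top + size, left + size
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


def overlapping_block_sums(img, size):
    """Sums of all overlapping size x size blocks, one per top-left position."""
    sat = integral_image(img)
    h, w = len(img), len(img[0])
    return [[block_sum(sat, y, x, size) for x in range(w - size + 1)]
            for y in range(h - size + 1)]
```

Once the table is built, the cost of each block sum does not depend on the block size, which is why the approach suits block-matching schemes that slide a window over every pixel position.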

1000th GPGPU Post!

September 30th, 2013

The editors (Mark Harris and Dominik Goeddeke) were excited to notice that our previous post was the 1000th post on this site! GPGPU continues to grow as fast as ever, and we’re excited to see what it brings in the future.

We’d like to use this opportunity to remind you that this site relies on our readers to submit news. So please tell us about your projects, publications, or the work of others you think we should share.

Thanks for an amazing 11 years and hundreds of news submissions!

CfP: Numerical Computations with GPUs

September 22nd, 2013

Numerical Computations with GPUs, to be published by Springer, will contain a collection of articles on core numerical methods adapted for Graphics Processing Units (GPUs). Classical numerical methods (solution of linear equations, FFT, etc.) are at the core of many scientific and engineering computations. In recent years substantial efforts have been made to adapt these methods to recently emerged GPU-based systems. The book is envisioned as a consolidation of such work into a single volume covering widely used methods and techniques. Each chapter will provide mathematical background, a parallel algorithm, and implementation details leading to reusable, adaptable, and scalable code fragments. Each chapter will be accompanied by basic CUDA or OpenCL source code that readers can use as a starting point for adaptation in their applications. The book will serve as a GPU implementation manual for many numerical algorithms, providing valuable insights into parallelization strategies for GPUs as well as ready-to-use code fragments with a broad appeal to both developers and researchers interested in GPU computing.

Authors interested in contributing to this volume are asked to submit a short proposal via EasyChair by October 15, 2013. Authors of the accepted/invited chapters are expected to write and submit completed chapters to the editor by January 31, 2014. For more details, see the full solicitation or contact the Editor.

Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone

September 22nd, 2013


Feature detection and extraction are essential in computer vision applications such as image matching and object recognition. The Scale-Invariant Feature Transform (SIFT) algorithm is one of the most robust approaches to detect and extract distinctive invariant features from images. However, high computational complexity makes it difficult to apply the SIFT algorithm to mobile applications. Recent developments in mobile processors have enabled heterogeneous computing on mobile devices, such as smartphones and tablets. In this paper, we present an OpenCL-based implementation of the SIFT algorithm on a smartphone, taking advantage of the mobile GPU. We carefully analyze the SIFT workloads and identify the parallelism. We implemented major steps of the SIFT algorithm using both serial C++ code and OpenCL kernels targeting mobile processors, to compare the performance of different workflows. Based on the profiling results, we partition the SIFT algorithm between the CPU and GPU in a way that best exploits the parallelism and minimizes the buffer transfer time to achieve better performance. The experimental results show that we are able to achieve 8.5 FPS for keypoint detection and 19 FPS for descriptor generation without reducing the number and the quality of the keypoints. Moreover, the heterogeneous implementation can reduce energy consumption by 41% compared to an optimized CPU-only implementation.

(Guohui Wang, Blaine Rister, and Joseph R. Cavallaro: “Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013, [PDF])

High Throughput Low Latency LDPC Decoding on GPU for SDR Systems

September 22nd, 2013


In this paper, we present a high throughput and low latency LDPC (low-density parity-check) decoder implementation on GPUs (graphics processing units). The existing GPU-based LDPC decoder implementations suffer from low throughput and long latency, which prevent them from being used in practical SDR (software-defined radio) systems. To overcome this problem, we present optimization techniques for a parallel LDPC decoder including algorithm optimization, fully coalesced memory access, asynchronous data transfer and multi-stream concurrent kernel execution for modern GPU architectures. Experimental results demonstrate that the proposed LDPC decoder achieves 316 Mbps (at 10 iterations) peak throughput on a single GPU. The decoding latency, which is much lower than that of the state of the art, varies from 0.207 ms to 1.266 ms for different throughput requirements from 62.5 Mbps to 304.16 Mbps. When using four GPUs concurrently, we achieve an aggregate peak throughput of 1.25 Gbps (at 10 iterations).

(Guohui Wang, Michael Wu, Bei Yin, and Joseph R. Cavallaro: “High Throughput Low Latency LDPC Decoding on GPU for SDR Systems”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013. [PDF])
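
A quick arithmetic check of the multi-GPU figure quoted in the abstract, written as a short Python snippet. The "scaling efficiency" computed here is an illustrative framing rather than a metric reported by the authors, and it assumes 1 Gbps = 1000 Mbps.

```python
single_gpu_peak_mbps = 316.0   # peak throughput on one GPU (from the abstract)
four_gpu_peak_mbps = 1250.0    # aggregate peak on four GPUs: 1.25 Gbps
num_gpus = 4

# Average throughput contributed by each GPU in the four-GPU run.
per_gpu_mbps = four_gpu_peak_mbps / num_gpus

# Fraction of ideal linear scaling (4 x 316 Mbps) actually reached.
scaling_efficiency = four_gpu_peak_mbps / (num_gpus * single_gpu_peak_mbps)

print(f"{per_gpu_mbps:.1f} Mbps per GPU, {scaling_efficiency:.1%} of linear scaling")
```

Under these assumptions the four-GPU result works out to 312.5 Mbps per GPU, close to the single-GPU peak of 316 Mbps.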

Fast JPEG codec from Fastvideo

September 22nd, 2013

Fastvideo have released their JPEG codec for NVIDIA GPUs. Peak performance of the codec reaches 6 GBytes per second and higher for images loaded from host RAM. For instance, a full-color 4K image with a resolution of 3840 x 2160 can be compressed by a factor of 10 in merely 6 milliseconds on an NVIDIA GeForce GTX Titan. More information:

PPAM 2013 CUDA Course Notes

September 9th, 2013

All course materials from the full-day CUDA tutorial at PPAM 2013 are now available. The tutorial was held on Sunday, Sep. 8, 2013 in Warsaw, Poland.
