Panoptes: A Binary Translation Framework for CUDA

May 22nd, 2012

Traditional CPU-based computing environments offer a variety of binary instrumentation frameworks. Instrumentation and analysis tools for GPU environments to date have been more limited. Panoptes is a binary instrumentation framework for CUDA that targets the GPU. By exploiting the GPU to run modified kernels, computationally-intensive programs can be run at the native parallelism of the device during analysis. To demonstrate its instrumentation capabilities, we currently implement a memory addressability and validity checker that targets CUDA programs.

Panoptes traces targeted programs by library interposition at runtime. Read the rest of this entry »

Benchmarking Analytical Queries on a GPU

May 20th, 2012

This report describes advantages of using GPUs for analytical queries. It compares performance of the Alenka database engine using a GPU with the performance of Oracle on a  SPARC server. More information on Alenka including source code:

CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform

May 11th, 2012


Motivation: New high-throughput sequencing technologies have promoted the production of short reads with dramatically low unit cost. The explosive growth of short read datasets poses a challenge to the mapping of short reads to reference genomes, such as the human genome, in terms of alignment quality and execution speed.

Results: We present CUSHAW, a parallelized short read aligner based on the compute unified device architecture (CUDA) parallel programming model. We exploit CUDA-compatible graphics hardware as accelerators to achieve fast speed. Our algorithm employs a quality-aware bounded search approach based on the Burrows- Wheeler transform (BWT) and the Ferragina Manzini (FM)-index to reduce the search space and achieve high alignment quality. Performance evaluation, using simulated as well as real short read datasets, reveals that our algorithm running on one or two graphics processing units (GPUs) achieves significant speedups in terms of execution time, while yielding comparable or even better alignment quality for paired-end alignments compared to three popular BWT-based aligners: Bowtie, BWA and SOAP2. CUSHAW also delivers competitive performance in terms of SNP calling for an E.coli test dataset.


(Y. Liu, B. Schmidt, D. Maskell: “CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform”, Bioinformatics, 2012. [DOI])

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework which will include many new functionalities as well as boosted performance. This new version, cooked for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta teting program has been started. See the complete text at

Scalable GPU graph traversal

April 17th, 2012


Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations both CPU and GPU platforms.

(Duane Merrill, Michael Garland and  Andrew Grimshaw: “Scalable GPU graph traversal”, Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming (PPoPP’12), pp.117-128, Feburary 2012. [DOI])

CFP: Deadline Extension – UKPEW 2012 – The 28th UK Performance Engineering Workshop

April 10th, 2012

UKPEW is the leading UK forum for the presentation of all aspects of performance modeling and analysis of computer and telecommunication systems. Original papers are invited on all relevant topics but papers on or related to the subjects listed below are particularly welcome.

The paper submission deadline has just been extended to April 20, 2012. The conference takes place June 2 and 3, 2012, in Edinburgh, UK. More Information:

Accelerate Your Science on the Titan Supercomputer

April 1st, 2012

Accelerate your science on the Titan Supercomputer later this year, by harnessing up to 20 petaflops of parallel processing using GPUs. Open to researchers from academia, government labs, and industry, the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program is the major means by which the scientific community gains access to some of the fastest supercomputers.

First, let INCITE know you are interested in GPU acceleration by completing a two-minute survey. Then determine if you want to submit a formal proposal by June 27, 2012.

Need help drafting your proposal? Attend a “how-to” webinar on Tuesday, April 24 to learn tips and tricks for drafting your proposal. For further questions about the call for proposals, please contact the INCITE manager at

Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU

April 1st, 2012


We present a new adaptive format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on a set of 1600 matrices and we show for what types of matrices our format is profitable.

(Heller M., Oberhuber T., “Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU“, preprint on 2012, [PDF])

Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods

March 18th, 2012


Modern GPUs are well suited for performing image processing tasks. We utilize their high computational performance and memory bandwidth for image segmentation purposes. We segment cardiac MRI data by means of numerical solution of an anisotropic partial differential equation of the Allen-Cahn type. We implement two different algorithms for solving the equation on the CUDA architecture. One of them is based on the Runge-Kutta-Merson method for the approximation of solutions of ordinary differential equations, the other uses the GMRES method for the numerical solution of systems of linear equations. In our experiments, the CUDA implementations of both algorithms are about 3–9 times faster than corresponding 12-threaded OpenMP implementations.

(Oberhuber T., Suzuki A., Vacata J., Žabka V., “Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods“, Journal of Math-for-Industry, 2011, vol. 3, pp. 73–79 [PDF])

Wall Orientation and Shear Stress in the Lattice Boltzmann Model

March 16th, 2012


The wall shear stress is a quantity of profound importance for clinical diagnosis of artery diseases. The lattice Boltzmann is an easily parallelizable numerical method of solving the flow problems, but it suffers from errors of the velocity field near the boundaries which leads to errors in the wall shear stress and normal vectors computed from the velocity. In this work we present a simple formula to calculate the wall shear stress in the lattice Boltzmann model and propose to compute wall normals, which are necessary to compute the wall shear stress, by taking the weighted mean over boundary facets lying in a vicinity of a wall element. We carry out several tests and observe an increase of accuracy of computed normal vectors over other methods in two and three dimensions. Using the scheme we compute the wall shear stress in an inclined and bent channel fluid flow and show a minor influence of the normal on the numerical error, implying that that the main error arises due to a corrupted velocity field near the staircase boundary. Finally, we calculate the wall shear stress in the human abdominal aorta in steady conditions using our method and compare the results with a standard finite volume solver and experimental data available in the literature. Applications of our ideas in a simplified protocol for data preprocessing in medical applications are discussed.

(Maciej Matyka, Zbigniew Koza, Łukasz Mirosław: “Wall Orientation and Shear Stress in the Lattice Boltzmann Model”, Preprint, 2012. [arXiv])

Page 9 of 57« First...7891011...203040...Last »