Efficient Multi-GPU Computation of All-Pairs Shortest Paths

April 2nd, 2014


We describe a new algorithm for solving the all-pairs shortest-path (APSP) problem for planar graphs and graphs with small separators that exploits the massive on-chip parallelism available in today’s Graphics Processing Units (GPUs). Our algorithm, based on the Floyd-Warshall algorithm, has near optimal complexity in terms of the total number of operations, while its matrix-based structure is regular enough to allow for efficient parallel implementation on the GPUs. By applying a divide-and-conquer approach, we are able to make use of multi-node GPU clusters, resulting in more than an order of magnitude speedup over the fastest known Dijkstra-based GPU implementation and a two-fold speedup over a parallel Dijkstra-based CPU implementation.

(Hristo Djidjev, Sunil Thulasidasan, Guillaume Chapuis, Rumen Andonov and Dominique Lavenier: “Efficient Multi-GPU Computation of All-Pairs Shortest Paths”. To appear in IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2014. [PDF])

High-Performance Image Synthesis for Radio Interferometry

March 26th, 2014


A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore there are distortions that have to be corrected. In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A CPU-based component prepares data for the GPU which is multi-threaded to ensure maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system do work in parallel. The imaging tool is designed such that disk I/O and work on CPU and GPUs is done concurrently. Test cases show that the imaging tool performs nearly 100× faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.

(Daniel Muscat: “High-Performance Image Synthesis for Radio Interferometry”. Preprint, 2014. [arXiv])

cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems

March 26th, 2014


Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can bring to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, by fully exploiting the Nvidia’s Fermi GPU architecture. We show how a considerable computational speedup is achieved on GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae.

(Nobile M.S., Cazzaniga P., Besozzi D., Pescini D., Mauri G.: “cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems”. PLoS ONE 9(3): e91963. [DOI])

GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting

March 26th, 2014


Hybrid structure fitting methods combine data from cryo-electron microscopy and X-ray crystallography with molecular dynamics simulations for the determination of all-atom structures of large biomolecular complexes. Evaluating the quality-of-fit obtained from hybrid fitting is computationally demanding, particularly in the context of a multiplicity of structural conformations that must be evaluated. Existing tools for quality-of-fit analysis and visualization have previously targeted small structures and are too slow to be used interactively for large biomolecular complexes of particular interest today such as viruses or for long molecular dynamics trajectories as they arise in protein folding. We present new data-parallel and GPU-accelerated algorithms for rapid interactive computation of quality-of-fit metrics linking all-atom structures and molecular dynamics trajectories to experimentally-determined density maps obtained from cryo-electron microscopy or X-ray crystallography. We evaluate the performance and accuracy of the new quality-of-fit analysis algorithms vis-a-vis existing tools, examine algorithm performance on GPU-accelerated desktop workstations and supercomputers, and describe new visualization techniques for results of hybrid structure fitting methods.

(John E. Stone, Ryan McGreevy, Barry Isralewitz, and Klaus Schulten: “GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting”. Faraday Discussion 169, 2014. [DOI])

Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA

March 19th, 2014


In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit (GPU). The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture (CUDA) program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512x512x296 voxels. We find that exploitation of onchip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitation of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of second, will be useful to realize rapid nonrigid registration.

(Kei Ikeda, Fumihiko Ino, and Kenichi Hagihara: “Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA”. Accepted for publication in the IEEE Journal of Biomedical and Health Informatics. [DOI])

A Detailed GPU Cache Model Based on Reuse Distance Theory

March 5th, 2014


As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.

(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])

GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms

March 5th, 2014


Petascale supercomputers create new opportunities for the study of the structure and function of large biomolecular complexes such as viruses and photosynthetic organelles, permitting all-atom molecular dynamics simulations of tens to hundreds of millions of atoms. Together with simulation and analysis, visualization provides researchers with a powerful “computational microscope”. Petascale molecular dynamics simulations produce tens to hundreds of terabytes of data that can be impractical to transfer to remote facilities, making it necessary to perform visualization and analysis tasks in-place on the supercomputer where the data are generated. We describe the adaptation of key visualization features of VMD, a widely used molecular visualization and analysis tool, for GPU-accelerated petascale computers. We discuss early experiences adapting ray tracing algorithms for GPUs, and compare rendering performance for recent petascale molecular simulation test cases on Cray XE6 (CPU-only) and XK7 (GPU-accelerated) compute nodes. Finally, we highlight opportunities for further algorithmic improvements and optimizations.

(John E. Stone, Kirby L. Vandivort, and Klaus Schulten: “GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms”. UltraVis’13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013. [DOI])

CfP: High-Performance Graphics

February 17th, 2014

High-Performance Graphics is the leading international forum for performance-oriented graphics and imaging systems research including innovative algorithms, efficient implementations, languages, parallelism, compilers, parallelism, hardware and architectures for high-performance graphics. High-Performance Graphics was founded in 2009, synthesizing multiple conferences to bring together researchers, engineers, and architects to discuss the complex interactions of parallel hardware, novel programming models, and efficient algorithms in the design of systems for current and future graphics and visual computing applications.

HPC is co-located with EGSR, in Lyon, France, June 23-25, 2014. More information: http://www.highperformancegraphics.org/2014

CfP: ADBIS workshop on GPUs In Databases GID 2014

February 17th, 2014

High performance of modern Graphics Processing Units may be utilized not only for graphics related application but also for general computing. This computing power has been utilized in new variants of many algorithms from almost every computer science domain. Unfortunately, while other application domains strongly benefit from utilizing the GPUs, databases related applications seem not to get enough attention. The main goal of GPUs in Databases workshop is to fill this gap. This event is devoted to sharing the knowledge related to applying GPUs in Database environments and to discuss possible future development of this application domain.

ADBIS workshop on GPUs In Databases GID 2014, September 7th, 2014, Ohrid, Republic of Macedonia. More information: http://gid.us.to

OpenCLIPP: an OpenCL library for optimized image processing primitives

February 2nd, 2014

OpenCLIPP is a library providing processing primitives (image processing primitives in the first version) implemented with OpenCL for fast execution on dedicated computing devices like GPUs. Two interfaces are provided: C (similar to the Intel IPP and NVIDIA NPP libraries) and C++. OpenCLIPP is free for personal and commercial use. It can be downloaded from GitHub.

Related publication:
M. Akhloufi, A. Campagna, “OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications”, Proc. SPIE Electronic Imaging 2014, Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, P. 9025-31, February 2014.

Page 1 of 5512345...102030...Last »