Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can bring to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, by fully exploiting the Nvidia’s Fermi GPU architecture. We show how a considerable computational speedup is achieved on GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae.
(Nobile M.S., Cazzaniga P., Besozzi D., Pescini D., Mauri G.: “cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems”. PLoS ONE 9(3): e91963. [DOI])
Hybrid structure fitting methods combine data from cryo-electron microscopy and X-ray crystallography with molecular dynamics simulations for the determination of all-atom structures of large biomolecular complexes. Evaluating the quality-of-fit obtained from hybrid fitting is computationally demanding, particularly in the context of a multiplicity of structural conformations that must be evaluated. Existing tools for quality-of-fit analysis and visualization have previously targeted small structures and are too slow to be used interactively for large biomolecular complexes of particular interest today such as viruses or for long molecular dynamics trajectories as they arise in protein folding. We present new data-parallel and GPU-accelerated algorithms for rapid interactive computation of quality-of-fit metrics linking all-atom structures and molecular dynamics trajectories to experimentally-determined density maps obtained from cryo-electron microscopy or X-ray crystallography. We evaluate the performance and accuracy of the new quality-of-fit analysis algorithms vis-a-vis existing tools, examine algorithm performance on GPU-accelerated desktop workstations and supercomputers, and describe new visualization techniques for results of hybrid structure fitting methods.
(John E. Stone, Ryan McGreevy, Barry Isralewitz, and Klaus Schulten: “GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting”. Faraday Discussion 169, 2014. [DOI])
This blog post explains GPU Boost, a new user controllable feature available on Tesla GPUs. Case studies and benchmarks for reverse time migration and an electromagnetic solver are discussed.
In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit (GPU). The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture (CUDA) program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512x512x296 voxels. We find that exploitation of onchip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitation of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of second, will be useful to realize rapid nonrigid registration.
(Kei Ikeda, Fumihiko Ino, and Kenichi Hagihara: “Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA”. Accepted for publication in the IEEE Journal of Biomedical and Health Informatics. [DOI])
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
Petascale supercomputers create new opportunities for the study of the structure and function of large biomolecular complexes such as viruses and photosynthetic organelles, permitting all-atom molecular dynamics simulations of tens to hundreds of millions of atoms. Together with simulation and analysis, visualization provides researchers with a powerful “computational microscope”. Petascale molecular dynamics simulations produce tens to hundreds of terabytes of data that can be impractical to transfer to remote facilities, making it necessary to perform visualization and analysis tasks in-place on the supercomputer where the data are generated. We describe the adaptation of key visualization features of VMD, a widely used molecular visualization and analysis tool, for GPU-accelerated petascale computers. We discuss early experiences adapting ray tracing algorithms for GPUs, and compare rendering performance for recent petascale molecular simulation test cases on Cray XE6 (CPU-only) and XK7 (GPU-accelerated) compute nodes. Finally, we highlight opportunities for further algorithmic improvements and optimizations.
(John E. Stone, Kirby L. Vandivort, and Klaus Schulten: “GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms”. UltraVis’13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013. [DOI])
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.6.0 version provides the following new features:
- Windows support (OpenMP backend)
- FGMRES (Flexible GMRES)
- (R)CMK (Cuthill–McKee) ordering
- Thread-core affiliation (for Host OpenMP)
- Asynchronous transfers (CUDA backend)
- Pinned memory allocation on the host when using CUDA backend
- Verbose output for debugging
- Easy to handle timing function in the examples
PARALUTION 0.6.0 is available at http://www.paralution.com.
The new free open-source PyViennaCL 1.0.0 release provides the Python bindings for the ViennaCL linear algebra and numerical computation library for GPGPU and heterogeneous systems. ViennaCL itself is a header-only C++ library, so these bindings make available to Python programmers ViennaCL’s fast OpenCL and CUDA algorithms, in a way that is idiomatic and compatible with the Python community’s most popular scientific packages, NumPy and SciPy. Support through the Google Summer of Code 2013 for the primary developer Toby St Clere Smithe is greatly appreciated.
More information and download: PyViennaCL Home
This tutorial by Dan Cyca outlines the shared memory configurations for NVIDIA Fermi and Kepler architectures, and demonstrates how to rewrite kernels to take advantage of the changes in Kepler’s shared memory architecture.
Developed in partnership with NVIDIA, this hands-on four day course will teach how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. Benefits include:
- Hands-on exercises and progressive lectures
- Individual laptops equipped with NVIDIA GPUs for student use
- Small class sizes to maximize learning
- 90 days post training support – NEW!
February 25-28, 2014, Baltimore, MD, USA, details and registration.