GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting

March 26th, 2014


Hybrid structure fitting methods combine data from cryo-electron microscopy and X-ray crystallography with molecular dynamics simulations for the determination of all-atom structures of large biomolecular complexes. Evaluating the quality-of-fit obtained from hybrid fitting is computationally demanding, particularly in the context of a multiplicity of structural conformations that must be evaluated. Existing tools for quality-of-fit analysis and visualization have previously targeted small structures and are too slow to be used interactively for large biomolecular complexes of particular interest today such as viruses or for long molecular dynamics trajectories as they arise in protein folding. We present new data-parallel and GPU-accelerated algorithms for rapid interactive computation of quality-of-fit metrics linking all-atom structures and molecular dynamics trajectories to experimentally-determined density maps obtained from cryo-electron microscopy or X-ray crystallography. We evaluate the performance and accuracy of the new quality-of-fit analysis algorithms vis-a-vis existing tools, examine algorithm performance on GPU-accelerated desktop workstations and supercomputers, and describe new visualization techniques for results of hybrid structure fitting methods.

(John E. Stone, Ryan McGreevy, Barry Isralewitz, and Klaus Schulten: “GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting”. Faraday Discussion 169, 2014. [DOI])
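
To give a feel for what a quality-of-fit metric looks like, here is a minimal sketch of one common choice: the Pearson cross-correlation between a density map simulated from a structure and an experimental map. The abstract does not name a specific metric, so cross-correlation is an assumption here. The function name and the flat-list map representation are hypothetical. This is plain serial Python, not the authors' data-parallel GPU implementation.

```python
import math


def cross_correlation(sim_map, exp_map):
    """Pearson correlation between two density maps.

    Both maps are flat lists of voxel values on the same grid
    (a hypothetical representation chosen for this sketch).
    """
    if len(sim_map) != len(exp_map) or not sim_map:
        raise ValueError("maps must be non-empty and the same size")
    n = len(sim_map)
    mean_s = sum(sim_map) / n
    mean_e = sum(exp_map) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(sim_map, exp_map))
    var_s = sum((s - mean_s) ** 2 for s in sim_map)
    var_e = sum((e - mean_e) ** 2 for e in exp_map)
    if var_s == 0 or var_e == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(var_s * var_e)
```

A value near 1 indicates that the simulated density follows the experimental density closely. Evaluating this over every frame of a long trajectory is the kind of repeated reduction that benefits from GPU acceleration.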

GPU Boost on NVIDIA’s Tesla K40 GPUs

March 26th, 2014

This blog post explains GPU Boost, a new user-controllable feature available on Tesla GPUs. Case studies and benchmarks for reverse time migration and an electromagnetic solver are discussed.

Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA

March 19th, 2014


In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit (GPU). The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture (CUDA) program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512x512x296 voxels. We find that exploitation of on-chip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitation of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of seconds, will be useful to realize rapid nonrigid registration.

(Kei Ikeda, Fumihiko Ino, and Kenichi Hagihara: “Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA”. Accepted for publication in the IEEE Journal of Biomedical and Health Informatics. [DOI])
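
For readers unfamiliar with NMI, the following is a minimal serial sketch of the metric computed from a joint intensity histogram. It assumes the widely used definition NMI = (H(A) + H(B)) / H(A, B); the abstract does not state which variant the authors use. The function names, the bin count, and the intensity range are hypothetical choices for illustration, and nothing here reflects the paper's CUDA implementation or its on-chip memory scheme.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_mutual_information(fixed, moving, bins=8, max_value=256):
    """NMI of two images given as equal-length flat lists of intensities.

    Assumes intensities lie in [0, max_value); bins and max_value are
    illustrative defaults, not values taken from the paper.
    """
    if len(fixed) != len(moving) or not fixed:
        raise ValueError("images must be non-empty and the same size")
    width = max_value / bins

    def to_bin(value):
        return min(int(value / width), bins - 1)

    # Joint histogram over pairs of corresponding voxels.
    joint = Counter((to_bin(f), to_bin(m)) for f, m in zip(fixed, moving))
    # Marginal histograms are obtained by summing the joint histogram.
    marginal_fixed = Counter()
    marginal_moving = Counter()
    for (bin_f, bin_m), count in joint.items():
        marginal_fixed[bin_f] += count
        marginal_moving[bin_m] += count

    n = len(fixed)
    h_joint = entropy(joint.values(), n)
    if h_joint == 0:
        raise ValueError("NMI is undefined when the joint entropy is zero")
    h_fixed = entropy(marginal_fixed.values(), n)
    h_moving = entropy(marginal_moving.values(), n)
    return (h_fixed + h_moving) / h_joint
```

Under this definition the value ranges from 1 (statistically independent intensities) to 2 (one image fully determines the other), and a registration loop would seek the deformation that maximizes it.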

CfP: 7th Workshop on UnConventional High Performance Computing 2014 (UCHPC 2014)

March 10th, 2014

The 7th UCHPC workshop will be held in conjunction with Euro-Par 2014, August 25 – August 29, in Porto, Portugal.

Recent issues with the power consumption of conventional HPC hardware have led to new interest both in accelerator hardware and in the use of mass-market hardware not originally designed for HPC. The most prominent examples are GPUs, but FPGAs, DSPs and embedded designs are also possible candidates to provide higher power efficiency, as they are used in energy-restricted environments, such as smartphones or tablets. The so-called “dark silicon” forecast, i.e. that not all transistors may be active at the same time, may lead to even more specialized hardware in future mass-market products. Exploiting this hardware for HPC can be a worthwhile challenge.


CfP: 2nd Workshop on Parallel and Distributed Agent-Based Simulations (PADABS 2014)

March 10th, 2014

Agent-Based Simulation Models are an increasingly popular tool for research and management in many fields such as ecology, economics and sociology. In some fields, such as social sciences, these models are seen as a key instrument of the generative approach, essential for understanding complex social phenomena. The relevance and effectiveness of Agent-Based Simulation Models have also recently been recognized in policy-making, biology, military simulations, control of mobile robots and economics.

Several frameworks have been recently developed and are active in this field. They range from GPU-manycore approaches to parallel and/or distributed simulation environments.

The key objective of this workshop is to bring together researchers who are interested in getting more performance from their simulations by using synchronized, many-core simulations (e.g., GPUs), strongly coupled, parallel simulations (e.g., MPI) and loosely coupled, distributed simulations (distributed heterogeneous setting). More information:

A Detailed GPU Cache Model Based on Reuse Distance Theory

March 5th, 2014


As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate than the GPGPU-Sim simulator.

(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
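
As background, the sketch below shows the classic sequential form of reuse distance that the paper starts from: for each access, count the distinct addresses touched since the previous access to the same address, then predict a miss in a fully associative LRU cache whenever that count is at least the cache size. It deliberately leaves out all five GPU extensions listed in the abstract. The function names and the trace format (a list of cache-line addresses) are hypothetical.

```python
import math


def reuse_distances(trace):
    """Reuse distance of every access in a trace of cache-line addresses.

    The distance is the number of distinct addresses accessed since the
    previous access to the same address, or infinity for a first access.
    """
    lru_stack = []  # most recently used address is last
    distances = []
    for address in trace:
        if address in lru_stack:
            position = lru_stack.index(address)
            # Every entry after `position` was touched since the last use.
            distances.append(len(lru_stack) - 1 - position)
            lru_stack.pop(position)
        else:
            distances.append(math.inf)
        lru_stack.append(address)
    return distances


def predicted_miss_rate(trace, cache_lines):
    """Predicted miss rate of a fully associative LRU cache with `cache_lines` lines."""
    if not trace:
        raise ValueError("trace must not be empty")
    distances = reuse_distances(trace)
    misses = sum(1 for d in distances if d >= cache_lines)
    return misses / len(distances)
```

The linear list search keeps the sketch short but makes it quadratic in the worst case; practical tools use tree-based structures for long traces.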

New Embedded GPU Platform for General-Purpose Computing Delivers the Highest Performance per Energy or Area

March 5th, 2014

From a recent press release:

The versatile Nema™ Platform for General-Purpose Computing on an embedded GPU (GPGPU) is designed by Think Silicon for excellent performance with ultra-low energy consumption and silicon footprint, and is available now from CAST, Inc.

Designed by graphics processing experts Think Silicon Ltd., the Nema GPU is a scalable, many-core, multi-threaded, state-of-the-art, data processing design blending both graphics rendering and general computing capabilities. It offers easy configuration, rapid programming, and straightforward system integration in a reusable soft IP core suitable for ASIC or FPGA implementation.


GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms

March 5th, 2014


Petascale supercomputers create new opportunities for the study of the structure and function of large biomolecular complexes such as viruses and photosynthetic organelles, permitting all-atom molecular dynamics simulations of tens to hundreds of millions of atoms. Together with simulation and analysis, visualization provides researchers with a powerful “computational microscope”. Petascale molecular dynamics simulations produce tens to hundreds of terabytes of data that can be impractical to transfer to remote facilities, making it necessary to perform visualization and analysis tasks in-place on the supercomputer where the data are generated. We describe the adaptation of key visualization features of VMD, a widely used molecular visualization and analysis tool, for GPU-accelerated petascale computers. We discuss early experiences adapting ray tracing algorithms for GPUs, and compare rendering performance for recent petascale molecular simulation test cases on Cray XE6 (CPU-only) and XK7 (GPU-accelerated) compute nodes. Finally, we highlight opportunities for further algorithmic improvements and optimizations.

(John E. Stone, Kirby L. Vandivort, and Klaus Schulten: “GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms”. UltraVis’13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013. [DOI])

Acceleware OpenCL Training June 2-5, 2014

March 5th, 2014

This hands-on four-day course will teach you how to write applications in OpenCL that fully leverage the multi-core processing capabilities of the GPU. The course is taught by Acceleware developers who bring real-world experience to the classroom. Students will benefit from:

  • Hands-on exercises and progressive lectures
  • Individual laptops with AMD Fusion APU for student use
  • Small class sizes to maximize learning
  • 90 days of post-training support

For more information please visit:

PARALUTION – new release 0.6.0

February 26th, 2014

PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.6.0 version provides the following new features:

  • Windows support (OpenMP backend)
  • FGMRES (Flexible GMRES)
  • (R)CMK (Cuthill–McKee) ordering
  • Thread-core affiliation (for Host OpenMP)
  • Asynchronous transfers (CUDA backend)
  • Pinned memory allocation on the host when using CUDA backend
  • Verbose output for debugging
  • Easy-to-use timing function in the examples

PARALUTION 0.6.0 is available at
