Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA

March 19th, 2014

Abstract:

In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit (GPU). The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture (CUDA) program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512x512x296 voxels. We find that exploitation of on-chip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitations of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of seconds, will be useful to realize rapid nonrigid registration.

(Kei Ikeda, Fumihiko Ino, and Kenichi Hagihara: “Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA”. Accepted for publication in the IEEE Journal of Biomedical and Health Informatics. [DOI])
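
For readers unfamiliar with the similarity measure named in the abstract, the following is a minimal CPU-side sketch of normalized mutual information computed from a joint intensity histogram. It assumes the common definition NMI = (H(A) + H(B)) / H(A, B); the abstract does not state which formulation the authors use, and this is not their CUDA implementation. The function names, the bin count, and the intensity range are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_mutual_information(a, b, bins=8, max_value=256):
    """NMI of two equally sized intensity sequences.

    Assumed definition: (H(A) + H(B)) / H(A, B).
    `bins` and `max_value` are illustrative parameters, not from the paper.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    width = max_value / bins
    joint = Counter()
    for x, y in zip(a, b):
        # Map each intensity to a bin index, clamping the top edge.
        i = min(int(x / width), bins - 1)
        j = min(int(y / width), bins - 1)
        joint[(i, j)] += 1

    # Marginal histograms follow from the joint histogram.
    marg_a, marg_b = Counter(), Counter()
    for (i, j), c in joint.items():
        marg_a[i] += c
        marg_b[j] += c

    n = len(a)
    h_ab = entropy(joint.values(), n)
    if h_ab == 0.0:
        raise ValueError("NMI is undefined when both inputs are constant")
    return (entropy(marg_a.values(), n) + entropy(marg_b.values(), n)) / h_ab
```

Under this definition, two identical images give a value of 2 and two statistically independent images give a value of 1. Building the joint histogram is the step that dominates the cost, which is presumably why the abstract focuses on keeping that data in on-chip memory.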

CfP: 7th Workshop on UnConventional High Performance Computing 2014 (UCHPC 2014)

March 10th, 2014

The 7th UCHPC workshop will be held in conjunction with Euro-Par 2014, August 25 – August 29, in Porto, Portugal.

Recent issues with the power consumption of conventional HPC hardware result in both new interest in accelerator hardware and in usage of mass-market hardware originally not designed for HPC. The most prominent examples are GPUs, but FPGAs, DSPs and embedded designs are also possible candidates to provide higher power efficiency, as they are used in energy-restricted environments, such as smartphones or tablets. The so-called “dark silicon” forecast, i.e. not all transistors may be active at the same time, may lead to even more specialized hardware in future mass-market products. Exploiting this hardware for HPC can be a worthwhile challenge.

CfP: 2nd Workshop on Parallel and Distributed Agent-Based Simulations (PADABS 2014)

March 10th, 2014

Agent-Based Simulation Models are an increasingly popular tool for research and management in many fields such as ecology, economics and sociology. In some fields, such as social sciences, these models are seen as a key instrument to the generative approach, essential for understanding complex social phenomena. But also in policy-making, biology, military simulations, control of mobile robots and economics, the relevance and effectiveness of Agent-Based Simulation Models has recently been recognized.

Several frameworks have been recently developed and are active in this field. They range from GPU-manycore approaches to parallel and/or distributed simulation environments.

The key objective of this workshop is to bring together researchers who are interested in getting more performance from their simulations by using synchronized, many-core simulations (e.g., GPUs), strongly coupled, parallel simulations (e.g., MPI) and loosely coupled, distributed simulations (distributed heterogeneous setting). More information: http://www.padabs.org/

A Detailed GPU Cache Model Based on Reuse Distance Theory

March 5th, 2014

Abstract:

As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.

(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
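
As background for the abstract above, here is a minimal sketch of the classic sequential reuse distance (stack distance) computation that the paper takes as its starting point. It covers only the textbook case of a single access stream and a fully associative LRU cache; it does not model threads, warps, latencies, associativity, or miss-status holding registers, which are the paper's contributions. The function names and the example trace are hypothetical.

```python
import math


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct addresses touched since
    the previous access to the same address, or infinity on first use.
    """
    stack = []  # LRU stack, most recently used address last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Everything above `addr` in the stack was touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(math.inf)
        stack.append(addr)
    return distances


def lru_miss_rate(trace, capacity):
    """Miss rate of a fully associative LRU cache holding `capacity` lines.

    An access hits exactly when its reuse distance is below the capacity.
    """
    distances = reuse_distances(trace)
    misses = sum(1 for d in distances if d >= capacity)
    return misses / len(distances)


# Illustrative trace of cache-line addresses (made up for this example).
example_trace = ["a", "b", "c", "a", "b", "b", "d", "a"]
```

For the example trace, `lru_miss_rate(example_trace, 3)` evaluates to 0.5 and `lru_miss_rate(example_trace, 2)` to 0.875. The list-based stack keeps the sketch short but costs linear time per access; tree-based structures are the usual choice for long traces.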

New Embedded GPU Platform for General-Purpose Computing Delivers the Highest Performance per Energy or Area

March 5th, 2014

From a recent press release:

The versatile Nema™ Platform for General-Purpose Computing on an embedded GPU (GPGPU) is designed by Think Silicon for excellent performance with ultra-low energy consumption and silicon footprint, and is available now from CAST, Inc.

Designed by graphics processing experts Think Silicon Ltd., the Nema GPU is a scalable, many-core, multi-threaded, state-of-the-art, data processing design blending both graphics rendering and general computing capabilities. It offers easy configuration, rapid programming, and straightforward system integration in a reusable soft IP core suitable for ASIC or FPGA implementation.

GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms

March 5th, 2014

Abstract:

Petascale supercomputers create new opportunities for the study of the structure and function of large biomolecular complexes such as viruses and photosynthetic organelles, permitting all-atom molecular dynamics simulations of tens to hundreds of millions of atoms. Together with simulation and analysis, visualization provides researchers with a powerful “computational microscope”. Petascale molecular dynamics simulations produce tens to hundreds of terabytes of data that can be impractical to transfer to remote facilities, making it necessary to perform visualization and analysis tasks in-place on the supercomputer where the data are generated. We describe the adaptation of key visualization features of VMD, a widely used molecular visualization and analysis tool, for GPU-accelerated petascale computers. We discuss early experiences adapting ray tracing algorithms for GPUs, and compare rendering performance for recent petascale molecular simulation test cases on Cray XE6 (CPU-only) and XK7 (GPU-accelerated) compute nodes. Finally, we highlight opportunities for further algorithmic improvements and optimizations.

(John E. Stone, Kirby L. Vandivort, and Klaus Schulten: “GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms”. UltraVis’13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013. [DOI])

Acceleware OpenCL Training June 2-5, 2014

March 5th, 2014

This hands-on four-day course will teach you how to write applications in OpenCL that fully leverage the multi-core processing capabilities of the GPU. The course is taught by Acceleware developers who bring real-world experience to the classroom. Students will benefit from:

  • Hands-on exercises and progressive lectures
  • Individual laptops with AMD Fusion APU for student use
  • Small class sizes to maximize learning
  • 90 days of post-training support

For more information please visit: http://acceleware.com/training/1028

PARALUTION – new release 0.6.0

February 26th, 2014

PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.6.0 version provides the following new features:

  • Windows support (OpenMP backend)
  • FGMRES (Flexible GMRES)
  • (R)CMK (Cuthill–McKee) ordering
  • Thread-core affiliation (for Host OpenMP)
  • Asynchronous transfers (CUDA backend)
  • Pinned memory allocation on the host when using CUDA backend
  • Verbose output for debugging
  • Easy-to-handle timing function in the examples

PARALUTION 0.6.0 is available at http://www.paralution.com.

PyViennaCL: Python wrapper for GPU-accelerated linear algebra

February 26th, 2014

The new free open-source PyViennaCL 1.0.0 release provides the Python bindings for the ViennaCL linear algebra and numerical computation library for GPGPU and heterogeneous systems. ViennaCL itself is a header-only C++ library, so these bindings make ViennaCL’s fast OpenCL and CUDA algorithms available to Python programmers in a way that is idiomatic and compatible with the Python community’s most popular scientific packages, NumPy and SciPy. Support through the Google Summer of Code 2013 for the primary developer Toby St Clere Smithe is greatly appreciated.

More information and download: PyViennaCL Home

Webinar: Accelerating Full Waveform Inversion via OpenCL on AMD GPUs

February 26th, 2014

On March 5 at 11:00am (PST), Acceleware hosts a webinar on accelerating a seismic algorithm on a cluster of AMD GPU compute nodes. The presentation will begin with an outline of the full waveform inversion (FWI) algorithm, followed by an introduction to OpenCL. The OpenCL programming model and memory spaces will be introduced. Strategies for formulating the problem to take advantage of the massively parallel GPU architecture will be discussed, along with key optimization techniques, including coalescing and an iterative approach to handle the slices. Performance results for the GPU are compared to the CPU run times. Click here to register.
