CfP: 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS12)

August 11th, 2012

This workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking or the use of tools such as simulators. We are particularly interested in research which reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems, for example through work that ensures continued application scalability through peta- and exa-scale systems.


Virtual OpenCL (VCL) Cluster Platform 1.14 released

August 10th, 2012

The MOSIX group announces the release of the Virtual OpenCL (VCL) cluster platform version 1.14. This version includes the SuperCL extension that allows micro OpenCL programs to run efficiently on devices of remote nodes. VCL provides an OpenCL platform in which all the cluster devices are seen as if they are located in the hosting-node. This platform benefits OpenCL applications that can use many devices concurrently. Applications written for VCL benefit from the reduced programming complexity of a single computer, the availability of shared-memory, multi-threads and lower granularity parallelism, as well as concurrent access to devices in many nodes. With SuperCL, a programmable sequence of kernels and/or memory operations can be sent to remote devices in cluster nodes, usually with just a single network round-trip. SuperCL also offers asynchronous communication with the host, to avoid the round-trip waiting time, as well as direct access to distributed file-systems. The VCL package can be downloaded from

AMD OpenCL Webinar Series – August Line Up

August 9th, 2012

Graphics Core Next Architecture Overview

Designed to push not only the boundaries of DirectX® 11 gaming, the GCN Architecture is also AMD’s first design specifically engineered for general computing. Equipped with up to 32 compute units (2048 stream processors), each containing a scalar coprocessor, AMD’s 28nm GPUs are more than capable of handling workloads, and programming languages, traditionally exclusive to the processor. Coupled with the dramatic rise of GPU-aware programming languages like C++ AMP and OpenCL™, the GCN Architecture is truly the right architecture for the right time. Participate in this webinar to learn how you can take advantage of this new architecture in your GPGPU programs (North America – August 14, 2012, 10AM Pacific Daylight Time; India – August 21, 2012, 5:30PM India Standard Time).

Performance Evaluation of AMD APARAPI Using Real World Applications


Acceleware OpenCL, CUDA and AMP training

August 9th, 2012

The fall schedule for Acceleware’s training courses is now available.

  • OpenCL: August 21-24, 2012, Houston, TX
  • CUDA: October 2-5, 2012, San Jose, CA
  • OpenCL: October 16-19, 2012, Calgary, AB
  • CUDA: November 6-9, 2012, Houston, TX
  • CUDA: December 4-7, 2012, New York, NY – Finance Focus
  • AMP: December 11-14, 2012, Chicago, IL

More information:

CfP: GPU-Cloud 2012

August 6th, 2012

The 2012 International Workshop on GPU Computing in Clouds (GPU-Cloud 2012) will be held December 03-06, 2012 in Taipei, Taiwan, in conjunction with the 4th International Conference on Cloud Computing Technology and Science. Important Dates:

  • Submission Deadline: August 17, 2012
  • Authors Notification: September 11, 2012
  • Final Manuscript Due: September 28, 2012
  • Workshop: December 04, 2012

Submission site:

Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories

August 1st, 2012


We present an efficient algorithm for computation of surface representations enabling interactive visualization of large dynamic particle data sets. Our method is based on a GPU-accelerated data-parallel algorithm for computing a volumetric density map from Gaussian weighted particles. The algorithm extracts an isovalue surface from the computed density map, using fast GPU-accelerated Marching Cubes. This approach enables interactive frame rates for molecular dynamics simulations consisting of millions of atoms. The user can interactively adjust the display of structural detail on a continuous scale, ranging from atomic detail for in-depth analysis, to reduced detail visual representations suitable for viewing the overall architecture of molecular complexes. The extracted surface is useful for interactive visualization, and provides a basis for structure analysis methods.

(Michael Krone, John E. Stone, Thomas Ertl, and Klaus Schulten, “Fast visualization of Gaussian density surfaces for molecular dynamics and particle system trajectories”, In EuroVis – Short Papers 2012, pp. 67-71, 2012. [WWW])
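The two stages named in the abstract, a density map built from Gaussian-weighted particles followed by isosurface extraction, can be illustrated with a small serial sketch. This is not the authors' GPU implementation: the function names, the scatter-style accumulation, the cutoff of three standard deviations, and the grid layout are all assumptions made for illustration, and the second function only counts the cells that a Marching Cubes pass would triangulate rather than generating triangles.

```python
import math


def gaussian_density_grid(particles, grid_shape, spacing, sigma, cutoff_sigmas=3.0):
    """Accumulate exp(-r^2 / (2 sigma^2)) from each particle into nearby voxels.

    Hypothetical helper: a serial CPU stand-in for a data-parallel density pass.
    """
    nx, ny, nz = grid_shape
    density = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    # Voxels farther than cutoff_sigmas * sigma contribute almost nothing, so skip them.
    reach = int(math.ceil(cutoff_sigmas * sigma / spacing))
    inv_two_sigma_sq = 1.0 / (2.0 * sigma * sigma)
    for px, py, pz in particles:
        ci, cj, ck = (int(round(c / spacing)) for c in (px, py, pz))
        for i in range(max(0, ci - reach), min(nx, ci + reach + 1)):
            dx = i * spacing - px
            for j in range(max(0, cj - reach), min(ny, cj + reach + 1)):
                dy = j * spacing - py
                for k in range(max(0, ck - reach), min(nz, ck + reach + 1)):
                    dz = k * spacing - pz
                    r2 = dx * dx + dy * dy + dz * dz
                    density[i][j][k] += math.exp(-r2 * inv_two_sigma_sq)
    return density


def count_surface_cells(density, isovalue):
    """Count grid cells whose eight corner values straddle the isovalue.

    These are the cells a Marching Cubes pass would emit triangles for.
    """
    nx, ny, nz = len(density), len(density[0]), len(density[0][0])
    crossing = 0
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                corners = [density[i + a][j + b][k + c]
                           for a in (0, 1) for b in (0, 1) for c in (0, 1)]
                if min(corners) < isovalue <= max(corners):
                    crossing += 1
    return crossing
```

Calling `gaussian_density_grid([(4.0, 4.0, 4.0)], (9, 9, 9), spacing=1.0, sigma=1.0)` and then `count_surface_cells(grid, 0.5)` gives the number of cells crossed by the 0.5 isosurface around a single particle. Raising `sigma` smooths the surface, which corresponds to the adjustable level of structural detail the abstract describes.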

A GPU-Based Multi-Swarm PSO Method for Parameter Estimation in Stochastic Biological Systems Exploiting Discrete-Time Target Series

August 1st, 2012


Parameter estimation (PE) of biological systems is one of the most challenging problems in Systems Biology. Here we present a PE method that integrates particle swarm optimization (PSO) to estimate the value of kinetic constants, and a stochastic simulation algorithm to reconstruct the dynamics of the system. The fitness of candidate solutions, corresponding to vectors of reaction constants, is defined as the point-to-point distance between a simulated dynamics and a set of experimental measures, carried out using discrete-time sampling and various initial conditions. A multi-swarm PSO topology with different modalities of particles migration is used to account for the different laboratory conditions in which the experimental data are usually sampled. The whole method has been specifically designed and entirely executed on the GPU to provide a reduction of computational costs. We show the effectiveness of our method and discuss its performances on an enzymatic kinetics and a prokaryotic gene expression network.

(M. Nobile, D. Besozzi, P. Cazzaniga, G. Mauri and D. Pescini: “A GPU-based multi-swarm PSO method for parameter estimation in stochastic biological systems exploiting discrete-time target series”, in M. Giacobini, L. Vanneschi, W. Bush, editors, Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Springer, vol. 7246 of LNCS. pp. 74-85, 2012. [DOI])
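The core loop the abstract describes, where PSO proposes kinetic constants and each candidate is scored by the point-to-point distance between a simulated series and a discrete-time target series, can be sketched as follows. Several simplifications are assumptions and not the authors' method: a deterministic exponential decay stands in for the stochastic simulation algorithm, a single swarm replaces the multi-swarm topology, everything runs serially on the CPU, and the names `simulate`, `fitness`, and `pso_estimate` and the inertia and acceleration constants are illustrative choices.

```python
import math
import random


def simulate(k, x0, times):
    """Stand-in model: deterministic decay x(t) = x0 * exp(-k t), not an SSA run."""
    return [x0 * math.exp(-k * t) for t in times]


def fitness(k, x0, times, target):
    """Point-to-point distance between the simulated and target series."""
    return sum(abs(s - t) for s, t in zip(simulate(k, x0, times), target))


def pso_estimate(target, x0, times, bounds=(0.01, 2.0),
                 n_particles=20, n_iters=60, seed=0):
    """Estimate one rate constant with a basic global-best PSO (hypothetical helper)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]
    best_fit = [fitness(p, x0, times, target) for p in pos]
    g = min(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g], best_fit[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social weights (assumed values)
    for _ in range(n_iters):
        for i in range(n_particles):
            vel[i] = (w * vel[i]
                      + c1 * rng.random() * (best_pos[i] - pos[i])
                      + c2 * rng.random() * (g_pos - pos[i]))
            # Keep the candidate constant inside the search bounds.
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            f = fitness(pos[i], x0, times, target)
            if f < best_fit[i]:
                best_pos[i], best_fit[i] = pos[i], f
                if f < g_fit:
                    g_pos, g_fit = pos[i], f
    return g_pos, g_fit
```

For example, with `times = list(range(10))` and `target = simulate(0.5, 100.0, times)`, calling `pso_estimate(target, 100.0, times)` returns an estimate of the constant together with its fitness. Each particle's fitness evaluation is independent of the others, which is the property that makes this kind of method a natural fit for GPU execution.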

OpenACC Compilers Now Available from PGI

July 27th, 2012

PGI Release 12.6 is now out. New in this release:

  • PGI Accelerator compilers — first release of the Fortran and C compilers to include comprehensive support for the OpenACC 1.0 specification including the acc cache construct and the entire OpenACC API library. See the PGI Accelerator page for a complete list of supported features.
  • CUDA Toolkit — PGI Accelerator compilers and CUDA Fortran now include support for CUDA Toolkit version 4.2; version 4.1 is now the default.

Download a free trial from the PGI website.

Upcoming PGI webinar with Michael Wolfe, sponsored by NVIDIA: “Using OpenACC Directives with the PGI Accelerator Compilers”, 9:00AM PDT, July 31st. Register at

Policy-based Tuning for Performance Portability and Library Co-optimization

July 22nd, 2012


Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.

(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])
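The idea of granularity coarsening mentioned in the abstract can be illustrated with a reduction whose work split is controlled by a separate policy object. This is only a loose Python analogue: the paper concerns GPU kernels, where a policy would more plausibly be a compile-time parameter, and the `ReducePolicy` class, its `threads` field, and `coarsened_reduce` are hypothetical names introduced here. The sketch assumes the operator is associative and commutative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReducePolicy:
    """Hypothetical tuning policy: how many logical threads share the input."""
    threads: int

    def __post_init__(self):
        if self.threads < 1:
            raise ValueError("threads must be at least 1")


def coarsened_reduce(data, op, identity, policy):
    """Reduce data with op, splitting the work according to policy.

    Phase 1: each logical thread serially folds a strided slice of the input,
             so per-thread work grows with the problem size.
    Phase 2: the per-thread partials are combined pairwise, so the
             communication-like step scales with policy.threads instead.
    """
    partials = []
    for t in range(policy.threads):
        acc = identity
        for x in data[t::policy.threads]:
            acc = op(acc, x)
        partials.append(acc)
    while len(partials) > 1:
        combined = [op(partials[i], partials[i + 1])
                    for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            combined.append(partials[-1])  # odd element carries over unchanged
        partials = combined
    return partials[0]
```

The same routine can then be called with different policies, for example `coarsened_reduce(list(range(1000)), lambda a, b: a + b, 0, ReducePolicy(threads=64))`, and the result is unchanged while the split between serial and pairwise work changes. Separating the policy from the algorithm in this way is what lets one component be tuned per processor without being rewritten.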

MC# 3.0 with GPU support

July 22nd, 2012

Version 3.0 of the MC# programming system has been released. MC# is a universal parallel programming language aimed at any parallel architecture: multicore processors, systems with GPUs, or clusters. It is an extension of the C# language and supports a high-level parallel programming style.
