This workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking or the use of tools such as simulators. We are particularly interested in research which reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems, for example through work that ensures continued application scalability through peta- and exa-scale systems.
CfP: 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS12)
August 11th, 2012
The 2012 International Workshop on GPU Computing in Clouds (GPU-Cloud 2012) will be held December 03-06, 2012, in Taipei, Taiwan, in conjunction with the 4th International Conference on Cloud Computing Technology and Science.
Important Dates:
- Submission Deadline: August 17, 2012
- Author Notification: September 11, 2012
- Final Manuscript Due: September 28, 2012
- Workshop: December 04, 2012
Submission site: http://www.easychair.org/conferences/?conf=gpucloud2012
Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories
August 1st, 2012
We present an efficient algorithm for computation of surface representations enabling interactive visualization of large dynamic particle data sets. Our method is based on a GPU-accelerated data-parallel algorithm for computing a volumetric density map from Gaussian weighted particles. The algorithm extracts an isovalue surface from the computed density map, using fast GPU-accelerated Marching Cubes. This approach enables interactive frame rates for molecular dynamics simulations consisting of millions of atoms. The user can interactively adjust the display of structural detail on a continuous scale, ranging from atomic detail for in-depth analysis, to reduced detail visual representations suitable for viewing the overall architecture of molecular complexes. The extracted surface is useful for interactive visualization, and provides a basis for structure analysis methods.
(Michael Krone, John E. Stone, Thomas Ertl, and Klaus Schulten, “Fast visualization of Gaussian density surfaces for molecular dynamics and particle system trajectories”, In EuroVis – Short Papers 2012, pp. 67-71, 2012. [WWW])
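The two stages named in the abstract above, accumulating a Gaussian-weighted density map and then locating the cells an isosurface passes through, can be illustrated with a minimal serial sketch. This is not the authors' GPU implementation: the function names, the 3-sigma cutoff, and the unit particle weights are assumptions made for illustration, and the second function only counts the cells in which Marching Cubes would emit triangles rather than generating geometry.

```python
import math

def gaussian_density_grid(particles, grid_dims, spacing, sigma):
    """Sum Gaussian contributions of particles on a regular grid.

    All parameter names are hypothetical; each particle has unit weight.
    """
    nx, ny, nz = grid_dims
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    # Assumption: contributions beyond ~3 sigma are negligible, so only a
    # cubic neighbourhood of +/- reach grid points is visited per particle.
    reach = int(math.ceil(3.0 * sigma / spacing))
    inv_two_sigma_sq = 1.0 / (2.0 * sigma * sigma)
    for px, py, pz in particles:
        ci, cj, ck = (int(round(c / spacing)) for c in (px, py, pz))
        for i in range(max(0, ci - reach), min(nx, ci + reach + 1)):
            for j in range(max(0, cj - reach), min(ny, cj + reach + 1)):
                for k in range(max(0, ck - reach), min(nz, ck + reach + 1)):
                    dx = i * spacing - px
                    dy = j * spacing - py
                    dz = k * spacing - pz
                    r2 = dx * dx + dy * dy + dz * dz
                    grid[i][j][k] += math.exp(-r2 * inv_two_sigma_sq)
    return grid

def cells_crossing_isovalue(grid, isovalue):
    """Count cells whose eight corner values straddle the isovalue."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    count = 0
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                corners = [grid[i + a][j + b][k + c]
                           for a in (0, 1) for b in (0, 1) for c in (0, 1)]
                if min(corners) < isovalue <= max(corners):
                    count += 1
    return count
```

In a data-parallel version, each grid point (or each particle) would be handled by its own thread; the serial loops here only show what is being computed. Raising `sigma` smooths the density and yields the reduced-detail surfaces the abstract mentions.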
A GPU-Based Multi-Swarm PSO Method for Parameter Estimation in Stochastic Biological Systems Exploiting Discrete-Time Target Series
August 1st, 2012
Parameter estimation (PE) of biological systems is one of the most challenging problems in Systems Biology. Here we present a PE method that integrates particle swarm optimization (PSO), to estimate the values of kinetic constants, with a stochastic simulation algorithm, to reconstruct the dynamics of the system. The fitness of candidate solutions, corresponding to vectors of reaction constants, is defined as the point-to-point distance between the simulated dynamics and a set of experimental measurements, carried out using discrete-time sampling and various initial conditions. A multi-swarm PSO topology with different modalities of particle migration is used to account for the different laboratory conditions in which the experimental data are usually sampled. The whole method has been specifically designed for, and is entirely executed on, the GPU to reduce computational costs. We show the effectiveness of our method and discuss its performance on an enzymatic kinetics network and a prokaryotic gene expression network.
(M. Nobile, D. Besozzi, P. Cazzaniga, G. Mauri and D. Pescini: “A GPU-based multi-swarm PSO method for parameter estimation in stochastic biological systems exploiting discrete-time target series”, in M. Giacobini, L. Vanneschi, W. Bush, editors, Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Springer, vol. 7246 of LNCS. pp. 74-85, 2012. [DOI])
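The fitness definition in the abstract above (point-to-point distance between a simulated series and a discrete-time target series) can be sketched with a plain single-swarm PSO. The following is a simplified illustration, not the authors' method: it uses one swarm instead of a multi-swarm topology, runs on the CPU, and replaces the stochastic simulation algorithm with a deterministic exponential-decay stand-in. All function names, parameter values, and the decay model are assumptions.

```python
import math
import random

def simulate(k, times, x0=100.0):
    # Stand-in model x(t) = x0 * exp(-k t). The paper uses a stochastic
    # simulation algorithm here; this deterministic model is an assumption.
    return [x0 * math.exp(-k * t) for t in times]

def fitness(k, times, target):
    # Point-to-point distance between simulated and target series.
    return sum(abs(s - y) for s, y in zip(simulate(k, times), target))

def pso_estimate(times, target, n_particles=20, n_iters=200,
                 bounds=(0.0, 2.0), w=0.7, c1=1.5, c2=1.5, seed=0):
    """Estimate one kinetic constant with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]                                  # personal bests
    best_fit = [fitness(p, times, target) for p in pos]
    g = min(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g], best_fit[g]            # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia + pull toward personal best + pull toward global best.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (best_pos[i] - pos[i])
                      + c2 * r2 * (g_pos - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # stay within bounds
            f = fitness(pos[i], times, target)
            if f < best_fit[i]:
                best_pos[i], best_fit[i] = pos[i], f
                if f < g_fit:
                    g_pos, g_fit = pos[i], f
    return g_pos, g_fit
```

A possible usage: generate a target with `simulate(0.35, [0, 1, 2, 3, 4, 5])` and call `pso_estimate` on it to recover a value close to 0.35. Since every particle's fitness evaluation is independent, this inner loop is the part that maps naturally onto GPU threads.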
Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])
Optimization of a Broadband Discone Antenna Design and Platform Installed Radiation Patterns Using a GPU-Accelerated Savant/WIPL-D Hybrid Approach
July 20th, 2012
Traditional design guidelines for broadband antennas do not always produce satisfactory performance for the desired frequency range of interest. In addition, the accurate prediction of the free-space antenna performance is not sufficient to determine if the antenna will meet a larger system requirement because the performance of the antenna can change significantly when it is installed on a platform. Antenna design software, such as WIPL-D, addresses the difficulties of designing antennas with broadband performance by providing optimization software that can automatically resize the various antenna dimensions until a desired performance criterion is met. At high frequencies, the electrically large size of the platform makes it computationally difficult, or impossible, to directly consider the interactions between the antenna and the platform when designing the antenna in a full-wave solver. This paper describes an approach for the design and optimization of a discone antenna and then the subsequent installation on a large commercial aircraft. The antenna design is optimized across a wide frequency range using WIPL-D Optimizer. The resulting discone antenna design is then imported into Savant-Hybrid, a hybrid asymptotic and full-wave solver, and the installed antenna performance is simulated using GPU acceleration at multiple potential antenna locations to determine the location that provides the least-degraded installed antenna performance.
(Tod Courtney, Matthew C. Miller, John E. Stone, and Robert A. Kipp: “Optimization of a Broadband Discone Antenna Design and Platform Installed Radiation Patterns Using a GPU-Accelerated Savant/WIPL-D Hybrid Approach”, Proceedings of the Applied Computational Electromagnetics Symposium (ACES 2012), Columbus, Ohio, April 2012. [PDF])
Update: Deadline extension to July 28, 2012
Submissions are cordially invited for MCC-III, to be held in Stuttgart, Germany, September 19-21, 2012. This conference is the third in a series that started in 2010 at the Heidelberg Academy of Sciences (HAW) in Heidelberg and continued in 2011 at the Karlsruhe Institute of Technology (KIT) and the Engineering Mathematics and Computing Lab (EMCL). It aims to combine new aspects of multi-/manycore microprocessor technologies, parallel applications, numerical simulation, software development and tools. Contributions are welcome from all participating disciplines. Particular emphasis is placed on the support and advancement of young scientists, in addition to high-quality invited keynote talks and tutorials. More information including the full call for papers, topics of interest and submission instructions: http://www.multicore-challenge.org
This paper presents results of an implementation of a code generator for fast general matrix multiply (GEMM) kernels. When a set of parameters is given, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance implementation on GPUs from AMD. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.
(K. Matsumoto, N. Nakasato, S. G. Sedukhin: “Implementing a code generator for fast matrix multiplication in OpenCL on the GPU”, accepted for Special Session: Auto-Tuning for Multicore and GPU (ATMG), IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), Sep. 2012. [PDF])
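The block-major layout mentioned in the abstract above can be illustrated with a small repacking routine. This is a sketch of one common definition (square blocks stored one after another, each block row-major internally); the paper's exact layout and block shapes may differ, and both function names are hypothetical.

```python
def to_block_major(matrix, block):
    """Repack a row-major matrix (list of rows) into a flat block-major list."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be multiples of the block size")
    flat = []
    for bi in range(0, rows, block):          # block row
        for bj in range(0, cols, block):      # block column
            for i in range(bi, bi + block):   # rows inside the block
                flat.extend(matrix[i][bj:bj + block])
    return flat

def block_major_index(i, j, cols, block):
    """Flat offset of element (i, j) in the layout produced above."""
    blocks_per_row = cols // block
    block_id = (i // block) * blocks_per_row + (j // block)
    return block_id * block * block + (i % block) * block + (j % block)
```

With this layout, all elements of one block are contiguous in memory, so a work-group that loads a block reads one contiguous range instead of several strided rows.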
In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and four primary use cases that PT could help address: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.
(Kshitij Gupta, Jeff A. Stuart and John D. Owens: “A Study of Persistent Threads Style GPU Programming for GPGPU Workloads”, Proceedings of Innovative Parallel Computing, May 2012. [WWW])
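The core idea of the PT style, a fixed set of long-running workers that repeatedly claim work from a shared queue instead of one short-lived thread per work item, can be shown with a CPU analogy. This sketch uses Python threads and a lock-protected counter in place of GPU threads and an atomic fetch-and-add; it is an illustration of the control flow only, not GPU code, and the function name and worker count are assumptions.

```python
import threading

def run_persistent(work_items, process, n_workers=4):
    """Process all items with a fixed pool of workers that loop until done."""
    results = [None] * len(work_items)
    next_index = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            # Claim the next item; the lock stands in for an atomic
            # fetch-and-add on a global work counter.
            with lock:
                i = next_index
                next_index += 1
            if i >= len(work_items):
                return                      # no work left: worker retires
            results[i] = process(work_items[i])

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker takes a new item as soon as it finishes the previous one, uneven item costs balance out across workers, which corresponds to the load-balancing use case listed in the abstract.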
General-purpose GPU computing has recently drawn attention from the high-performance computing community because GPUs offer higher core density and lower energy per instruction (EPI) than CPUs. The latest Top500 report lists thirty-nine supercomputing systems that use GPUs to accelerate computation, including two Chinese systems, Tianhe-1A and Nebulae, at No. 2 and No. 4, and one Japanese system, Tsubame 2.0, at No. 5. Amazon has announced the availability of Cluster GPU Instances for Amazon EC2 to deliver the computational power of GPUs in clouds. More and more researchers use GPU clusters instead of CPU clusters to solve computation-intensive problems in areas such as high-energy physics, scientific simulation, data mining, climate forecasting, and earthquake prediction. As the impact of GPUs on both academia and engineering grows rapidly, many issues in GPU cluster computing must be addressed to improve and enrich the user experience and the range of applications. For example, GPU programming interfaces such as CUDA and OpenCL are too complex for many users to move their applications to this new computing platform, because these interfaces and models differ considerably from MPI and OpenMP, which are widely used in CPU cluster computing. In addition, users lack friendly and efficient tools, such as debuggers and performance analyzers, during program development. Meanwhile, computing systems built on GPU clusters urgently need tools that effectively monitor and manage GPU resources to sustain system throughput and to maintain the QoS and reliability of user applications. This special issue therefore aims to provide a forum for researchers to present innovative designs, implementations, and experience in software for GPU cluster computing. We encourage authors to submit high-quality, original, unpublished papers.
Potential topics include, but are not limited to: