October 11th, 2012
Seeing speedups of an accelerated application is great, but what does it take to build a codebase that will last for years and across architectures? In this webinar, John Stratton will cover some of the insights gained at the University of Illinois at Urbana-Champaign from experience with computer architecture, programming languages, and application development.
The webinar will offer three main conclusions:
- Performance portability should be more achievable than many people think.
- The number one performance-limiting factor now and in the future will be parallel scalability.
- As much as we care about performance, general libraries that will last have to be reliable as well as fast.
Register at http://www.gputechconf.com/page/gtc-express-webinar.html
September 22nd, 2012
The MicroCFD Virtual Wind Tunnel, Educational & Professional Edition, has recently been upgraded. The new version (1.8) supports multi-core CPUs and CUDA-enabled GPUs and runs significantly faster than the previous single-processor version. The results of a benchmark test on a system with an Intel quad-core CPU and an NVIDIA 96-core GPU show that an unsteady 2D or axisymmetric compressible flow can now be run at a resolution of one million cells (Pro Edition) within a few minutes. A 3D version is currently under development and is expected to be released in 2014.
September 4th, 2012
The “Ludwig” lattice Boltzmann fluid dynamics application is a versatile application capable of simulating the hydrodynamics of complex fluids (e.g. mixtures, surfactants, liquid crystals, particle suspensions) to allow cutting-edge research into condensed matter physics. On October 3, Dr. Alan Gray from the University of Edinburgh presents a webinar on his team’s experiences in scaling the application on the Cray XK6 hybrid supercomputer. The presentation will cover:
- A review of excellent scaling up to O(1000) GPUs
- Steps taken to maximize performance on each GPU
- Designing the communication to allow efficient usage of many GPUs in parallel, including the overlapping of several stages using CUDA stream functionality
- Advanced functionality, including how to include colloidal particles in the simulation while minimizing data transfer overheads
Register at http://www.gputechconf.com/page/gtc-express-webinar.html.
August 9th, 2012
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers’ optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels’ parameter space using a search harness.
(Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra, “From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming”, Parallel Computing 38(8):391–407, Aug. 2012. [DOI] [early techreport])
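The auto-tuning idea mentioned at the end of the abstract can be illustrated with a very small search harness. The following is only a sketch under stated assumptions, not the authors' harness: it times a blocked matrix multiply written in pure Python for a handful of candidate block sizes and keeps the fastest. The names `blocked_matmul`, `autotune` and `make_args` are hypothetical, and a real harness would instead compile and time OpenCL kernel variants on the target device.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, n, block):
            for j0 in range(0, n, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(kernel, candidates, make_args, repeats=3):
    """Time kernel(*args, param) for each candidate; return (best_param, timings)."""
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            args = make_args()  # fresh inputs for every timed run
            start = time.perf_counter()
            kernel(*args, param)
            best = min(best, time.perf_counter() - start)
        timings[param] = best  # keep the best of `repeats` runs
    return min(timings, key=timings.get), timings


def make_args(n=48):
    """Hypothetical input generator: two small dense n x n matrices plus n."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    return a, b, n


if __name__ == "__main__":
    best_block, timings = autotune(blocked_matmul, [4, 8, 16, 48], make_args)
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  best time={seconds:.4f}s")
    print("selected block size:", best_block)
```

The selected block size depends on the machine the script runs on, which is the point of the exercise: the search is repeated per platform rather than fixing one parameter choice for all of them.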
July 22nd, 2012
The fall schedule for Acceleware’s training courses is now available.
- OpenCL: August 21-24, 2012, Houston, TX
- CUDA: October 2-5, 2012, San Jose, CA
- OpenCL: October 16-19, 2012, Calgary, AB
- CUDA: November 6-9, 2012, Houston, TX
- CUDA: December 4-7, 2012, New York, NY – Finance Focus
- AMP: December 11-14, 2012, Chicago, IL
More information: http://www.acceleware.com/training
July 20th, 2012
Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])
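As a rough illustration of the granularity coarsening described in the abstract, the sketch below parameterizes a reduction by a small tuning policy. It is a runtime Python analogy with hypothetical names (`ReducePolicy`, `coarsened_reduce`), not the paper's design, where policies would be resolved at compile time inside the GPU kernel. Each logical thread first reduces many items serially; only one partial result per thread is then combined, so the combining step scales with the thread count rather than with the input size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReducePolicy:
    """Hypothetical tuning policy: values a tuner would pick per processor."""
    num_threads: int     # logical "width" of the processor
    items_per_load: int  # consecutive items each thread consumes per step


def coarsened_reduce(data, op, identity, policy):
    """Reduce `data` with `op`; return (result, number_of_partials_combined).

    `op` is assumed to be associative and commutative, because the strided
    access pattern changes the order in which items are combined.
    """
    stride = policy.num_threads * policy.items_per_load
    partials = []
    # Serial phase: each logical thread walks the input with a fixed stride,
    # accumulating privately, so no inter-thread communication is needed.
    for t in range(policy.num_threads):
        acc = identity
        for base in range(t * policy.items_per_load, len(data), stride):
            for x in data[base:base + policy.items_per_load]:
                acc = op(acc, x)
        partials.append(acc)
    # Cooperative phase: exactly one partial per thread is combined,
    # regardless of how large the input was.
    result = identity
    for p in partials:
        result = op(result, p)
    return result, len(partials)


if __name__ == "__main__":
    data = list(range(1000))
    for policy in (ReducePolicy(8, 4), ReducePolicy(32, 1)):
        total, combined = coarsened_reduce(data, lambda x, y: x + y, 0, policy)
        print(policy, "sum =", total, "partials combined =", combined)
```

Swapping the policy changes how the work is divided without touching the reduction logic, which is the kind of separation a tunable component needs.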
July 20th, 2012
Traditional design guidelines for broadband antennas do not always produce satisfactory performance for the desired frequency range of interest. In addition, the accurate prediction of the free-space antenna performance is not sufficient to determine if the antenna will meet a larger system requirement because the performance of the antenna can change significantly when it is installed on a platform. Antenna design software, such as WIPL-D, addresses the difficulties of designing antennas with broadband performance by providing optimization software that can automatically resize the various antenna dimensions until a desired performance criterion is met. At high-frequencies, the electrically large size of the platform makes it computationally difficult, or impossible, to directly consider the interactions between the antenna and the platform when designing the antenna in a full-wave solver. This paper describes an approach for the design and optimization of a discone antenna and then the subsequent installation on a large commercial aircraft. The antenna design will be optimized across a wide frequency range using WIPL-D Optimizer. The resulting discone antenna design is then imported into Savant-Hybrid, a hybrid asymptotic and full-wave solver, and the installed antenna performance is simulated using GPU acceleration at multiple potential antenna locations to determine the location that provides the least-degraded installed antenna performance.
(Tod Courtney, Matthew C. Miller, John E. Stone, and Robert A. Kipp: “Optimization of a Broadband Discone Antenna Design and Platform Installed Radiation Patterns Using a GPU-Accelerated Savant/WIPL-D Hybrid Approach”, Proceedings of the Applied Computational Electromagnetics Symposium (ACES 2012), Columbus, Ohio, April 2012. [PDF])
July 4th, 2012
The Virtual School of Computational Science and Engineering (VSCSE) helps graduate students, post-docs and young professionals from all disciplines and institutions across the country gain the skills they need to use advanced computational resources to advance their research. The VSCSE deploys conventional collaboration technologies in unconventional ways to create a national-scale virtual classroom that provides multiple high-quality audio and video channels for speakers, remote audiences, and various forms of content of immediate educational value to students.
June 27th, 2012
In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases that PT could be useful in addressing: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.
(Kshitij Gupta, Jeff A. Stuart and John D. Owens: “A Study of Persistent Threads Style GPU Programming for GPGPU Workloads”, Proceedings of Innovative Parallel Computing, May 2012. [WWW])
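The structure of the PT style, as it relates to the load-balancing use case above, can be sketched on the CPU. This is an analogy with hypothetical names (`run_persistent`, `collatz_steps`), not code from the paper: a fixed set of long-lived workers repeatedly claims the next task from a shared counter until the work runs out, instead of one thread being created per task. A lock stands in for the atomic increment a GPU kernel would use, and because of Python's global interpreter lock the sketch shows the control structure only, not a speedup.

```python
import threading


def run_persistent(tasks, process, num_workers=4):
    """Run `process` on every task with a fixed pool of long-lived workers.

    Returns (results, claimed), where claimed[w] is the number of tasks
    worker w ended up processing.
    """
    results = [None] * len(tasks)
    claimed = [0] * num_workers
    next_index = 0
    lock = threading.Lock()  # stands in for an atomic fetch-and-add

    def worker(worker_id):
        nonlocal next_index
        while True:
            with lock:  # claim the next unprocessed task index
                i = next_index
                next_index += 1
            if i >= len(tasks):
                return  # queue drained: the persistent worker exits
            results[i] = process(tasks[i])
            claimed[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, claimed


def collatz_steps(n):
    """Irregular work item: the number of Collatz steps varies widely with n."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


if __name__ == "__main__":
    results, claimed = run_persistent(list(range(1, 2001)), collatz_steps)
    print("longest chain:", max(results))
    print("tasks claimed per worker:", claimed)
```

Because workers fetch tasks dynamically, a worker that draws cheap tasks simply claims more of them, which is how the pattern evens out irregular workloads.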
Acceleware has announced two training courses:
Developed in partnership with AMD, this four-day course (August 21-24, 2012) is designed for GPU programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU. Register before July 31 and receive $200 off your course fee! Enter promotional code AXTEB2012.
Partnering with NVIDIA, this four-day course (July 17-20, 2012) is designed for programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU.