The goal of this workshop, held in conjunction with ASPLOS XVI (Newport Beach, CA, USA, March 5-6, 2011), is to provide a forum to discuss new and emerging general-purpose programming environments and platforms, as well as to evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s workshop is particularly interested in new heterogeneous GPU platforms. Papers are being sought on many aspects of GPUs, including (but not limited to):
- GPU applications
- GPU programming environments
- GPU architectures
- Multi-GPU systems
- GPU compilation
- GPU power/efficiency
- GPU benchmarking/measurements
- Heterogeneous GPU platforms
Paper Submission: Authors should submit an 8-page paper in ACM double-column style, following the directions on the conference website at http://www.ece.neu.edu/GPGPU.
Organizers: John Cavazos (University of Delaware) and David Kaeli (Northeastern University)
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks.
(John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten: “GPU-Accelerated Molecular Modeling Coming of Age”, Journal of Molecular Graphics and Modelling, Volume 29, Issue 2, September 2010, Pages 116-125. [DOI])
The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures by both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, published studies claiming performance improvements in excess of 30,000x are misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration, but studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare.
In this paper we present an analysis of a port of the NAS LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA) – the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices.
Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices to several processors including those from Intel, AMD and IBM.
In addition to this we utilise a recently developed performance model of LU. With this we predict the runtime performance of LU on large-scale distributed GPU clusters, which are predicted to become commonplace in future high-end HPC architectural solutions.
(S.J. Pennycook, S.D. Hammond, S.A. Jarvis and G.R. Mudalige: “Implementation of the NAS-LU Benchmark”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held as part of Supercomputing 2010 (SC’10), New Orleans, LA, USA. [PDF])
All talks from the 2010 GPU Technology Conference (as well as archived presentations from GTC 2009) are now available from NVIDIA.
For those who missed this year’s GPU Technology Conference (GTC), and for those who attended but had a hard time choosing between all the concurrent sessions, NVIDIA has publicly released streamed recordings, video and slides from most GTC sessions.
There is content available for all types of programmers and developers. Those just getting started programming GPUs may want to take a look at the pre-conference tutorials, which provide an in-depth look at topics such as CUDA C, OpenCL, OpenGL and Parallel Nsight.
From a press release:
SANTA CLARA, CA — (Marketwire) — 10/28/2010 — Tianhe-1A, a new supercomputer revealed today at HPC 2010 China, has set a new performance record of 2.507 petaflops, as measured by the LINPACK benchmark, making it the fastest system in China and in the world today.
Tianhe-1A epitomizes modern heterogeneous computing by coupling massively parallel GPUs with multi-core CPUs, enabling significant achievements in performance, size and power. The system uses 7,168 NVIDIA® Tesla™ M2050 GPUs and 14,336 CPUs; it would require more than 50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone.
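As a rough back-of-the-envelope reading of the figures in the release (not an official breakdown), the statement that more than 50,000 CPUs would be needed implies an upper bound on the LINPACK rate each CPU would contribute in the hypothetical CPU-only system. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the press release above.
linpack_flops = 2.507e15      # 2.507 petaflops (LINPACK)
gpus = 7168                   # Tesla M2050 GPUs
cpus = 14336                  # CPUs in the hybrid system
cpu_only_equivalent = 50000   # "more than 50,000 CPUs" for a CPU-only system

# Ratio of CPUs to GPUs in the hybrid system.
cpus_per_gpu = cpus / gpus

# The release says *more than* 50,000 CPUs, so this is an upper bound
# on the implied per-CPU LINPACK rate, in GFLOPS.
implied_gflops_per_cpu = linpack_flops / cpu_only_equivalent / 1e9

# Lower bound on how many times more CPUs the CPU-only system would need.
cpu_count_factor = cpu_only_equivalent / cpus

print(f"CPUs per GPU: {cpus_per_gpu:.0f}")
print(f"Implied LINPACK rate per CPU: at most about {implied_gflops_per_cpu:.1f} GFLOPS")
print(f"CPU-only system needs at least {cpu_count_factor:.2f}x as many CPUs")
```

These are derived quantities only; the release does not state how the measured performance is split between the GPUs and the CPUs.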
From a recent press release:
GPU Systems releases Matlab language bindings for the Libra SDK, a heterogeneous compute platform. Libra version 1.2, with its runtime compiler and environment, supports x86/x64, OpenGL, OpenCL and CUDA compute backends. This release brings a full BLAS 1, 2, 3 math library (matrix/vector, dense/sparse, real/complex, single/double) and extended functionality to the Matlab computing platform, executing on x86 CPUs and on GPUs from AMD and NVIDIA.
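For readers unfamiliar with the "BLAS 1, 2, 3" terminology: the three levels conventionally refer to vector-vector, matrix-vector, and matrix-matrix operations. The plain-Python sketch below illustrates one simplified representative per level. It does not use or represent the Libra SDK API; the function names are borrowed from standard BLAS naming, and the usual scaling factors of `gemv` and `gemm` are omitted for brevity.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """Level 2 (matrix-vector): return A * x, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def gemm(A, B):
    """Level 3 (matrix-matrix): return A * B, both given as lists of rows."""
    cols_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_of_B]
            for row in A]

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [0.5, 0.5]

v1 = axpy(2.0, x, y)   # vector result
v2 = gemv(A, x)        # vector result
m3 = gemm(A, A)        # matrix result
print(v1, v2, m3)
```

The amount of arithmetic per data element grows with the level, which is one reason Level 3 routines tend to benefit most from accelerators.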
In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using operations involving dot-products and additions. We implement this algorithm on an NVIDIA Fermi GPU (Tesla 2050) using CUDA, and also manually parallelize it for the Intel Xeon X5680 (Westmere) and IBM Power7 multi-core processors. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version on the multi-core CPUs. A number of optimizations were performed for the GPU implementation on the Fermi, including blocking for Fermi’s configurable on-chip memory architecture. Experimental results illustrate that on a single multi-core processor, the manually parallelized versions of the correlation application perform only a small factor slower than the CUDA version executing on the Fermi – 1.005s on Power7, 3.49s on Intel X5680, and 465ms on Fermi. On a two-processor Power7 system, performance approaches that of the Fermi (650ms), while the Intel version runs in 1.78s. These results conclusively demonstrate that performance of the GPU memory subsystem is critical to effectively harness its computational capabilities. For the correlation application, a significantly higher amount of effort was put into developing the GPU version when compared to the CPU ones (several days versus a few hours). Our experience presents compelling evidence that performance comparable to that of GPUs can be achieved with much greater productivity on modern multi-core CPUs.
(R. Bordawekar and U. Bondhugula and R. Rao: “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”, Technical Report, IBM T. J. Watson Research Center, 2010 [PDF])
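To make "a small factor slower" concrete, the runtimes quoted in the abstract can be expressed as ratios against the Fermi time. The sketch below uses only those quoted numbers; the dictionary labels are illustrative, and reading the 1.78s figure as a two-processor Intel result is an assumption based on the sentence structure of the abstract.

```python
# Runtimes quoted in the abstract above, in seconds.
fermi = 0.465
runtimes = {
    "Power7, one processor": 1.005,
    "Intel X5680, one processor": 3.49,
    "Power7, two processors": 0.650,
    # Assumption: the 1.78s figure refers to a two-processor Intel system.
    "Intel X5680, two processors": 1.78,
}

# Slowdown of each CPU configuration relative to the Fermi GPU.
slowdown = {name: t / fermi for name, t in runtimes.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x the Fermi runtime")
```

On these figures the CPU runtimes range from roughly 1.4 to 7.5 times the Fermi runtime, depending on the processor and the number of sockets.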
Researchers in industry and academia are invited to submit their latest research results to the “Reconfigurable and GPU Computing” track at the 9th ACS/IEEE (pending approval) International Conference on Computer Systems and Applications (AICCSA 2011). The conference website is http://www.aiccsa.org. The deadline for submission is Nov. 8, 2010.
Reconfigurable & GPGPU topics include:
- Algorithms and mathematical applications
- Languages and system software
- Hardware implementation and supporting technologies
- Theoretical models and performance estimation
- Simulation environments and prototyping
- Case studies and comparisons of real-life technologies
- Run time reconfiguration
- Energy efficiency
- Architectural issues and tradeoffs
- Hybrid GPU/reconfigurable systems
- Hardware accelerators
From a recent press release:
ACUSIM Software, Inc., a leader in computational fluid dynamics (CFD) technology and solutions, today announced the immediate availability of AcuSolve™ 1.8, the latest version of ACUSIM’s leading general-purpose, finite-element based CFD solver. ACUSIM will demonstrate AcuSolve 1.8 during two free webinars, taking place at 9:30 a.m. – 10:30 a.m. ET and 6:30 p.m. – 7:30 p.m. ET, on Oct. 26, 2010, at http://www.acusim.com/html/events.html.
Used by designers and research engineers with all levels of expertise, AcuSolve is highly differentiated by its accelerated speed, robustness, accuracy and multiphysics/multidisciplinary capabilities. Contributing to its robustness is the product’s Galerkin/Least-Square (GLS) finite element formulation and novel iterative linear equation solver for the fully coupled equation system. The combination of these two powerful technologies provides a highly stable and efficient solver, capable of handling unstructured meshes with tight boundary layers automatically generated from complex industrial geometries.
IMPETUS Afea is proud to announce the launch of IMPETUS Afea Solver (version 1.0).
The IMPETUS Afea Solver is a non-linear explicit finite element tool. It is developed to predict large deformations of structures and components exposed to extreme loading conditions. The tool is applicable to transient dynamics and quasi-static loading conditions. The primary focus of the IMPETUS Afea Solver is accuracy, robustness and simplicity for the user. The number of purely numerical parameters that the user has to provide as input is kept to a minimum. The IMPETUS Afea Solver is adapted to GPU technology; utilizing the computational power of a capable graphics card can considerably speed up your calculations.
IMPETUS Afea Solver Video on YouTube
For more information or requests please contact email@example.com