Archives for Heterogeneous Computing
August 17th, 2011
Abstract:
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our 8-GPU performance is comparable with the performance of a 256-GPU version of the FMM that won the 2009 Bell prize.
(Qi Hu, Nail A. Gumerov and Ramani Duraiswami: “Scalable fast multipole methods on distributed heterogeneous architectures”, accepted for SC’11. [PDF])
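For readers new to the problem, the "N-body sum" that the FMM accelerates can be written as a direct double loop. The following is only a minimal sketch of that brute-force baseline, not the paper's method: the 1/r potential kernel, the function name `direct_potentials`, and the sample points are all assumptions chosen for illustration.

```python
import math


def direct_potentials(positions, charges):
    """Return phi[i] = sum over j != i of charges[j] / |x_i - x_j|.

    This is the O(N^2) direct sum; the FMM approximates the same
    quantity at much lower cost. The 1/r kernel is an assumption.
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            phi[i] += charges[j] / math.dist(positions[i], positions[j])
    return phi


# Hypothetical three-particle example with unit charges.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
unit_charges = [1.0, 1.0, 1.0]
potentials = direct_potentials(points, unit_charges)
```

The cost of this loop grows quadratically with the number of particles, which is why a hierarchical method such as the FMM becomes attractive for large N.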
Posted in Research | Tags: FMM, Heterogeneous Computing, Molecular Dynamics, N-Body, Papers
June 26th, 2011
We are pleased to announce “Programming of Heterogeneous Systems in Physics”, a three-day workshop to be held on 5-7 October 2011 at Friedrich-Schiller University, Jena, Germany. This workshop will focus on:
- Solving partial differential equations efficiently on heterogeneous computing systems. There is some emphasis on GPU computing, but other accelerators and the efficient use of large multi-core cluster nodes are considered as well.
- Optimization of computational kernels coming from finite differences, spectral methods, and lattice gauge theory on accelerators.
We plan to have a tutorial day, two days of talks and a poster session. We plan for discussion and talks to provide an overview of current work in these areas, and to develop future lines of research and collaborations. The deadline for submission of talks is 15 August 2011.
Please visit http://wwwsfb.tpi.uni-jena.de/Events/Event-PHSP11.shtml for more information. This workshop is organised by G. Zumbusch (Chair, Jena), B. Bruegmann (Jena), A. Weyhausen (Jena), L. Rezzolla (Potsdam) and B. Zink (Tuebingen).
Posted in Events | Tags: Heterogeneous Computing, Physics Simulation, Workshops
June 26th, 2011
Abstract:
This paper describes the approach and the speedup obtained in performing Smith-Waterman database searches on heterogeneous platforms comprising multi-core CPU and multi-GPU systems. Most of the advanced and optimized Smith-Waterman algorithm versions have demonstrated remarkable speedup over NCBI BLAST versions, viz., SWPS3 based on x86 SSE2 instructions and CUDASW++ v2.0 CUDA implementation on GPU. This work proposes a hybrid Smith-Waterman algorithm that integrates the state-of-the-art CPU and GPU solutions for accelerating the Smith-Waterman algorithm, in which the GPU acts as a co-processor and shares the workload with the CPU, enabling us to realize remarkable performance of over 70 GCUPS resulting from simultaneous CPU-GPU execution. In this work, both CPU and GPU are graded equally in performance for Smith-Waterman, in contrast to previous approaches of porting the computationally intensive portions onto the GPUs or a naive multi-core CPU approach.
(J. Singh and I. Aruni: “Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems”, Proceedings of Bioinformatics and Biomedical Engineering (iCBBE), May 2011. [DOI])
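As background, the Smith-Waterman recurrence and the GCUPS metric (billions of dynamic-programming cell updates per second) can be sketched in a few lines. This is a plain reference version and not the hybrid CPU-GPU scheme of the paper: the linear gap penalty, the scoring values, and the names `smith_waterman_score` and `gcups` are assumptions for illustration, whereas tools such as SWPS3 and CUDASW++ use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic-programming matrix are kept, since
    each cell depends on its left, upper, and upper-left neighbours.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_length, database_residues, seconds):
    """Billions of cell updates per second for one database search."""
    return query_length * database_residues / seconds / 1e9


score = smith_waterman_score("TACGT", "ACG")
```

Every cell of the matrix is one "cell update", so a search of a query of length m against a database with n residues in total performs m × n updates; that product divided by the run time gives the GCUPS figure quoted in the abstract.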
Posted in Research | Tags: Bioinformatics, Computational Biology, Heterogeneous Computing, Papers, Sequence Alignment
March 29th, 2011
Heterogeneous computing is moving into the mainstream, and a broader range of applications is already on the way. As the provider of world-class CPUs, GPUs, and APUs, AMD offers unique insight into these technologies and how they interoperate. We’ve been working with industry and academia partners to help advance real-world use of these technologies, and to understand the opportunities that lie ahead. It’s time to share what we’ve learned so far.
With tutorials, hands-on labs, and sessions that span a range of topics from HPC to multimedia, you’ll have the opportunity to expand your view of what heterogeneous computing currently offers and where it is going. You’ll hear from industry innovators and academic pioneers who are exploring different ways of approaching problems, and utilizing new paradigms in computing to help identify solutions. You’ll meet AMD experts with deep knowledge of hardware architectures and the software techniques that best leverage those platforms. And you’ll connect with other software professionals who share your passion for the future of technology.
Learn more at developer.amd.com/afds.
Posted in Developer Resources, Events | Tags: AMD, Computer Graphics, Conferences, Heterogeneous Computing, High-Performance Computing, OpenCL, Tools, Tutorials & Courses
March 29th, 2011
Abstract:
We present a computational method of coupling average interpolating wavelets with high-order finite volume schemes and its implementation on heterogeneous computer architectures for the simulation of multiphase compressible flows. The method is implemented to take advantage of the parallel computing capabilities of emerging heterogeneous multicore/multi-GPU architectures. A highly efficient parallel implementation is achieved by introducing the concept of wavelet blocks, exploiting the task-based parallelism for CPU cores, and by managing asynchronously an array of GPUs by means of OpenCL. We investigate the comparative accuracy of the GPU and CPU based simulations and analyze their discrepancy for two-dimensional simulations of shock-bubble interaction and Richtmyer–Meshkov instability. The results indicate that the accuracy of the GPU/CPU heterogeneous solver is competitive with the one that uses exclusively the CPU cores. We report the performance improvements by employing up to 12 cores and 6 GPUs compared to the single-core execution. For the simulation of the shock-bubble interaction at Mach 3 with two million grid points, we observe a 100-fold speedup for the heterogeneous part and an overall speedup of 34.
(Rossinelli D., Hejazialhosseini B., Spampinato D., Koumoutsakos P.: “Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids”, SIAM Journal on Scientific Computing 33:512-540, 2011 [DOI])
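One way to relate the two speedup figures in the abstract is an Amdahl-type estimate. The sketch below assumes that the single-core run time splits cleanly into an accelerated part and an unaccelerated remainder; the abstract does not state this model, and the function names are hypothetical.

```python
def amdahl_speedup(fraction, part_speedup):
    """Overall speedup when `fraction` of the run time is sped up by `part_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


def accelerated_fraction(overall_speedup, part_speedup):
    """Solve amdahl_speedup(fraction, part_speedup) == overall_speedup for fraction."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / part_speedup)


# Figures from the abstract: 100-fold on the heterogeneous part, 34 overall.
fraction = accelerated_fraction(34.0, 100.0)
```

Under that assumption, `fraction` comes out at 50/51, so roughly 98% of the single-core run time would lie in the accelerated part, and the remaining 2% or so would be what caps the overall speedup at 34.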
Posted in Research | Tags: Fluid Simulation, Heterogeneous Computing, OpenCL, Papers, wavelets
July 29th, 2010
Abstract:
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications.
This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
This paper identifies several key areas of research and open problems for optimizing the performance of data parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86. The complete implementation of Ocelot is available open-source under the new BSD license at http://code.google.com/p/gpuocelot. Ongoing work involves translating PTX to AMD’s IL allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.
(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: “Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems”. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT2010), September 2010.)
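The idea of running a grid of data-parallel threads on a multithreaded CPU can be illustrated with a toy launcher. This is a conceptual sketch only and does not reflect how Ocelot translates PTX through LLVM: `launch_on_cpu` and `saxpy_kernel` are hypothetical names, the kernel is ordinary Python rather than translated code, and the loop over a block's threads is only valid for kernels without barriers.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Per-thread body: each logical thread handles one array element."""
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # guard threads that fall past the end of the data
        out[i] = a * x[i] + y[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args, workers=4):
    """Run every (block, thread) pair of a 1-D grid on a CPU thread pool.

    Each block becomes one task; the threads inside a block are executed
    one after another, which assumes the kernel contains no barrier.
    """
    def run_block(block_idx):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a block.
        list(pool.map(run_block, range(grid_dim)))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
launch_on_cpu(saxpy_kernel, 2, 4, 2.0, x, y, out)  # 2 blocks of 4 threads
```

Treating a block as the unit of scheduling keeps the task count low on a CPU with few cores. In CPython the global interpreter lock means this sketch shows the mapping rather than any real speedup.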
Posted in Developer Resources, Research | Tags: Compilers, Heterogeneous Computing, NVIDIA CUDA, Ocelot, Papers
June 8th, 2010
GPU Systems has added an OpenCL back end implementation to its Libra Technology compiler and runtime architecture. Libra version 1.2 now supports x86/x64, OpenGL/OpenCL and CUDA compute back ends. The OpenCL back end generates dynamic code specifically for AMD GPUs. The CUDA back end generator has also been enhanced with Fermi capabilities. This new release brings full BLAS 1, 2, 3 standard math library functionality (matrix, vector, dense, sparse, complex, single/double), accessible through a standard C programming interface and library. The high-level approach of the Libra API enables developers to easily extend existing high-level functionality from their favorite programming language.
Posted in Business, Developer Resources | Tags: AMD, APIs, Heterogeneous Computing, NVIDIA CUDA, OpenCL, OpenGL
May 13th, 2010
Abstract:
Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
(A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli: “State-of-the-Art in Heterogeneous Computing”, IOS Press, 18(1) (2010), pp. 1-33. Link to PDF)
Posted in Research | Tags: Cell BE, FPGAs, GPUs, Heterogeneous Computing, Papers
March 20th, 2010
Abstract:
A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce the cost and form factor, an emerging trend is to integrate the GPU along with the memory controllers onto the same die with the processor cores. However, given such a system-on-chip, the GPU, while occupying a substantial part of the silicon, will sit idle and contribute nothing to the overall system performance when running non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme, to leverage the GPU resource for improving single-threaded performance on an integrated system. By harnessing the GPU shaders with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher using the idle GPU and successfully improve the memory performance of single-thread applications. Moreover, due to the flexibility and programmability offered by COMPASS, one can implement the best performing prefetch scheme to improve each specific application as demonstrated in this paper. With COMPASS, we envision that a future application vendor can provide a custom-designed COMPASS shader bundled with their software to be loaded at runtime to optimize the performance. Our simulation results show that COMPASS can improve the single-thread performance of memory-intensive applications by 68% on average.
(Dong Hyuk Woo and Hsien-Hsin S. Lee: “COMPASS: A Programmable Data Prefetcher Using Idle GPU Shaders”. To appear in the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, Pittsburgh, PA, Mar. 2010)
Posted in Research | Tags: Heterogeneous Computing, Papers