While new power-efficient computer architectures exhibit spectacular theoretical peak performance, they require specific conditions to operate efficiently, which makes porting complex algorithms a challenge. Here, we report results of the semi-implicit method for pressure linked equations (SIMPLE) and the pressure implicit with operator splitting (PISO) methods implemented on the graphics processing unit (GPU). We examine the advantages and disadvantages of the full porting over a partial acceleration of these algorithms run on unstructured meshes. We found that the full-port strategy requires adjusting the internal data structures to the new hardware and proposed a convenient format for storing internal data structures on GPUs. Our implementation is validated on standard steady and unsteady problems and its computational efficiency is checked by comparing its results and run times with those of some standard software (OpenFOAM) run on central processing unit (CPU). The results show that a server-class GPU outperforms a server-class dual-socket multi-core CPU system running essentially the same algorithm by up to a factor of 4.
See also supplementary materials and the follow up at http://vratis.com/blog/?p=7.
(Tadeusz Tomczak, Katarzyna Zadarnowska, Zbigniew Koza, Maciej Matyka and Łukasz Mirosław: “Acceleration of iterative Navier-Stokes solvers on graphics processing units”, International Journal of Computational Fluid Dynamics, accepted, July 2013. [DOI])
From a recent press release:
AMD’s APP SDK is an essential resource for developers who wish to leverage the processing power of heterogeneous computing. OpenCL™ is the primary mechanism for achieving this today, but AMD’s goal is to enable developers to accelerate applications with the programming paradigm of their choice. Toward that end, AMD has added support for heterogeneous libraries such as the newly released Bolt open source C++ template library and OpenCV computer vision library which now includes heterogeneous acceleration.
New to APP SDK 2.8.1:
Bolt: With the recent launch of Bolt 1.0, AMD has added several samples to the APP SDK to demonstrate Bolt 1.0 features. These showcase the usage of Bolt APIs such as scan, sort, reduce and transform. Other new samples highlight the ease of porting from STL and the performance benefits achieved over equivalent STL implementations. We’ve also included samples to demonstrate the different fallback options available in Bolt 1.0 when no GPU is available which ensure your code runs correctly on any platform.
OpenCV: AMD has been working closely with the OpenCV open source community to add heterogeneous acceleration capability to the world’s most popular computer vision library. These changes are already integrated into OpenCV and are readily available for developers who want to improve performance and efficiency of their computer vision applications. AMD has included samples to illustrate these improvements and highlight how simple it is to include them in your app.
GCN: AMD recently launched its new Graphics Core Next (GCN) architecture on several AMD products. GCN is based on a scalar architecture vs. the VLIW vector architecture of prior generations, so hand-tuned vectorization to optimize hardware utilization is no longer needed. We’ve modified several samples in AMD APP SDK 2.8.1 to show the ease of writing scalar code as compared to vectorization.
For more information, see developer.amd.com.
The Thrust team is pleased to announce the release of Thrust v1.7, an open-source C++ library for developing high-performance parallel applications. Modeled after the C++ Standard Template Library, Thrust brings a familiar abstraction layer to the realm of parallel computing
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well stencil versions of partitioning algorithms. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance.
Thrust is open-source software distributed under the OSI-approved Apache License 2.0.
LibBi is a software package for state-space modelling and Bayesian inference on modern computer hardware, including multi-core central processing units (CPUs), many-core graphics processing units (GPUs) and distributed-memory clusters of such devices. The software parses a domain-specific language for model specification, then optimises, generates, compiles and runs code for the given model, inference method and hardware platform. In presenting the software, this work serves as an introduction to state-space models and the specialised methods developed for Bayesian inference with them. The focus is on sequential Monte Carlo (SMC) methods such as the particle filter for state estimation, and the particle Markov chain Monte Carlo (PMCMC) and SMC^2 methods for parameter estimation. All are well-suited to current computer hardware. Two examples are given and developed throughout, one a linear three-element windkessel model of the human arterial system, the other a nonlinear Lorenz ’96 model. These are specified in the prescribed modelling language, and LibBi demonstrated by performing inference with them. Empirical results are presented, including a performance comparison of the software with different hardware configurations.
(Lamwrence M. Murray: “Bayesian state-space modelling on high-performance hardware using LibBi”, Preprint, June 2013. [arXiv])
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
(Eklund, A., Dufort, P., Forsberg, D., LaConte, S.M., Medical Image Processing on the GPU – Past, Present and Future, Medical Image Analysis. [DOI])
This 1-hour webinar (June 11, 10am-11am PST) introduces the powerful OpenCV library, shows how this library has been accelerated using CUDA on NVIDIA GPUs, and demonstrates how to use the OpenCV GPU library to create lightning-fast applications. Free registration: http://bit.ly/11eqoaJ
We present a novel, Linear Programming (LP) based scheduling algorithm that exploits heterogeneous multi-core architectures such as CPUs and GPUs to accelerate a wide variety of proximity queries. To represent complicated performance relationships between heterogeneous architectures and different computations of proximity queries, we propose a simple, yet accurate model that measures the expected running time of these computations. Based on this model, we formulate an optimization problem that minimizes the largest time spent on computing resources, and propose a novel, iterative LP-based scheduling algorithm. Since our method is general, we are able to apply our method into various proximity queries used in five different applications that have different characteristics. Our method achieves an order of magnitude performance improvement by using four different GPUs and two hexa-core CPUs over using a hexa-core CPU only. Unlike prior scheduling methods, our method continually improves the performance, as we add more computing resources. Also, our method achieves much higher performance improvement compared with prior methods as heterogeneity of computing resources is increased. Moreover, for one of tested applications, our method achieves even higher performance than a prior parallel method optimized manually for the application. We also show that our method provides results that are close (e.g., 75%) to the performance provided by a conservative upper bound of the ideal throughput. These results demonstrate the efficiency and robustness of our algorithm that have not been achieved by prior methods.
(Duksu Kim, Jinkyu Lee, Junghwan Lee, Insik Shin, John Kim and Sung-eui Yoon: “Scheduling in Heterogeneous Computing Environments for Proximity Queries”, IEEE Transactions on Visualization and Computer Graphics, to appear, 2013. [WWW])
The new Intel® SDK for OpenCL* Applications XE 2013 includes certified OpenCL 1.2 support for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors using Linux* operating systems. This SDK is targeted at developers of highly parallel applications including High Performance Compute (HPC), workstations, and data analytics, to name just a few. OpenCL broadens the parallel programming options on Intel® architecture and allows developers to maximize data parallel application performance on Intel Xeon Phi coprocessors.
The Intel SDK for OpenCL Applications XE 2013 provides developers OpenCL runtime and compiler, development tools, optimization guides, code samples, and training collaterals. More information: www.intel.com/software/opencl-xe
Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, the memory locality can be often improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61x faster for the examples tested.
(J. Filipovič, M. Madzin, J. Fousek, L. Matyska: “Optimizing CUDA Code By Kernel Fusion – Application on BLAS”, submitted to Parallel Computing, May 2013. [preprint])
Communicating data within the graphic processing unit (GPU) memory system and between the CPU and GPU are major bottlenecks in accelerating Krylov solvers on GPUs. Communication-avoiding techniques reduce the communication cost of Krylov subspace methods by computing several vectors of a Krylov subspace “at once,” using a kernel called “matrix powers.” The matrix powers kernel is implemented on a recent generation of NVIDIA GPUs and speedups of up to 5.7 times are reported for the communication-avoiding matrix powers kernel compared to the standards prase matrix vector multiplication (SpMV) implementation.
(M. Mehri Dehnavi, Y. El-Kurdi, J. Demmel and D. Giannacopoulos: “Communication-Avoiding Krylov Techniques on Graphic Processing Units”, IEEE Transactions on Magnetics 49(5):1749-1752, May 2013. [DOI])