This tutorial will begin with a brief overview of OpenCL and data-parallelism before focusing on the GPU programming model. We will explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax, and the work-item hierarchy. For more information and to register, visit: http://acceleware.com/event/introduction-opencl-using-amd-gpus
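As a rough illustration of the work-item hierarchy mentioned above, the following sketch emulates a one-dimensional OpenCL NDRange sequentially in plain Python. It is not OpenCL code and not taken from the tutorial; `run_ndrange` and `vector_add` are hypothetical names chosen for this example. The only OpenCL fact it relies on is that, with a zero global offset, a work-item's global ID equals its group ID times the work-group size plus its local ID.

```python
def run_ndrange(kernel, global_size, local_size, *args):
    """Call `kernel` once per work-item of a 1-D NDRange, sequentially.

    A real device runs work-items in parallel; this loop only reproduces
    the indexing scheme (hypothetical helper, not an OpenCL API).
    """
    if global_size % local_size != 0:
        # OpenCL 1.x requires the global size to be a multiple of the local size.
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            kernel(global_id, group_id, local_id, *args)


def vector_add(global_id, group_id, local_id, a, b, out):
    """Kernel body: each work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
out = [0] * len(a)
run_ndrange(vector_add, len(a), 4, a, b, out)  # 2 work-groups of 4 work-items
```

In this picture the host is responsible for choosing `global_size` and `local_size` and for passing the buffers, while the kernel only sees its own IDs.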
A new book titled “Numerical Computations with GPUs” has been published:
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including the solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to reusable, adaptable and scalable code fragments. This book also serves as a GPU implementation manual for many numerical algorithms, sharing tips on GPUs that can increase application efficiency. The valuable insights into parallelization strategies for GPUs are supplemented by ready-to-use code fragments. Numerical Computations with GPUs targets professionals and researchers working in high-performance computing and GPU programming. Advanced-level students focused on computer science and mathematics will also find this book useful as a secondary textbook or reference.
Boost.Compute is a header-only C++ library for GPGPU and parallel computing based on OpenCL. It provides a low-level C++ wrapper over OpenCL and a high-level STL-like API with containers and algorithms for the GPU. It is available on GitHub, and instructions for getting started can be found in the documentation. See the full announcement here: http://kylelutz.blogspot.com/2014/07/boost-compute-v0.3-released.html
We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.
(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: “Improving Cache Locality for GPU-based Volume Rendering”. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])
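To make the link between thread-block shape and warp access strides more concrete, here is a simplified sketch. It is not the authors' method: it assumes a plain row-major 2-D array rather than a 3-D texture, and the names `warp_addresses`, `max_stride` and `row_pitch` are hypothetical. It lists the linear addresses touched by one 32-thread warp for a given block width and height, optionally with transposed thread indexing.

```python
WARP_SIZE = 32  # threads per warp, as stated in the abstract


def warp_addresses(block_w, block_h, row_pitch, warp=0, transposed=False):
    """Linear addresses read by one warp of a block_w x block_h thread block.

    Assumes thread (tx, ty) reads element (tx, ty) of a row-major array
    with `row_pitch` elements per row (an assumption of this sketch).
    """
    if block_w * block_h < (warp + 1) * WARP_SIZE:
        raise ValueError("thread block too small to contain this warp")
    addresses = []
    for lane in range(WARP_SIZE):
        tid = warp * WARP_SIZE + lane
        if transposed:
            # Transposed indexing: consecutive threads walk down a column.
            tx, ty = tid // block_h, tid % block_h
        else:
            # Default indexing: consecutive threads walk along a row.
            tx, ty = tid % block_w, tid // block_w
        addresses.append(ty * row_pitch + tx)
    return addresses


def max_stride(addresses):
    """Largest gap between neighbouring addresses after sorting."""
    ordered = sorted(addresses)
    return max(b - a for a, b in zip(ordered, ordered[1:]))


wide = max_stride(warp_addresses(32, 8, row_pitch=512))    # one contiguous row
narrow = max_stride(warp_addresses(4, 64, row_pitch=512))  # warp spans 8 rows
```

Under these assumptions the wide block keeps a warp inside a single row, while the narrow block forces the same warp to jump between rows, which is the kind of stride the block-shape selection in the abstract aims to reduce.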
Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm³ brain template in 4–6 s, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from GitHub: https://github.com/wanderine/BROCCOLI.
(A. Eklund, P. Dufort, M. Villani and S. LaConte: “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”. Front. Neuroinform. 8:24, 2014. [DOI])
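For readers unfamiliar with non-parametric permutation tests, the sketch below shows the general idea on two small groups of numbers. It is a generic two-sample label-permutation test written for illustration, not BROCCOLI's implementation; the function name `permutation_p_value`, the mean-difference statistic and the fixed seed are assumptions of this example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    The null distribution is estimated by repeatedly shuffling the pooled
    values and re-splitting them into groups of the original sizes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def statistic(values):
        mean_a = sum(values[:n_a]) / n_a
        mean_b = sum(values[n_a:]) / (len(values) - n_a)
        return abs(mean_a - mean_b)

    observed = statistic(pooled)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if statistic(pooled) >= observed:
            at_least_as_extreme += 1
    # Counting the observed labelling itself keeps the p-value above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

Each permutation is independent of the others, which is why this kind of test parallelizes well across many cores.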
This master’s thesis by Markus Konrad analyzes the potential of GPGPU on mobile devices such as smartphones and tablets. The central question is whether and how the GPU on such devices can be used to speed up certain algorithms, especially in the field of image processing. GPU computing technologies such as OpenCL, OpenGL shaders, and Android RenderScript are assessed in the thesis. The abstract reads as follows:
This thesis studies how certain popular algorithms in the field of image and audio processing can be accelerated on mobile devices by means of parallel execution on their graphics processing unit (GPU). Several technologies with which this can be achieved are compared in terms of possible performance improvements, hardware and software support, as well as limitations of their programming model and functionality. The results of this research are applied in a practical project, consisting of performance improvements for marker detection in an Augmented Reality application for mobile devices.
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:
- Windows support – full Windows support for all backends (CUDA, OpenCL, OpenMP)
- Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
- Direct (dense) solvers (for very small problems)
- (Restricted) Additive Schwarz preconditioners
- MATLAB/Octave plug-in
To avoid OpenMP overhead for small problems, the library computes in serial if the size of the matrix/vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified to simplify cross-platform compilation.
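The size-threshold idea described above can be sketched as follows. This is not PARALUTION code: the threshold value, the function names and the use of a Python thread pool in place of OpenMP are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

SERIAL_THRESHOLD = 10_000  # hypothetical cut-off, not PARALUTION's value


def _chunk_dot(task):
    """Dot product over the half-open index range [start, stop)."""
    x, y, start, stop = task
    return sum(x[i] * y[i] for i in range(start, stop))


def dot(x, y, workers=4, threshold=SERIAL_THRESHOLD):
    """Return (dot product, path taken), going parallel only for large inputs."""
    n = len(x)
    if len(y) != n:
        raise ValueError("vectors must have the same length")
    if n < threshold or workers < 2:
        # Small problem: starting worker threads would cost more than it saves.
        return _chunk_dot((x, y, 0, n)), "serial"
    step = -(-n // workers)  # ceiling division: chunk length per worker
    tasks = [(x, y, s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(_chunk_dot, tasks)), "parallel"
```

Both paths return the same value; only the dispatch differs, so the threshold can be tuned without affecting results.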
Boost.Compute v0.2 has been released! Boost.Compute is a header-only C++ library for GPGPU and parallel computing based on OpenCL. It is available on GitHub, and instructions for getting started can be found in the documentation. Since version 0.1 (released almost two months ago), new algorithms including unique(), search() and find_end() have been added, along with several bug fixes. See the project page on GitHub for more information: https://github.com/kylelutz/compute
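The three algorithms named above share their names with the C++ standard library algorithms std::unique, std::search and std::find_end. Assuming the Boost.Compute versions follow those semantics, the plain-Python sketch below shows what each one computes. It is not Boost.Compute code and runs on the CPU only; positions are returned as indices, with `len(seq)` standing in for the end iterator.

```python
def unique(seq):
    """Collapse runs of equal consecutive elements (std::unique semantics)."""
    result = []
    for item in seq:
        if not result or result[-1] != item:
            result.append(item)
    return result


def search(seq, pattern):
    """Index of the first occurrence of `pattern` in `seq`, else len(seq).

    Both arguments are expected to be lists.
    """
    n, m = len(seq), len(pattern)
    for i in range(n - m + 1):
        if seq[i:i + m] == pattern:
            return i
    return n


def find_end(seq, pattern):
    """Index of the last occurrence of `pattern` in `seq`, else len(seq)."""
    n, m = len(seq), len(pattern)
    if m == 0:
        return n  # std::find_end returns the end for an empty pattern
    for i in range(n - m, -1, -1):
        if seq[i:i + m] == pattern:
            return i
    return n
```

Note that, as in the STL, `unique` only removes adjacent duplicates, so an unsorted input can still contain repeated values afterwards.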
This hands-on four-day course will teach you how to write applications in OpenCL that fully leverage the multi-core processing capabilities of the GPU. The course is taught by Acceleware developers who bring real-world experience to the classroom. Students will benefit from:
- Hands-on exercises and progressive lectures
- Individual laptops with AMD Fusion APU for student use
- Small class sizes to maximize learning
- 90 days of post-training support
For more information please visit: http://acceleware.com/training/1028
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.6.0 version provides the following new features:
- Windows support (OpenMP backend)
- FGMRES (Flexible GMRES)
- (R)CMK (Cuthill–McKee) ordering
- Thread-core affiliation (for Host OpenMP)
- Asynchronous transfers (CUDA backend)
- Pinned memory allocation on the host when using CUDA backend
- Verbose output for debugging
- Easy-to-use timing function in the examples
PARALUTION 0.6.0 is available at http://www.paralution.com.
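As background for the (R)CMK item in the list above, the following is a minimal textbook-style sketch of (reverse) Cuthill–McKee ordering, which renumbers the rows of a sparse matrix to reduce its bandwidth. It is not PARALUTION's implementation; the adjacency-list input format and the names `cuthill_mckee` and `bandwidth` are assumptions of this example.

```python
from collections import deque


def cuthill_mckee(adj, reverse=True):
    """Return a (reverse) Cuthill-McKee ordering of the graph `adj`.

    `adj[v]` lists the neighbours of vertex v, i.e. the off-diagonal
    nonzeros in row v of a structurally symmetric sparse matrix.
    """
    n = len(adj)
    degree = [len(neighbours) for neighbours in adj]
    visited = [False] * n
    order = []
    # Start each connected component from an unvisited lowest-degree vertex.
    for start in sorted(range(n), key=lambda v: (degree[v], v)):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, visiting low-degree neighbours first.
            for w in sorted(adj[v], key=lambda u: (degree[u], u)):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1] if reverse else order


def bandwidth(adj, order):
    """Largest distance between connected vertices under `order`."""
    position = {v: i for i, v in enumerate(order)}
    return max(
        (abs(position[v] - position[w]) for v in range(len(adj)) for w in adj[v]),
        default=0,
    )


# A path graph 0-3-1-4-2 whose vertex labels are scrambled.
path = [[3], [3, 4], [4], [0, 1], [1, 2]]
rcm_order = cuthill_mckee(path)
```

For this small path graph the reordering brings every pair of connected vertices next to each other, so the bandwidth drops from 3 under the original numbering to 1.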