Boost.Compute is an open-source, header-only C++ library for GPGPU and parallel-computing based on OpenCL. It provides a low-level C++ wrapper over OpenCL and high-level STL-like API with containers and algorithms for the GPU. Boost.Compute is available on GitHub and its documentation can be found here. See the full announcement here: http://kylelutz.blogspot.com/2014/12/boost-compute-0.4-released.html
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.8.0 release provides the following extra features:
- Complex support
- TNS, Variable preconditioner
- BiCGStab(l), QMRCGStab, FCG solvers
- RS and PairWise AMG
- SIRA eigenvalue solver
- Replace/Extract column/row functions
- Stencil computation
For details, visit http://www.paralution.com.
This webinar provides an overview of the improved analysis performance tools available in CUDA 6.0 and key optimization strategies for compute, latency and memory bound problems. The webinar includes techniques for ensuring peak utilization of CUDA cores, how to improve branching efficiency, intrinsic functions and loop unrolling. Optimal access patterns for global and shared memory are presented, including a comparison between the Fermi and Kepler architectures. To view the webinar go to: http://acceleware.com/blog/webinar-essential-cuda-optimization-techniques
Developed in partnership with NVIDIA, this hands-on four day course will teach you how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. This course will have a finance focus. Commonly used algorithms such as random number generation and Monte Carlo simulations will be used and profiled in examples. A background in finance is not necessary. For more information please visit: http://acceleware.com/training/988
The Cf4ocl project is a GPLv3/LGPLv3 initiative to provide an object-oriented interface to the OpenCL C API with integrated profiling, promoting the rapid development of OpenCL host programs and avoiding boilerplate code. Its main goal is to allow developers to focus on OpenCL device code. After two alpha releases, the first beta is out, and can be tested on Linux, Windows and OS X. The framework is independent of the OpenCL platform version and vendor, and includes utilities to simplify the analysis of the OpenCL environment and of kernel requirements. While the project is making progress, it doesn’t yet offer OpenGL/DirectX interoperability, support for sub-devices, and doesn’t support pipes and SVM.
Cf4ocl can be downloaded from http://fakenmc.github.io/cf4ocl/.
Version 2.0 of OpenCLIPP, an Open Source OpenCL library for computer vision and image processing primitives, bas been released. For more information about the library, for programming contributions and for download, please refer to the OpenCLIPP Website.
This tutorial will begin with a brief overview of OpenCL and data-parallelism before focusing on the GPU programming model. We will explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax and work-item hierarchy. For more information and to register visit: http://acceleware.com/event/introduction-opencl-using-amd-gpus
CUDPP release 2.2 is a feature release that adds a new parallel primitive and improves some existing primitives. We have added cudppSuffixArray, a parallel skew algorithm (SA) implementation that computes the suffix array of a string. This suffix array primitive is now used in burrowsWheelerTransform, delivering better performance than CUDPP 2.1’s use of cudppStringSort. The new BWT is further used in cudppCompress, which is now faster than the original parallel compression and supports compression of text containing all possible unsigned char values. Some bugs in cudppMoveToFrontTransform and cudppStringSort have also been fixed. OS X users might also be interested in how we supported the use of OS X’s clang compiler in OS X Mavericks (10.9).
SpeedIT FLOW is a RANS single-phase fluid flow solver that runs fully on GPU. Benchmark results on external aero flow and other industry-relevant OpenFOAM cases on a GPU card indicate approximately 3x faster time to solution vs. Intel Xeon E5649 running 12 cores. This is about two times faster than competing solutions that offer only partial acceleration on GPU. More details are available on this blog.
This hands-on four day course teaches how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. More details and registration: http://acceleware.com/training/986