IWOCL (“eye-wok-ul”) is an annual meeting of developers, researchers and suppliers to promote the use, evolution and advancement of the OpenCL parallel programming open standard. IWOCL 2014 will take place in Bristol, England on May 12-13, 2014. For additional information visit http://www.iwocl.org
VexCL is a modern C++ library created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides a convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector multiplication, etc. The source code is available under the permissive MIT license. As of v1.0.0, VexCL provides two backends: OpenCL and CUDA. Users may choose either of those at compile time with a preprocessor macro definition. More information is available at the GitHub project page and release notes page.
AMD CodeXL is a free set of tools for GPU debugging, GPU profiling, static analysis of OpenCL kernels, and CPU profiling, including support for remote servers. For more information and download links, see: http://developer.amd.com/community/blog/2013/11/08/codexl-1-3-released/
Bolt is an STL compatible C++ template library for creating data-parallel applications using C++ (no C++ AMP / OpenCL code required). For more information about the Bolt template library and download links, see: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/bolt-c-template-library/
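To give a sense of what STL compatibility means in practice, here is a minimal sketch that uses only standard C++ algorithms: sort, transform, reduce (via std::accumulate) and an inclusive scan (via std::partial_sum). It does not call Bolt itself; the names `PipelineResult` and `stl_pipeline` are hypothetical and exist only for this illustration. The assumption is that a Bolt port would keep the same iterator-based call shape, so consult the Bolt documentation for the actual headers and signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical result bundle used only for this illustration.
struct PipelineResult {
    std::vector<int> sorted;   // input sorted in ascending order
    std::vector<int> squared;  // each sorted element squared
    std::vector<int> prefix;   // inclusive prefix sums of `squared`
    int total;                 // sum of all elements of `squared`
};

// Plain STL baseline: sort, transform, reduce and scan on the CPU.
PipelineResult stl_pipeline(std::vector<int> data) {
    assert(!data.empty());
    PipelineResult r;

    // sort
    std::sort(data.begin(), data.end());
    r.sorted = data;

    // transform: square every element
    r.squared.resize(data.size());
    std::transform(data.begin(), data.end(), r.squared.begin(),
                   [](int x) { return x * x; });

    // reduce: sum of the squares
    r.total = std::accumulate(r.squared.begin(), r.squared.end(), 0);

    // scan: inclusive prefix sum of the squares
    r.prefix.resize(data.size());
    std::partial_sum(r.squared.begin(), r.squared.end(), r.prefix.begin());

    return r;
}
```

Calling `stl_pipeline` with a small vector of integers returns the sorted input, its squares, their running sums and the overall total in one struct.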
AMD APP SDK has everything needed to get started with OpenCL and parallel programming. It includes OpenCL samples that are very easy to compile, as well as the Bolt and other libraries. For more information about AMD APP SDK and download links, see: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
The Libra 3.0 Heterogeneous Cloud Computing SDK has recently been released by GPU Systems. It supports PC, Tablet and Mobile Devices and includes a new virtualizing function for cloud compute services of local and remote CPUs and GPUs. C/C++, Java, C# and Matlab are supported. Read the full press release here.
Feature detection and extraction are essential in computer vision applications such as image matching and object recognition. The Scale-Invariant Feature Transform (SIFT) algorithm is one of the most robust approaches to detect and extract distinctive invariant features from images. However, high computational complexity makes it difficult to apply the SIFT algorithm to mobile applications. Recent developments in mobile processors have enabled heterogeneous computing on mobile devices, such as smartphones and tablets. In this paper, we present an OpenCL-based implementation of the SIFT algorithm on a smartphone, taking advantage of the mobile GPU. We carefully analyze the SIFT workloads and identify the parallelism. We implemented major steps of the SIFT algorithm using both serial C++ code and OpenCL kernels targeting mobile processors, to compare the performance of different workflows. Based on the profiling results, we partition the SIFT algorithm between the CPU and GPU in a way that best exploits the parallelism and minimizes the buffer transferring time to achieve better performance. The experimental results show that we are able to achieve 8.5 FPS for keypoints detection and 19 FPS for descriptor generation without reducing the number and the quality of the keypoints. Moreover, the heterogeneous implementation can reduce energy consumption by 41% compared to an optimized CPU-only implementation.
(Guohui Wang, Blaine Rister, and Joseph R. Cavallaro: “Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013, [PDF])
The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.
We address both the issue of programmability and that of portable performance in this work. On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator for device-specific OpenCL kernels at runtime, which is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.
(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”, 5th USENIX Workshop on Hot Topics in Parallelism (HotPar ’13), 2013. [PDF])
From a recent press release:
AMD’s APP SDK is an essential resource for developers who wish to leverage the processing power of heterogeneous computing. OpenCL™ is the primary mechanism for achieving this today, but AMD’s goal is to enable developers to accelerate applications with the programming paradigm of their choice. Toward that end, AMD has added support for heterogeneous libraries such as the newly released Bolt open source C++ template library and OpenCV computer vision library which now includes heterogeneous acceleration.
New to APP SDK 2.8.1:
Bolt: With the recent launch of Bolt 1.0, AMD has added several samples to the APP SDK to demonstrate Bolt 1.0 features. These showcase the usage of Bolt APIs such as scan, sort, reduce and transform. Other new samples highlight the ease of porting from STL and the performance benefits achieved over equivalent STL implementations. We’ve also included samples to demonstrate the different fallback options available in Bolt 1.0 when no GPU is available, which ensure that your code runs correctly on any platform.
OpenCV: AMD has been working closely with the OpenCV open source community to add heterogeneous acceleration capability to the world’s most popular computer vision library. These changes are already integrated into OpenCV and are readily available for developers who want to improve performance and efficiency of their computer vision applications. AMD has included samples to illustrate these improvements and highlight how simple it is to include them in your app.
GCN: AMD recently launched its new Graphics Core Next (GCN) architecture on several AMD products. GCN is based on a scalar architecture vs. the VLIW vector architecture of prior generations, so hand-tuned vectorization to optimize hardware utilization is no longer needed. We’ve modified several samples in AMD APP SDK 2.8.1 to show the ease of writing scalar code as compared to vectorization.
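The two coding styles contrasted in the GCN item can be sketched in plain C++ rather than OpenCL C. This is an illustration only: `Float4`, `saxpy_scalar` and `saxpy_vec4` are hypothetical names, the code runs on the CPU, and it makes no claim about performance on any GPU. It simply contrasts one-element-per-iteration code, standing in for one work-item, with the hand-packed four-wide form that tuning for VLIW hardware tended to favor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-wide vector type, standing in for a packed float4.
struct Float4 { float x, y, z, w; };

// Scalar style: each loop iteration stands in for one work-item
// that computes exactly one output element.
std::vector<float> saxpy_scalar(float a,
                                const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}

// Hand-vectorized style: each iteration packs four elements and
// computes four outputs, so the length must be a multiple of 4.
std::vector<float> saxpy_vec4(float a,
                              const std::vector<float>& x,
                              const std::vector<float>& y) {
    assert(x.size() == y.size());
    assert(x.size() % 4 == 0);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); i += 4) {
        Float4 vx = { x[i], x[i + 1], x[i + 2], x[i + 3] };
        Float4 vy = { y[i], y[i + 1], y[i + 2], y[i + 3] };
        Float4 r  = { a * vx.x + vy.x, a * vx.y + vy.y,
                      a * vx.z + vy.z, a * vx.w + vy.w };
        out[i]     = r.x;
        out[i + 1] = r.y;
        out[i + 2] = r.z;
        out[i + 3] = r.w;
    }
    return out;
}
```

Both functions compute the same result; the scalar version needs no packing logic and no restriction on the input length, which is the ease-of-writing point the item makes.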
For more information, see developer.amd.com.
The new Intel® SDK for OpenCL* Applications XE 2013 includes certified OpenCL 1.2 support for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors using Linux* operating systems. This SDK is targeted at developers of highly parallel applications including High Performance Compute (HPC), workstations, and data analytics, to name just a few. OpenCL broadens the parallel programming options on Intel® architecture and allows developers to maximize data parallel application performance on Intel Xeon Phi coprocessors.
The Intel SDK for OpenCL Applications XE 2013 provides developers with an OpenCL runtime and compiler, development tools, optimization guides, code samples, and training material. More information: www.intel.com/software/opencl-xe
In this paper we evaluate the promise held by low-power GPUs for non-graphics workloads that arise in embedded systems. To this end, we map and implement five benchmarks, which find utility in very different application domains, on an embedded GPU. Our results show that, apart from accelerated performance, embedded GPUs are also promising because of their energy efficiency, which is an important design goal for battery-driven mobile devices. We show that adopting the same optimization strategies as those used for programming high-end GPUs might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs, such as limited or no user-defined memory, a small instruction set, and a limited number of registers, among others. We propose techniques to overcome such challenges, e.g., by distributing the workload between GPUs and multi-core CPUs, in the spirit of heterogeneous computation.
(Arian Maghazeh, Unmesh D. Bordoloi, Petru Eles and Zebo Peng: “General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?”, 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 15-18, 2013. [Preprint])
The 1st International Workshop on OpenCL (IWOCL) will be held on May 13-14 at the Georgia Institute of Technology in Atlanta, Georgia. IWOCL is an annual meeting of vendors, researchers and developers to promote the evolution and advancement of the OpenCL standard. The first workshop has a full program, including a full day of tutorials, followed by a full day of keynotes, papers, and panels. More information can be found here: http://iwocl.org.