The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.
We address both issues, programmability and portable performance, in this work: On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator, which produces device-specific OpenCL kernels at runtime and is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.
(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar ’13), 2013 [PDF].)
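The idea of generating device-specific OpenCL kernels at runtime, as described in the abstract above, can be illustrated with a minimal sketch. This is not the generator from the paper: the function name `generate_axpy_kernel`, the template, and the single tuning parameter `vector_width` are hypothetical and chosen only for illustration. The sketch produces OpenCL C source text and does not compile or launch it.

```python
# Hypothetical template for y = alpha * x + y; placeholders are filled at runtime.
KERNEL_TEMPLATE = """\
__kernel void {name}(const float alpha,
                     __global const {dtype}* x,
                     __global {dtype}* y,
                     const unsigned int n)
{{
    // n counts {dtype} elements, not scalar floats
    for (unsigned int i = get_global_id(0); i < n; i += get_global_size(0))
        y[i] = alpha * x[i] + y[i];
}}
"""


def generate_axpy_kernel(vector_width=1):
    """Return (kernel_name, source) specialized for one vector width."""
    # Widths accepted by this sketch; 1 means plain scalar floats.
    if vector_width not in (1, 2, 4, 8, 16):
        raise ValueError("vector_width must be 1, 2, 4, 8, or 16")
    dtype = "float" if vector_width == 1 else f"float{vector_width}"
    name = f"axpy_v{vector_width}"
    return name, KERNEL_TEMPLATE.format(name=name, dtype=dtype)


# A tuner could generate one variant per device and keep the fastest.
scalar_name, scalar_src = generate_axpy_kernel(1)
vec4_name, vec4_src = generate_axpy_kernel(4)
print(vec4_src)
```

In a real setting the generated string would be handed to the OpenCL runtime for compilation on the target device, and the best-performing parameter choice would be recorded per device.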
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers’ optimizations, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels’ parameter space using a search harness.
(Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra, “From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming”, Parallel Computing 38(8):391–407, Aug. 2012. [DOI] [early techreport])
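The auto-tuning proposal in the abstract above, exploring a kernel's parameter space with a search harness, can be sketched in a few lines. The example below is an assumption-laden stand-in, not the authors' harness: it tunes the tile size of a pure-Python blocked matrix multiply on the CPU by exhaustive timing, whereas a real harness would time OpenCL kernels on the target GPU. The names `blocked_matmul` and `autotune` are hypothetical.

```python
import random
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time every candidate tile size and return (best_block, timings)."""
    rng = random.Random(0)  # fixed seed so every candidate sees the same input
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the fastest of the repeated runs
    return min(timings, key=timings.get), timings


best_block, timings = autotune(n=32, candidates=[4, 8, 16, 32])
print("fastest tile size on this machine:", best_block)
```

The winning tile size depends on the machine running the search, which is the point the abstract makes about performance not being portable: the search has to be repeated per device.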
Chai is a new managed platform for GPGPU. It is a free and open source clean room workalike of the PeakStream platform. While not production-ready, the just-released alpha version is able to compile and run non-trivial PeakStream demo code on AMD and NVIDIA GPUs (e.g. conjugate gradient).
Chai combines an application virtual machine, garbage collection, an auto-tuning JIT compiler, and a high-level array programming language implemented as an embedded domain-specific language in C++. The JIT back-end uses expectation-maximization to auto-tune and generate vectorized OpenCL. The JIT includes auto-tuned model families for GEMM and GEMV. Although originally developed for AMD GPUs, these parameterized kernel families also generalize to NVIDIA GPUs.
This paper by Takizawa et al. at Tohoku University describes a programming framework named Stream Programming with Runtime Auto-Tuning (SPRAT) that combines a high-level programming language with runtime processor selection. Today, a commodity PC can be seen as a hybrid computing system equipped with two different kinds of processors, i.e. CPU and GPU. Since the advantages of GPUs in performance and power efficiency depend strongly on the system configuration and on the data size determined at run time, a programmer cannot always know which processor should be used to execute a certain kernel. Therefore, this paper describes the SPRAT framework, which dynamically selects an appropriate processor so as to improve energy efficiency. The evaluation results clearly indicate that selecting the processor at run time for each kernel execution, based on the given data streams, is promising for energy-aware computing on a hybrid computing system. (SPRAT: Runtime Processor Selection for Energy-aware Computing. Hiroyuki Takizawa, Katuto Sato, and Hiroaki Kobayashi. To appear in Proceedings of IEEE Cluster 2008 (the 3rd international workshop on automatic performance tuning).)
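As a rough illustration of run-time processor selection for energy efficiency, the sketch below picks the processor with the lowest estimated energy for a given data size. It is not SPRAT's actual selection algorithm, which the summary above does not detail. The cost model (energy = power × time, with a fixed per-call overhead) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    throughput: float        # elements processed per second (hypothetical)
    power_watts: float       # average power draw while busy (hypothetical)
    fixed_overhead_s: float  # per-call launch/transfer cost in seconds (hypothetical)

    def estimated_time(self, n_elements):
        return self.fixed_overhead_s + n_elements / self.throughput

    def estimated_energy(self, n_elements):
        # Simple model: energy in joules = power * time
        return self.power_watts * self.estimated_time(n_elements)


def select_processor(processors, n_elements):
    """Return the processor with the lowest estimated energy for this data size."""
    return min(processors, key=lambda p: p.estimated_energy(n_elements))


cpu = Processor("cpu", throughput=1e8, power_watts=60.0, fixed_overhead_s=0.0)
gpu = Processor("gpu", throughput=2e9, power_watts=150.0, fixed_overhead_s=0.005)

small = select_processor([cpu, gpu], 10_000)       # overhead dominates on the GPU
large = select_processor([cpu, gpu], 100_000_000)  # GPU throughput pays off
print(small.name, large.name)
```

With these made-up parameters the small input is cheaper on the CPU and the large input is cheaper on the GPU, which mirrors the observation that the best processor depends on the data size known only at run time.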