May 14th, 2013
May 11th, 2013
We present a novel, Linear Programming (LP) based scheduling algorithm that exploits heterogeneous multi-core architectures such as CPUs and GPUs to accelerate a wide variety of proximity queries. To represent complicated performance relationships between heterogeneous architectures and different computations of proximity queries, we propose a simple, yet accurate model that measures the expected running time of these computations. Based on this model, we formulate an optimization problem that minimizes the largest time spent on computing resources, and propose a novel, iterative LP-based scheduling algorithm. Since our method is general, we are able to apply our method into various proximity queries used in five different applications that have different characteristics. Our method achieves an order of magnitude performance improvement by using four different GPUs and two hexa-core CPUs over using a hexa-core CPU only. Unlike prior scheduling methods, our method continually improves the performance, as we add more computing resources. Also, our method achieves much higher performance improvement compared with prior methods as heterogeneity of computing resources is increased. Moreover, for one of tested applications, our method achieves even higher performance than a prior parallel method optimized manually for the application. We also show that our method provides results that are close (e.g., 75%) to the performance provided by a conservative upper bound of the ideal throughput. These results demonstrate the efficiency and robustness of our algorithm that have not been achieved by prior methods.
(Duksu Kim, Jinkyu Lee, Junghwan Lee, Insik Shin, John Kim and Sung-eui Yoon: “Scheduling in Heterogeneous Computing Environments for Proximity Queries”, IEEE Transactions on Visualization and Computer Graphics, to appear, 2013. [WWW])
May 11th, 2013
Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, the memory locality can be often improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61x faster for the examples tested.
(J. Filipovič, M. Madzin, J. Fousek, L. Matyska: “Optimizing CUDA Code By Kernel Fusion – Application on BLAS”, submitted to Parallel Computing, May 2013. [preprint])
April 29th, 2013
Communicating data within the graphic processing unit (GPU) memory system and between the CPU and GPU are major bottlenecks in accelerating Krylov solvers on GPUs. Communication-avoiding techniques reduce the communication cost of Krylov subspace methods by computing several vectors of a Krylov subspace “at once,” using a kernel called “matrix powers.” The matrix powers kernel is implemented on a recent generation of NVIDIA GPUs and speedups of up to 5.7 times are reported for the communication-avoiding matrix powers kernel compared to the standards prase matrix vector multiplication (SpMV) implementation.
(M. Mehri Dehnavi, Y. El-Kurdi, J. Demmel and D. Giannacopoulos: “Communication-Avoiding Krylov Techniques on Graphic Processing Units”, IEEE Transactions on Magnetics 49(5):1749-1752, May 2013. [DOI])
April 10th, 2013
In this paper we evaluate the promise held by lowpower GPUs for non-graphic workloads that arise in embedded systems. Towards this, we map and implement 5 benchmarks, that find utility in very different application domains, to an embedded GPU. Our results show that apart from accelerated performance, embedded GPUs are promising also because of their energy efficiency which is an important design goal for battery-driven mobile devices. We show that adopting the same optimization strategies as those used for programming high-end GPUs might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs, such as, limited or no user-defined memory, small instruction-set, limited number of registers, among others. We propose techniques to overcome such challenges, e.g., by distributing the workload between GPUs and multi-core CPUs, similar to the spirit of heterogeneous computation.
(Arian Maghazeh, Unmesh D. Bordoloi, Petru Eles and Zebo Peng: “General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?”, 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 15-18, 2013. [Preprint])
April 9th, 2013
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. Partly this is because a simple GEMM based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended for other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively on NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
March 19th, 2013
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.
(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
March 13th, 2013
The use of graphic processor units (GPUs) has been recently proposed in computational electromagnetics to accelerate the solution of the electric field integral equation. In these methods, the linear systems obtained by using boundary elements are considered, and then an accelerated solution for a specific excitation is obtained. The existing studies are mostly focused on speeding up the filling time or the LU decomposition of that matrix. This limits the application to simple simulation scenarios if a fast method is not employed. In this paper, we propose a GPU acceleration for FFT-based integral equation solvers. We will investigate the operations involved in the solver, and we will motivate the use of GPUs. Results of numerical tests will be reported firstly on a perfect electric conductor sphere with different radii; then a realistic aircraft will be considered. We found that using GPUs for FFT-based methods allows achieving a reasonable speed-up.
(Elia A. Attardo1, Matteo A. Francavilla, Francesca Vipiana and Giuseppe Vecchi: “Investigation on Accelerating FFT-Based Methods for the EFIE on Graphics Processors”, International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, to appear, Nov. 2012. [DOI])
March 12th, 2013
Due to ever increasing demand for fast processing of large analytical workloads, main memory column-oriented databases have attracted a lot of attention in recent years. In-memory databases eliminate the disk I/O barrier by storing the data in memory. In addition, they utilize a column-oriented data layout to offer a multi-core-friendly and memory-bandwidth-efficient processing scheme. On the other hand, recently, graphics processing units (GPUs) have emerged as powerful tools for general high-performance computing. GPUs are affordable and energy-efficient devices that deliver a massive computational power by utilizing a large number of cores and a high memory bandwidth. GPUs can be used as co-processors for query acceleration of in-memory databases. One of the main bottlenecks in GPU-acceleration of in-memory databases is the need for data to be transferred back and forward between GPU memory and RAM through a low-bandwidth PCIe bus. To address this problem, in this study, a new generation of in-memory databases is proposed that instead of keeping data in main memory stores it in GPU device memory.
(Pedram Ghodsnia: “An In-GPU-Memory Column-Oriented Database for Processing Analytical Workloads”, VLDB 2012 PhD Workshop, Istanbul, Turkey, August 2012. [PDF])
March 5th, 2013
Recently, general-purpose computing on graphics processing units (GPGPU) has been enabled on mobile devices thanks to the emerging heterogeneous programming models such as OpenCL. The capability of GPGPU on mobile devices opens a new era for mobile computing and can enable many computationally demanding computer vision algorithms on mobile devices. As a case study, this paper proposes to accelerate an exemplar-based inpainting algorithm for object removal on a mobile GPU using OpenCL. We discuss the methodology of exploring the parallelism in the algorithm as well as several optimization techniques. Experimental results demonstrate that our optimization strategies for mobile GPUs have significantly reduced the processing time and make computationally intensive computer vision algorithms feasible for a mobile device. To the best of the authors’ knowledge, this work is the first published implementation of general-purpose computing using OpenCL on mobile GPUs.
(Guohui Wang, Yingen Xiong, Jay Yun and Joseph R. Cavallaro: “Accelerating Computer Vision Algorithms Using OpenCL on the Mobile GPU – A Case Study”, International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, May 2013, to appear. [PDF])
As the word “UnConventional” in the title suggests, the workshop focuses on hardware or platforms used for HPC, which were not intended for HPC in the first place. Reasons could be raw computing power, good performance per watt, or low cost in general. To address this unconventional hardware, often, new programming approaches and paradigms are required to make best use of it. A second focus of the workshop is on innovative, (yet) unconventional new programming models. To this end, UCHPC tries to capture solutions for HPC which are unconventional today but could become conventional and significant tomorrow, and thus provide a glimpse into the eventual future of HPC. The goal of the workshop is to present latest research in how hardware and software (yet) unconventional for HPC is or can be used to reach goals such as best performance per watt. UCHPC also covers according programming models, compiler techniques, and tools.
UCHPC is held in conjunction with Euro-Par 2013, August 26 – August 30, Aachen, Germany. More information,including the full call for papers, submission instructions and important dates: uchpc13.cs.tum.edu
Page 1 of 5312345...102030...»Last »