Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, memory locality can often be improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61x faster for the examples tested.
(J. Filipovič, M. Madzin, J. Fousek, L. Matyska: “Optimizing CUDA Code By Kernel Fusion – Application on BLAS”, submitted to Parallel Computing, May 2013. [preprint])
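The idea of fusing a map kernel with a reduce kernel can be illustrated with a small CPU-side sketch. This is not the authors' compiler output; the function names (`axpy`, `dot`, `unfused`, `fused`) and the choice of an AXPY followed by a dot product are assumptions made only for illustration. The unfused version writes an intermediate vector and reads it back, while the fused version touches each input element once.

```python
def axpy(alpha, x, y):
    """Map 'kernel': returns alpha*x + y as a new vector (one full pass)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(u, v):
    """Reduce 'kernel': a second full pass over its two input vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def unfused(alpha, x, y, z):
    """Two separate kernels: the intermediate vector t is stored, then reloaded."""
    t = axpy(alpha, x, y)
    return dot(t, z)


def fused(alpha, x, y, z):
    """Fused map+reduce: the intermediate value never leaves the loop body."""
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        acc += (alpha * xi + yi) * zi
    return acc
```

On a GPU the intermediate vector would live in global memory, so the fused form saves one write and one read per element; that saving is what the abstract refers to as improved memory locality.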
Communicating data within the graphic processing unit (GPU) memory system and between the CPU and GPU are major bottlenecks in accelerating Krylov solvers on GPUs. Communication-avoiding techniques reduce the communication cost of Krylov subspace methods by computing several vectors of a Krylov subspace “at once,” using a kernel called “matrix powers.” The matrix powers kernel is implemented on a recent generation of NVIDIA GPUs and speedups of up to 5.7 times are reported for the communication-avoiding matrix powers kernel compared to the standard sparse matrix-vector multiplication (SpMV) implementation.
(M. Mehri Dehnavi, Y. El-Kurdi, J. Demmel and D. Giannacopoulos: “Communication-Avoiding Krylov Techniques on Graphic Processing Units”, IEEE Transactions on Magnetics 49(5):1749-1752, May 2013. [DOI])
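As a point of reference, the result that a matrix powers kernel produces is the set of vectors x, Ax, A²x, ..., Aˢx. The sketch below computes that set the straightforward way, with one SpMV per vector, which is the baseline the abstract compares against rather than the communication-avoiding algorithm itself. The CSR layout and the names `spmv_csr` and `matrix_powers_reference` are assumptions chosen for illustration.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x with A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def matrix_powers_reference(row_ptr, col_idx, vals, x, s):
    """Return [x, A x, A^2 x, ..., A^s x] using s separate SpMV calls.

    Each call re-reads the whole matrix; a communication-avoiding kernel
    aims to produce the same vectors with fewer passes over A.
    """
    basis = [list(x)]
    for _ in range(s):
        basis.append(spmv_csr(row_ptr, col_idx, vals, basis[-1]))
    return basis
```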
Developed in partnership with NVIDIA, this hands-on four-day course will teach students how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. Taught by Acceleware developers who bring real-world experience to the classroom, students will benefit from:
- Hands-on exercises and progressive lectures
- Individual laptops equipped with NVIDIA GPUs for student use
- Small class sizes to maximize learning
July 29 – August 1, 2013, San Jose, CA, USA. More information: http://www.acceleware.com/training/913
This webinar will present CUDA, focusing on practical aspects. The webinar will be conducted by APC, supported by NVIDIA. The webinar will be held Thursday, May 16, 2013 at 11:00-12:00 Moscow time. Participants are asked to register at https://attendee.gotowebinar.com/register/8697482572284069888
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. This is partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended to other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively, on an NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
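The "matrix-free" Kronecker product action mentioned above rests on a standard identity: with row-major flattening, (A ⊗ B) vec(X) = vec(A X Bᵀ), so the large matrix A ⊗ B never has to be formed. The sketch below checks that identity on plain Python lists. It is not the paper's GPU implementation, and all function names are assumptions introduced for illustration.

```python
def matmul(A, B):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(r) for r in zip(*M)]


def kron(A, B):
    """Explicit Kronecker product A (x) B; its size grows as the product of the sizes."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def kron_apply_explicit(A, B, X):
    """Form A (x) B and multiply it by the row-major flattening of X."""
    x = [v for row in X for v in row]
    return [sum(k * v for k, v in zip(krow, x)) for krow in kron(A, B)]


def kron_apply_matrix_free(A, B, X):
    """Same result via two small matrix products: flatten(A X B^T)."""
    Y = matmul(matmul(A, X), transpose(B))
    return [v for row in Y for v in row]
```

For n-by-n matrices the explicit route builds an n²-by-n² matrix, whereas the matrix-free route needs only two n-by-n products, which is why it maps naturally onto small GEMM calls.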
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.
(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
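To make the batched setting concrete, the sketch below shows a reference batched GEMM (one independent product per matrix pair) and a helper that counts floating-point operations. The 2n³ flop count per n-by-n product is the conventional figure and is an assumption here, since the abstract does not state how its GFlop/s numbers are counted. The names `batched_gemm`, `gemm_flops` and `seconds_at_rate` are hypothetical.

```python
def gemm(A, B):
    """Dense product of two small matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def batched_gemm(As, Bs):
    """Multiply matching pairs independently: Cs[i] = As[i] * Bs[i]."""
    return [gemm(A, B) for A, B in zip(As, Bs)]


def gemm_flops(n, batch):
    """Assumed flop count: n^3 multiplies plus n^3 adds per n-by-n product."""
    return 2 * n ** 3 * batch


def seconds_at_rate(flops, gflops_per_s):
    """Time implied by a sustained rate given in GFlop/s."""
    return flops / (gflops_per_s * 1e9)
```

Under that counting assumption, 100,000 pairs of size 16 amount to 819,200,000 operations, which at the quoted 216 GFlop/s corresponds to about 3.8 ms for the whole batch.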
Northeastern University and Boston University, together with NVIDIA, are hosting a “GPUs Accelerating Research” Week next month.
On the first day, Wednesday 4/24, Northeastern is hosting a day of talks focused on how graphics processors are accelerating new and interesting areas of research in novel ways. The goal of this meeting is to provide a venue for both industry and academia to come together to discuss these innovations, and explore what lies ahead in GPU acceleration. Given that we have limited space in this one-day workshop, papers not selected for presentation at the workshop will have the option to present at a poster session to be held during the workshop. Please visit our website for registration and other details.
On the second day, Thursday 4/25, Boston University is hosting an all-day CUDA and OpenACC developer’s workshop. Prerequisites for getting the most out of this workshop are a basic understanding of C and the Linux command line. More details can be found here.
The GPU Debayer software developed by Fastvideo can be used for demosaicing of raw 8-bit Bayer images to full-color 24-bit RGB format. The application employs the HQLI and DFPD algorithms and is tuned for NVIDIA GPUs, which results in very fast conversion, e.g., only 1.25 ms for Full HD image demosaicing on GeForce GTX 580. The software is freely available.
Due to ever-increasing demand for fast processing of large analytical workloads, main memory column-oriented databases have attracted a lot of attention in recent years. In-memory databases eliminate the disk I/O barrier by storing the data in memory. In addition, they utilize a column-oriented data layout to offer a multi-core-friendly and memory-bandwidth-efficient processing scheme. Meanwhile, graphics processing units (GPUs) have recently emerged as powerful tools for general high-performance computing. GPUs are affordable and energy-efficient devices that deliver massive computational power by utilizing a large number of cores and a high memory bandwidth. GPUs can be used as co-processors for query acceleration of in-memory databases. One of the main bottlenecks in GPU acceleration of in-memory databases is the need for data to be transferred back and forth between GPU memory and RAM through a low-bandwidth PCIe bus. To address this problem, in this study, a new generation of in-memory databases is proposed that, instead of keeping data in main memory, stores it in GPU device memory.
(Pedram Ghodsnia: “An In-GPU-Memory Column-Oriented Database for Processing Analytical Workloads”, VLDB 2012 PhD Workshop, Istanbul, Turkey, August 2012. [PDF])
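The bandwidth argument for a column-oriented layout can be seen in a small sketch: an analytical query that aggregates one attribute only needs to scan the columns it references, not every field of every record. The table contents, column names and helper functions below are hypothetical and are not taken from the paper.

```python
# Row-oriented storage: each record keeps all of its attributes together.
ROWS = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.0},
    {"id": 3, "region": "EU", "amount": 30.0},
]


def to_columns(rows):
    """Convert a list of records into one contiguous list per attribute."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_where(columns, value_col, filter_col, wanted):
    """SELECT SUM(value_col) WHERE filter_col = wanted, touching two columns only."""
    return sum(
        v for v, f in zip(columns[value_col], columns[filter_col]) if f == wanted
    )


COLUMNS = to_columns(ROWS)
EU_TOTAL = sum_where(COLUMNS, "amount", "region", "EU")
```

Here the `id` column is never read by the query, which is the kind of saving that matters when the scan is limited by memory bandwidth.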
The following new webinars about NVIDIA Tesla K20 have been announced. During these live webinars, developers will be able to get answers directly from the presenters.