Developed in partnership with NVIDIA, this hands-on four day course will teach students how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. Taught by Acceleware developers who bring real world experience to the class room, students will benefit from:
- Hands-on exercises and progressive lectures
- Individual laptops equipped with NVIDIA GPUs for student use
- Small class sizes to maximize learning
July 29 – August 1, 2013, San Jose, CA, USA. More information: http://www.acceleware.com/training/913
This webinar will present CUDA, focusing on practical aspects. The webinar will be conducted by APC, supported by NVIDIA. The webinar will be held Thursday, May 16, 2013 at 11:00-12:00 am Moscow time. Participants are asked to register at https://attendee.gotowebinar.com/register/8697482572284069888
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. Partly this is because a simple GEMM based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended for other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively on NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.
(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
Northeastern University and Boston University, together with NVIDIA, are hosting a “GPUs Accelerating Research” Week next month.
On the first day, Wednesday 4/24, Northeastern is hosting a day of talks focused on how graphics processors are accelerating new and interesting areas of research in novel ways. The goal of this meeting is to provide a venue for both industry and academia to come together to discuss these innovations, and explore what lies ahead in GPU acceleration. Given that we have limited space in this one-day workshop, papers not selected for presentation at the workshop will have the option to present at a poster session to be held during the workshop. Please visit our website for registration and other details.
On the second day, Thursday 4/25, Boston University is hosting an all-day CUDA and OpenACC developer’s workshop. Prerequisites for getting the most out of this workshop are a basic understanding of C and the Linux command line. More details can be found here.
The GPU Debayer software developed by Fastvideo can be used for demosaicing of raw 8-bit Bayer images to full-color 24-bit RGB format. The application employs the HQLI and DFPD algorithms and is tuned for NVIDIA GPUs, which results in very fast conversion, e.g., only 1.25 ms for Full HD image demosaicing on GeForce GTX 580. The software is freely available.
Due to ever increasing demand for fast processing of large analytical workloads, main memory column-oriented databases have attracted a lot of attention in recent years. In-memory databases eliminate the disk I/O barrier by storing the data in memory. In addition, they utilize a column-oriented data layout to offer a multi-core-friendly and memory-bandwidth-efficient processing scheme. On the other hand, recently, graphics processing units (GPUs) have emerged as powerful tools for general high-performance computing. GPUs are affordable and energy-efficient devices that deliver a massive computational power by utilizing a large number of cores and a high memory bandwidth. GPUs can be used as co-processors for query acceleration of in-memory databases. One of the main bottlenecks in GPU-acceleration of in-memory databases is the need for data to be transferred back and forward between GPU memory and RAM through a low-bandwidth PCIe bus. To address this problem, in this study, a new generation of in-memory databases is proposed that instead of keeping data in main memory stores it in GPU device memory.
(Pedram Ghodsnia: “An In-GPU-Memory Column-Oriented Database for Processing Analytical Workloads”, VLDB 2012 PhD Workshop, Istanbul, Turkey, August 2012. [PDF])
The following new webinars about NVIDIA Tesla K20 have been announced. During these live webinars, developers will be able to get answers directly from the presenters.
PARALUTION is a library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to expolit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under GPL. Key features are:
- OpenMP, CUDA and OpenCL support
- No special hardware/library requirement
- Portable code and results across all hardware
- Many sparse matrix formats
- Various iterative solvers/preconditioners
- Generic and robust design
- Plug-in for the finite element package Deal.II
- Documentation: user manual (pdf), reports, doxygen
More information, including documentation and case studies, is available at http://www.paralution.com.
A free, pre-alpha release of Lab4241’s GPGPU profiler is now available at www.lab4241.com. It provides source-code-line performance profiling for C or C++ code and CUDA kernels in a non-intrusive way. The profiler enables the developer to a seamless evaluation of used GPU resources (execution counts, memory access, branch diversions, etc.) per source-line, along with result evaluation in a simple, intuitive GUI, similar as with known CPU profilers like Quantify or valgrind.