April 29th, 2013
In this paper we evaluate the promise held by low-power GPUs for non-graphic workloads that arise in embedded systems. Towards this, we map and implement 5 benchmarks, which find utility in very different application domains, on an embedded GPU. Our results show that apart from accelerated performance, embedded GPUs are promising also because of their energy efficiency, which is an important design goal for battery-driven mobile devices. We show that adopting the same optimization strategies as those used for programming high-end GPUs might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs, such as limited or no user-defined memory, a small instruction set, and a limited number of registers, among others. We propose techniques to overcome such challenges, e.g., by distributing the workload between GPUs and multi-core CPUs, in the spirit of heterogeneous computation.
(Arian Maghazeh, Unmesh D. Bordoloi, Petru Eles and Zebo Peng: “General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?”, 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 15-18, 2013. [Preprint])
April 23rd, 2013
This webinar will present CUDA, focusing on practical aspects. The webinar will be conducted by APC, supported by NVIDIA. The webinar will be held Thursday, May 16, 2013 at 11:00-12:00 am Moscow time. Participants are asked to register at https://attendee.gotowebinar.com/register/8697482572284069888
April 14th, 2013
The LEAP (Low-energy application parallelism) conference hosts an interactive tutorial on applying formal analysis and verification techniques to OpenCL and CUDA kernels on Wed 22nd May 2013 in London, UK. Whether working on kernels for supercomputing, finance or mobile applications, this tutorial will help developers overcome the common pitfalls in GPU programming such as data races and barrier divergence. Using plenty of worked examples and demos to encourage interactive discussion, this session will highlight the practical benefits of using formal verification techniques to prove that kernels are free from defects. More information: http://www.leapconf.com
April 10th, 2013
This workshop, held in conjunction with RANLP 2013 on 12/13 September, aims to introduce contemporary work and to discuss novel methods for natural language processing at a large scale, and explore how the resulting technology and methods can be reused in applications both on the Web and in the physical world. More information, including submission instructions: https://sites.google.com/site/scanlp2013
April 10th, 2013
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory, partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended to other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively, on NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
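As an illustration of the operation described in this abstract (and not the authors' GPU implementation), the following pure-Python sketch applies a Kronecker product to a batch of operands without ever forming the Kronecker matrix. It relies on the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ) for a row-major vec. The function names `kron_action` and `batched_kron_action` and the sample matrices are hypothetical, chosen only for this example.

```python
def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def kron(A, B):
    """Explicit Kronecker product, built here only to check the result."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def flatten(X):
    """Row-major vec of a matrix."""
    return [x for row in X for x in row]


def kron_action(A, B, X):
    """Return A X B^T, which equals (A kron B) applied to row-major vec(X)."""
    return matmul(matmul(A, X), transpose(B))


def batched_kron_action(A, B, batch):
    """Apply the same pair (A, B) to every operand X in the batch."""
    return [kron_action(A, B, X) for X in batch]


# Hypothetical sample data: identical component matrices, varying operands.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
batch = [[[1, 0], [0, 1]], [[2, 1], [1, 3]]]

results = batched_kron_action(A, B, batch)

# Cross-check each result against the explicit Kronecker matrix-vector product.
K = kron(A, B)
for X, Y in zip(batch, results):
    v = flatten(X)
    expected = [sum(K[r][c] * v[c] for c in range(len(v)))
                for r in range(len(K))]
    assert flatten(Y) == expected
```

In this sketch each operand costs two small matrix products rather than one product with the much larger Kronecker matrix, which is one reason a matrix-free formulation is attractive for small component matrices.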
April 9th, 2013
The 1st International Workshop on OpenCL (IWOCL) will be held on May 13th/14th at Georgia Institute of Technology in Atlanta, Georgia. IWOCL is an annual meeting of vendors, researchers and developers to promote the evolution and advancement of the OpenCL standard. The first workshop has an exciting full program, including a full day of tutorials, followed by a full day of keynotes, papers, and panels. More information can be found here: http://iwocl.org.
April 1st, 2013
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.
(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
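To make the batched operation and the quoted rates concrete, here is a minimal reference sketch in pure Python. It is not the authors' interface or kernel: the contiguous back-to-back row-major layout, the function names, and the conventional count of 2n³ floating-point operations per n-by-n product are all assumptions made for this example.

```python
def batched_gemm(a, b, n, count):
    """Multiply `count` pairs of n-by-n matrices.

    Assumed layout: matrices are stored back to back in flat lists,
    each one in row-major order.
    """
    c = [0.0] * (count * n * n)
    for m in range(count):
        base = m * n * n  # offset of the m-th matrix in each flat list
        for i in range(n):
            for j in range(n):
                acc = 0.0
                for k in range(n):
                    acc += a[base + i * n + k] * b[base + k * n + j]
                c[base + i * n + j] = acc
    return c


def gemm_flops(n, count):
    """Operation count under the assumed 2*n^3 convention per product."""
    return 2 * n**3 * count


def implied_seconds(n, count, gflops):
    """Run time implied by a sustained rate given in GFlop/s."""
    return gemm_flops(n, count) / (gflops * 1e9)


# Two hypothetical 2-by-2 pairs stored back to back.
a = [1, 2, 3, 4, 0, 1, 1, 0]
b = [1, 0, 0, 1, 5, 6, 7, 8]
c = batched_gemm(a, b, n=2, count=2)
assert c == [1, 2, 3, 4, 7, 8, 5, 6]

# Times implied by the rates quoted in the abstract for 100,000 pairs.
t10 = implied_seconds(10, 100_000, 104)
t16 = implied_seconds(16, 100_000, 216)
print(f"size 10: {t10 * 1e3:.2f} ms, size 16: {t16 * 1e3:.2f} ms")
```

Under the 2n³ assumption, the quoted rates correspond to roughly 2 ms for the size-10 batch and roughly 4 ms for the size-16 batch, which illustrates why per-matrix overhead matters so much at these sizes.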
March 24th, 2013
Agent-Based Simulation Models are an increasingly popular tool for research and management in many fields such as ecology, economics and sociology. In some fields, such as social sciences, these models are seen as a key instrument to the generative approach, essential for understanding complex social phenomena. The relevance and effectiveness of Agent-Based Simulation Models has also recently been recognized in policy-making, biology, military simulations, control of mobile robots and economics.
Several frameworks have been recently developed and are active in this field. They range from GPU-manycore approaches to parallel and/or distributed simulation environments.
The key objective of this workshop is to bring together researchers who are interested in getting more performance from their simulations by using:
- synchronized, many-core simulations (e.g., GPUs)
- strongly coupled, parallel simulations (e.g., MPI)
- loosely coupled, distributed simulations (distributed heterogeneous setting)
For details please visit http://www.padabs.org/
March 19th, 2013
Northeastern University and Boston University, together with NVIDIA, are hosting a “GPUs Accelerating Research” Week next month.
On the first day, Wednesday 4/24, Northeastern is hosting a day of talks focused on how graphics processors are accelerating new and interesting areas of research in novel ways. The goal of this meeting is to provide a venue for both industry and academia to come together to discuss these innovations, and explore what lies ahead in GPU acceleration. Given that we have limited space in this one-day workshop, papers not selected for presentation at the workshop will have the option to present at a poster session to be held during the workshop. Please visit our website for registration and other details.
On the second day, Thursday 4/25, Boston University is hosting an all-day CUDA and OpenACC developer’s workshop. Prerequisites for getting the most out of this workshop are a basic understanding of C and the Linux command line. More details can be found here.
The use of graphic processor units (GPUs) has been recently proposed in computational electromagnetics to accelerate the solution of the electric field integral equation. In these methods, the linear systems obtained by using boundary elements are considered, and then an accelerated solution for a specific excitation is obtained. The existing studies are mostly focused on speeding up the filling time or the LU decomposition of that matrix. This limits the application to simple simulation scenarios if a fast method is not employed. In this paper, we propose a GPU acceleration for FFT-based integral equation solvers. We will investigate the operations involved in the solver, and we will motivate the use of GPUs. Results of numerical tests will be reported firstly on a perfect electric conductor sphere with different radii; then a realistic aircraft will be considered. We found that using GPUs for FFT-based methods allows achieving a reasonable speed-up.
(Elia A. Attardo, Matteo A. Francavilla, Francesca Vipiana and Giuseppe Vecchi: “Investigation on Accelerating FFT-Based Methods for the EFIE on Graphics Processors”, International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, to appear, Nov. 2012. [DOI])