Acceleware 4 Day CUDA Course – San Jose

May 5th, 2013

Developed in partnership with NVIDIA, this hands-on four-day course will teach students how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. The course is taught by Acceleware developers who bring real-world experience to the classroom. Students will benefit from:

  • Hands-on exercises and progressive lectures
  • Individual laptops equipped with NVIDIA GPUs for student use
  • Small class sizes to maximize learning

July 29 – August 1, 2013, San Jose, CA, USA. More information:

General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?

April 29th, 2013


In this paper we evaluate the promise held by low-power GPUs for non-graphics workloads that arise in embedded systems. To this end, we map and implement five benchmarks, drawn from very different application domains, on an embedded GPU. Our results show that, apart from accelerated performance, embedded GPUs are also promising because of their energy efficiency, which is an important design goal for battery-driven mobile devices. We show that adopting the same optimization strategies as those used for programming high-end GPUs might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs, such as limited or no user-defined memory, a small instruction set, and a limited number of registers, among others. We propose techniques to overcome such challenges, e.g., by distributing the workload between GPUs and multi-core CPUs, in the spirit of heterogeneous computation.

(Arian Maghazeh, Unmesh D. Bordoloi, Petru Eles and Zebo Peng: “General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?”, 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 15-18, 2013. [Preprint])

Introduction to the CUDA parallel computing architecture webinar (in Russian)

April 29th, 2013

This webinar will present CUDA, focusing on practical aspects. The webinar will be conducted by APC, supported by NVIDIA. It will be held on Thursday, May 16, 2013, from 11:00 to 12:00 Moscow time. Participants are asked to register at

Tutorial on Formal Analysis Techniques for OpenCL & CUDA GPU Kernels

April 23rd, 2013

The LEAP (Low-Energy Application Parallelism) conference hosts an interactive tutorial on applying formal analysis and verification techniques to OpenCL and CUDA kernels on Wednesday, 22nd May 2013 in London, UK. Whether working on kernels for supercomputing, finance or mobile applications, developers will learn from this tutorial how to overcome common pitfalls in GPU programming such as data races and barrier divergence. Using plenty of worked examples and demos to encourage interactive discussion, this session will highlight the practical benefits of using formal verification techniques to prove that kernels are free from defects. More information:

CFP: Workshop on Scalability in Natural Language Processing

April 14th, 2013

This workshop, held in conjunction with RANLP 2013 on 12/13 September, aims to introduce contemporary work and to discuss novel methods for natural language processing at a large scale, and explore how the resulting technology and methods can be reused in applications both on the Web and in the physical world. More information, including submission instructions:

Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

April 10th, 2013


We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. This is partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended for other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively, on NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.

(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
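To illustrate why a GEMM routine can carry out a Kronecker product action without ever forming the Kronecker product, here is a minimal pure-Python sketch. It is not the paper's GPU implementation; the function names (`kron_apply`, `batched_kron_apply`) and the row-major flattening convention are assumptions made for this example. It relies on the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ), where vec stacks the rows of X.

```python
# Illustrative sketch only: plain Python lists, hypothetical function names.

def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def flatten(m):
    """Row-major vec: stack the rows of m into one list."""
    return [v for row in m for v in row]

def kron(a, b):
    """Explicit Kronecker product, used here only as a reference check."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def kron_apply(a, b, x):
    """Matrix-free action of (a kron b) on row-major vec(x): two small GEMMs."""
    return matmul(matmul(a, x), transpose(b))

def batched_kron_apply(a, b, xs):
    """Same component matrices a and b, a different operand x per batch entry."""
    return [kron_apply(a, b, x) for x in xs]

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[0, 1], [1, 1]]
    xs = [[[1, 0], [2, 5]], [[3, 1], [0, 2]]]
    k = kron(a, b)
    for x, y in zip(xs, batched_kron_apply(a, b, xs)):
        # Reference: multiply the explicit 4x4 Kronecker matrix by vec(x).
        reference = [sum(kij * v for kij, v in zip(row, flatten(x))) for row in k]
        assert flatten(y) == reference
```

The matrix-free form needs only the small component matrices and the operand, which is the situation the abstract describes: identical components, varying operands across the batch.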

1st International Workshop on OpenCL (IWOCL)

April 10th, 2013

The 1st International Workshop on OpenCL (IWOCL) will be held on May 13th/14th at the Georgia Institute of Technology in Atlanta, Georgia. IWOCL is an annual meeting of vendors, researchers and developers to promote the evolution and advancement of the OpenCL standard. The first workshop has an exciting full program, including a full day of tutorials, followed by a full day of keynotes, papers, and panels. More information can be found here:

Fast GEMM for multiple small matrices on NVIDIA GPUs

April 9th, 2013


We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.

(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
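As a rough illustration of what "GEMM for multiple small matrices" computes and how the quoted GFlop/s figures relate to run time, here is a minimal pure-Python sketch. It is not the authors' interface; the names `batched_gemm`, `gemm_flops` and `implied_seconds` are hypothetical, and the flop count assumes the conventional 2n³ operations per n×n product, which the abstract does not state.

```python
# Illustrative sketch only: hypothetical names, conventional 2*n^3 flop count assumed.

def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def batched_gemm(a_batch, b_batch):
    """Multiply independent matrix pairs: C[i] = A[i] * B[i] for every i."""
    return [matmul(a, b) for a, b in zip(a_batch, b_batch)]

def gemm_flops(n, batch):
    """Flops for `batch` products of n-by-n matrices (n^3 multiplies + n^3 adds each)."""
    return 2 * n ** 3 * batch

def implied_seconds(n, batch, gflops):
    """Run time implied by a sustained rate of `gflops` GFlop/s."""
    return gemm_flops(n, batch) / (gflops * 1e9)

if __name__ == "__main__":
    a_batch = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
    b_batch = [[[1, 0], [0, 1]], [[5, 6], [7, 8]]]
    print(batched_gemm(a_batch, b_batch))
    # Rates and sizes taken from the abstract: 100,000 pairs of size 10 and 16.
    for n, rate in [(10, 104.0), (16, 216.0)]:
        print(n, gemm_flops(n, 100_000), implied_seconds(n, 100_000, rate))
```

Under that assumed flop count, the quoted rates would correspond to roughly 1.9 ms for the size-10 batch and 3.8 ms for the size-16 batch; these are back-of-the-envelope estimates, not timings reported in the paper.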

CfP: Workshop on Parallel and Distributed Agent-Based Simulations (PADABS 2013)

April 1st, 2013

Agent-Based Simulation Models are an increasingly popular tool for research and management in many fields such as ecology, economics and sociology. In some fields, such as social sciences, these models are seen as a key instrument of the generative approach, essential for understanding complex social phenomena. The relevance and effectiveness of Agent-Based Simulation Models have also recently been recognized in policy-making, biology, military simulations, control of mobile robots and economics.

Several frameworks have been recently developed and are active in this field. They range from GPU-manycore approaches to parallel and/or distributed simulation environments.

The key objective of this workshop is to bring together researchers who are interested in getting more performance out of their simulations by using:

  • synchronized, many-core simulations (e.g., GPUs)
  • strongly coupled, parallel simulations (e.g., MPI)
  • loosely coupled, distributed simulations (distributed heterogeneous setting)

For details please visit

“GPUs Accelerating Research” Week at Northeastern and BU

March 24th, 2013

Northeastern University and Boston University, together with NVIDIA, are hosting a “GPUs Accelerating Research” Week next month.

On the first day, Wednesday 4/24, Northeastern is hosting a day of talks focused on how graphics processors are accelerating new and interesting areas of research in novel ways. The goal of this meeting is to provide a venue for both industry and academia to come together to discuss these innovations, and explore what lies ahead in GPU acceleration. Given that we have limited space in this one-day workshop, authors of papers not selected for presentation at the workshop will have the option to present at a poster session to be held during the workshop. Please visit our website for registration and other details.

On the second day, Thursday 4/25, Boston University is hosting an all-day CUDA and OpenACC developer’s workshop. Prerequisites for getting the most out of this workshop are a basic understanding of C and the Linux command line. More details can be found here.
