Developed in partnership with NVIDIA, this hands-on four-day course will teach students how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. Taught by Acceleware developers who bring real-world experience to the classroom, students will benefit from:
- Hands-on exercises and progressive lectures
- Individual laptops equipped with NVIDIA GPUs for student use
- Small class sizes to maximize learning
July 29 – August 1, 2013, San Jose, CA, USA. More information: http://www.acceleware.com/training/913
This class teaches the fundamentals of parallel computing with the GPU and the CUDA programming environment. Examples are based on a series of image processing algorithms, such as those in Photoshop or Instagram. You can program and run the assignments on high-end GPUs even if you don’t own one yourself. The course started on Monday, February 4, 2013, so there is still time to join. More information and enrollment: https://www.udacity.com/course/cs344.
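To give a flavour of the material, the exercises centre on per-pixel image operations of roughly the following shape (a minimal sketch; the kernel name and signature are illustrative, not taken from the course):

```cuda
// Hypothetical example of a per-pixel image-processing kernel:
// converts an RGBA image to greyscale, one thread per pixel.
__global__ void rgbaToGreyscale(const uchar4* rgba, unsigned char* grey,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p = rgba[y * width + x];
    // Standard luminance weights; every output pixel is independent,
    // so the problem maps naturally onto thousands of concurrent GPU threads.
    grey[y * width + x] =
        (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
}
```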
Acceleware has recently announced four courses on parallel programming:
- OpenCL on AMD APUs and CPUs: Jan 29 to Feb 1, 2013, Chicago, IL and Apr 9 to Apr 12, 2013, Los Angeles, CA
- 4 Day CUDA Course with an Oil and Gas focus: Mar 12 to Mar 15, 2013, Houston, TX
- 4 Day C++ AMP Training: Apr 23 to Apr 26, 2013, Seattle, WA
More information is available on the courses’ webpages.
Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])
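The following is a minimal sketch of the policy-based idiom the paper describes (all names are illustrative, not taken from the authors' library). The tuning knobs travel as compile-time template parameters, so a reusable device subroutine can be specialized, and co-optimized with its enclosing kernel, for a particular GPU, problem size, and data type; the ITEMS_PER_THREAD parameter is the granularity-coarsening knob mentioned in the abstract.

```cuda
// Compile-time tuning policy (illustrative names).
template <int THREADS, int ITEMS>
struct ReducePolicy
{
    enum {
        BLOCK_THREADS    = THREADS,          // assumed to be a power of two here
        ITEMS_PER_THREAD = ITEMS,
        TILE_ITEMS       = THREADS * ITEMS
    };
};

// Reusable, tunable device subroutine: block-wide sum of one tile of inputs.
template <typename Policy>
__device__ float BlockReduceTile(const float* tile)
{
    __shared__ float smem[Policy::BLOCK_THREADS];

    // Granularity coarsening: each thread privately accumulates
    // ITEMS_PER_THREAD inputs before any inter-thread communication.
    float sum = 0.0f;
    for (int i = 0; i < Policy::ITEMS_PER_THREAD; ++i)
        sum += tile[threadIdx.x + i * Policy::BLOCK_THREADS];

    // Cooperative tree reduction: its cost scales with BLOCK_THREADS
    // (the processor width), not with the tile size.
    smem[threadIdx.x] = sum;
    __syncthreads();
    for (int stride = Policy::BLOCK_THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    return smem[0];
}

// The enclosing kernel picks a specialization suited to the target GPU,
// e.g. ReducePolicy<128, 8> on one microarchitecture, ReducePolicy<256, 4>
// on another. Input length is assumed to be a multiple of TILE_ITEMS.
template <typename Policy>
__global__ void ReduceKernel(const float* in, float* block_sums)
{
    float total = BlockReduceTile<Policy>(in + blockIdx.x * Policy::TILE_ITEMS);
    if (threadIdx.x == 0) block_sums[blockIdx.x] = total;
}
```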
In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and four primary use cases where PT could be useful: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show that the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.
(Kshitij Gupta, Jeff A. Stuart and John D. Owens: “A Study of Persistent Threads Style GPU Programming for GPGPU Workloads”, Proceedings of Innovative Parallel Computing, May 2012. [WWW])
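A minimal sketch of the two styles contrasted in the paper (illustrative code, not taken from the paper):

```cuda
// nonPT: launch one virtual thread per work item; the hardware scheduler
// decides how blocks are mapped onto the machine.
__global__ void nonPT_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// PT: launch only as many threads as the GPU can keep resident; each thread
// stays alive and repeatedly pulls work from a global queue, giving the
// programmer control over scheduling, load balancing, and producer-consumer
// locality.
__global__ void PT_kernel(const float* in, float* out, int n, int* work_counter)
{
    while (true) {
        // One item per iteration; a real implementation would typically grab
        // a chunk per warp or block to reduce atomic traffic.
        int i = atomicAdd(work_counter, 1);
        if (i >= n) return;          // queue drained: the persistent thread exits
        out[i] = in[i] * 2.0f;
    }
}
```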
In this paper, we revisit the design of synchronization primitives—specifically barriers, mutexes, and semaphores—and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention.
(Jeff A. Stuart and John D. Owens: “Efficient Synchronization Primitives for GPUs”, submitted October 2011. [ARXIV]).
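For concreteness, here is a minimal sketch of a spin-waiting mutex on the GPU (an illustration of the general idea, not the paper's implementation); note that every failed acquisition attempt is an atomic access, which is precisely the cost the paper recommends minimizing:

```cuda
// lock == 0 means free, 1 means held.
__device__ void mutex_lock(int* lock)
{
    // Spin until the lock transitions from 0 to 1.
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
    __threadfence();   // make the previous owner's writes visible to this thread
}

__device__ void mutex_unlock(int* lock)
{
    __threadfence();   // publish this owner's writes before releasing
    atomicExch(lock, 0);
}

// Typical use: one representative thread per block takes the lock, which
// avoids intra-warp livelock and keeps the number of atomic operations low.
__global__ void critical_section_kernel(int* lock, int* shared_counter)
{
    if (threadIdx.x == 0) {
        mutex_lock(lock);
        *shared_counter += 1;      // protected update
        mutex_unlock(lock);
    }
}
```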
GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory (ADSM) model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerators’ (GPUs’) memories. Moreover, the coherency of the data is automatically handled by the library. This removes the need for manual memory transfers (cudaMemcpy) between the host and GPU memories. Furthermore, GMAC assigns a different “virtual GPU” to each host thread, and the virtual GPUs are evenly mapped to physical GPUs. This is especially useful for multi-GPU programs, since each host thread can access the memory of all GPUs and GPU-to-GPU transfers can be performed with simple memcpy calls.
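For context, this is the explicit staging pattern that an ADSM layer such as GMAC hides (a plain CUDA sketch of conventional code; GMAC's own calls are deliberately omitted rather than guessed):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Conventional CUDA: the host keeps a separate copy and moves data by hand.
void conventional(int n)
{
    float* h = (float*)malloc(n * sizeof(float));
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // explicit copy in
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // explicit copy out

    cudaFree(d);
    free(h);
}

// With GMAC's shared address space, a buffer is allocated once through the
// library and the same pointer is then usable from both host code and the
// kernel; the library keeps the data coherent behind the scenes, so the two
// cudaMemcpy calls above disappear.
```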
The GAP (Universidad Politécnica de Valencia, Spain) and HPCA (Universidad Jaume I, Spain) research groups are proud to announce the public release of rCUDA 1.0. The rCUDA Framework enables concurrent remote use of CUDA-compatible devices, employing the sockets API for communication between clients and servers. Thus, it can be useful in three different environments:
- Clusters. To reduce the number of GPUs installed in high-performance clusters. This leads to energy savings, as well as related savings in acquisition cost, maintenance, space, cooling, etc.
- Academia. On low-performance networks, to offer all students concurrent access to a few high-performance GPUs.
- Virtual Machines. To enable access to the CUDA facilities of the physical machine from within virtual machines.
The current version of rCUDA (v1.0) implements all functions in the CUDA Runtime API version 2.3, excluding OpenGL and Direct3D interoperability. rCUDA 1.0 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides. The framework is free for any purpose under the terms and conditions of the GNU GPL/LGPL (where applicable) licenses.
For additional information, visit the rCUDA web page or Antonio Peña’s webpage.
Version 1.2 of Thrust, an open-source template library for developing CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. This version adds several new features.
The Thrust web page provides a quick-start guide, online documentation, many examples and introductory slides. Thrust is open-source software distributed under the OSI-approved Apache License v2.0.
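As a small illustration of the STL-like style Thrust provides (this sketch uses only long-standing Thrust calls and does not depend on features introduced in 1.2):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main()
{
    // Fill a host vector with random integers.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    // Copying into a device_vector transfers the data to the GPU.
    thrust::device_vector<int> d = h;

    // Algorithms mirror the STL: sort and reduce run on the device.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    return sum == 0;  // use the result so the computation is not optimized away
}
```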
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: 10.1109/SASP.2009.5226333)
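The following is a rough illustration of the kind of source-to-source transformation the paper describes, not FCUDA's actual output: the SPMD thread-index dimension of a CUDA kernel is re-expressed as a loop nest that a high-level synthesis tool can unroll and pipeline.

```cuda
// CUDA kernel: one logical thread per element.
__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Equivalent C for an HLS tool: the thread-index dimension becomes an inner
// loop exposing fine-grained parallelism the tool can unroll/pipeline into
// hardware, while the block-index loop exposes coarse-grained parallelism
// across generated cores.
void vec_add_hls(const float* a, const float* b, float* c, int n,
                 int num_blocks, int block_dim)
{
    for (int blockIdx_x = 0; blockIdx_x < num_blocks; ++blockIdx_x)        // coarse-grained
        for (int threadIdx_x = 0; threadIdx_x < block_dim; ++threadIdx_x)  // fine-grained
        {
            int i = blockIdx_x * block_dim + threadIdx_x;
            if (i < n) c[i] = a[i] + b[i];
        }
}
```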