Ke-Sen Huang has assembled a web page with links to all papers presented at two important conferences: High Performance Graphics (a synthesis of the Graphics Hardware and Interactive Ray Tracing conferences) and SIGGRAPH. Both conferences had quite a number of GPGPU-related publications. Highlights from HPG include a paper on computing minimum spanning trees on the GPU, one on optimizing stream compaction on GPUs, and a study from NVIDIA on the efficiency of GPUs, and of wide-SIMD architectures in general, on inherently imbalanced workloads such as ray tracing (among others).
The course notes and supplementary material for “Beyond Programmable Shading”, a full-day course held at SIGGRAPH 2009 on August 6, are now available online.
There are strong indications that the future of interactive graphics programming is a more flexible model than today’s OpenGL/Direct3D pipelines. Graphics developers need a basic understanding of how to combine emerging parallel programming techniques and more flexible graphics processors with the traditional interactive rendering pipeline. The first half of the course introduces the trends and directions in this emerging field. Topics include: parallel graphics architectures, parallel programming models for graphics, and game-developer investigations of the use of these new capabilities in future rendering engines.
The second half of the course has leaders from graphics hardware vendors, game development, and academic research present case studies that show how general parallel computation is being combined with the traditional graphics pipeline to boost image quality and spur new graphics algorithm innovation. Each case study discusses the mix of parallel programming constructs used, details of the graphics algorithm, and how the rendering pipeline and computation interact to achieve the technical goals.
The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current “Multicore+GPU” systems.
The MAGMA research is based on the idea that, to address the complex challenges of the emerging hybrid environments, optimal software solutions will themselves have to be hybridized, combining the strengths of different algorithms within a single framework. Building on this idea, the MAGMA group aims to design linear algebra algorithms and frameworks for hybrid manycore and GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers.
MAGMA v0.1 runs on CUDA-capable GPUs and multicore CPUs, and is available now.
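To make the hybridization idea concrete, here is a minimal Python sketch of a right-looking blocked LU factorization, the kind of algorithm MAGMA targets: the narrow panel factorization is the latency-bound step a multicore CPU handles well, while the large trailing-matrix update is the throughput-bound step suited to a GPU. The code is purely illustrative (plain Python lists, no pivoting, and certainly not MAGMA's implementation):

```python
def lu_blocked(A, nb=2):
    """Right-looking blocked LU factorization (no pivoting), in place.

    In a hybrid MAGMA-style code the panel factorization would run on
    the CPU while the trailing-matrix update (a matrix-matrix product)
    would be offloaded to the GPU. Here both are plain Python loops.
    """
    n = len(A)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # "CPU" step: unblocked LU of the tall panel A[k0:n, k0:k1]
        for k in range(k0, k1):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # Triangular solve: finish the U block row A[k0:k1, k1:n]
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # "GPU" step: trailing update A[k1:n, k1:n] -= L_panel * U_row
        for i in range(k1, n):
            for j in range(k1, n):
                A[i][j] -= sum(A[i][k] * A[k][j] for k in range(k0, k1))
    return A
```

After the call, the strictly lower triangle of A holds the unit lower-triangular factor L and the upper triangle holds U; only the trailing update, which dominates the arithmetic, needs GPU-class throughput.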
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), which support abundant parallelism in hardware.
In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem (both the scheduling and the assignment of filters to processors) as an Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available on GPUs. The proposed scheduling uses the scalar units within a GPU to exploit data parallelism and the multiprocessors to exploit task and pipeline parallelism. Further, it takes the synchronization and bandwidth limitations of GPUs into consideration, yielding speedups between 1.87x and 36.83x over a single-threaded CPU.
(Abhishek Udupa, R. Govindarajan, Matthew J. Thazhuthaveetil: Software Pipelined Execution of Stream Programs on GPUs, International Symposium on Code Generation and Optimization 2009 (CGO 2009), pages 200–209. DOI 10.1109/CGO.2009.20, direct link to PDF)
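For readers unfamiliar with the StreamIt model, the toy Python sketch below shows the basic idea of filters connected by channels. The greedy scheduler here is hypothetical and far simpler than the ILP-based software pipelining the paper formulates; it only illustrates how a filter graph exposes pipeline parallelism:

```python
from collections import deque

class Filter:
    """A StreamIt-style filter: consumes `pop` items per firing and
    produces some number of output items via its work function."""
    def __init__(self, pop, work):
        self.pop = pop    # items consumed from the input channel per firing
        self.work = work  # function: list of inputs -> list of outputs

def run_pipeline(filters, source):
    """Run a linear pipeline of filters to completion by repeatedly
    firing any filter whose input channel holds enough items.
    (A hypothetical greedy scheduler, for illustration only.)"""
    chans = [deque(source)] + [deque() for _ in filters]
    fired = True
    while fired:
        fired = False
        for f, cin, cout in zip(filters, chans, chans[1:]):
            while len(cin) >= f.pop:
                ins = [cin.popleft() for _ in range(f.pop)]
                cout.extend(f.work(ins))
                fired = True
    return list(chans[-1])
```

For example, a doubling filter feeding a pairwise-summing filter: `run_pipeline([Filter(1, lambda xs: [2 * x for x in xs]), Filter(2, lambda xs: [sum(xs)])], [1, 2, 3, 4])` yields `[6, 14]`. On a GPU, distinct filters and distinct firings of one filter can run concurrently, which is exactly the parallelism the paper's scheduler assigns to multiprocessors.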
Ocelot, developed at Georgia Tech, aims to provide a set of tools that enable low-level analysis of GPGPU applications, as well as a JIT compiler for generic architectures. Ocelot currently provides an implementation of the NVIDIA CUDA runtime capable of running the entire CUDA 2.2 and 2.1 SDKs.
Ocelot's features include a memory checker similar to Valgrind, detection mechanisms for non-coalesced memory accesses, full device emulation, and a number of useful debugging and performance-tuning features. The project roadmap lists future developments.
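As a rough illustration of the kind of analysis such a tool can perform, the sketch below models memory coalescing: the addresses issued by one warp are mapped onto aligned memory segments, and a pattern that touches more segments than its data actually requires gets flagged. This is a simplified model for illustration, not Ocelot's actual detector:

```python
def coalesced_segments(addrs, seg_bytes=128):
    """Number of aligned seg_bytes-wide memory segments touched by one
    warp's addresses; a fully coalesced access touches the minimum."""
    return len({a // seg_bytes for a in addrs})

def flag_uncoalesced(addrs, bytes_per_access=4, seg_bytes=128):
    """Flag an access pattern whose segment count exceeds the minimum
    needed for the bytes actually requested (ceiling division)."""
    needed = -(-len(addrs) * bytes_per_access // seg_bytes)
    return coalesced_segments(addrs, seg_bytes) > needed
```

Thirty-two threads reading consecutive 4-byte words from an aligned base touch a single 128-byte segment and pass; the same threads striding 128 bytes apart touch 32 segments and are flagged, which is the pattern an emulator-based tool can report back to the programmer.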
Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogeneous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. State-of-the-art GPUs exhibit very high computational capabilities at a reasonable price.
Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used across a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for NVIDIA GPUs. It is developed using the NVIDIA C for CUDA language and API, and works on all NVIDIA GPUs with CUDA support. SpMV4GPU uses the standard sparse matrix storage formats, such as the compressed row and compressed column storage formats. It hides the intricacies of GPU programming behind an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints and to optionally use special storage representations. Experimental evaluation demonstrates that the library provides a two- to four-times improvement over the equivalent solution provided by NVIDIA's CUDPP library.
Along with the library, an IBM Research technical paper is available: Muthu Manikandan Baskaran and Rajesh Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs”, IBM Research Technical Paper RC24704, 2008.
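For reference, sparse matrix-vector multiplication over the compressed sparse row (CSR) format reduces to a short loop; in a GPU implementation each row (or group of rows) is typically assigned to a thread or warp. A minimal CPU-side sketch in Python:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A * x for a matrix in CSR format.

    row_ptr[r]..row_ptr[r+1] delimit row r's nonzeros; col_idx and
    vals hold their column indices and values. On a GPU, the outer
    loop over rows is what gets parallelized across threads.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y
```

The GPU-specific difficulty, and the focus of libraries like SpMV4GPU, is that rows have irregular lengths, so a naive one-thread-per-row mapping leads to the load imbalance and uncoalesced accesses that the optimized storage layouts address.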
Release 1.1 of the CUDA Data-Parallel Primitives Library (CUDPP) is now available for download. The two major new features in CUDPP 1.1 are a very fast new radix sort implementation with support for sorting key-value pairs (with float or unsigned integer keys); and a new pseudorandom number generator, cudppRand. CUDPP 1.1 also replaces its former custom license with the standard BSD license. This greatly simplifies the CUDPP license details, and it also enables CUDPP to move into a public source repository such as Google Code in the near future. For more information, visit the CUDPP Website.
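The key-value sorting idea behind the new CUDPP radix sort can be illustrated sequentially: keys are sorted one digit at a time with a stable pass, carrying their values along. The Python sketch below shows the algorithm only; CUDPP's GPU implementation is, of course, parallel (built on scan primitives) and far faster:

```python
def radix_sort_pairs(keys, vals, bits=8):
    """LSD radix sort of unsigned-integer keys with associated values:
    one stable bucketing pass per `bits`-wide digit, least significant
    digit first. Stability of each pass makes the overall sort correct."""
    mask = (1 << bits) - 1
    shift = 0
    while any(k >> shift for k in keys):
        buckets = [[] for _ in range(mask + 1)]
        for k, v in zip(keys, vals):
            buckets[(k >> shift) & mask].append((k, v))
        keys = [k for b in buckets for k, _ in b]
        vals = [v for b in buckets for _, v in b]
        shift += bits
    return keys, vals
```

Sorting key-value pairs rather than bare keys is what makes the primitive useful for reordering arbitrary records, e.g. sorting particle or fragment indices by a computed float key as CUDPP 1.1 supports.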
A ScientificComputing.com article by Rob Farber explores the topic of numerical precision in the context of future exascale computing, asking the question “how do we know that anything we compute is correct?” The discussion centers on processors such as GPUs, which provide both single- and double-precision computation but at different throughput levels. “Taking a multi-precision approach can enhance the accuracy of a calculation and justify the use of mainly single-precision arithmetic (for performance) along with the occasional use of double-precision (64-bit) arithmetic for precision-sensitive operations,” writes Farber. (Rob Farber. “Numerical Precision: How Much is Enough?” ScientificComputing.com. Accessed July 1, 2008.)
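The multi-precision approach Farber describes can be demonstrated in a few lines: accumulate blocks in single precision for speed, but combine the block sums in double precision. In the Python sketch below, single precision is emulated by rounding every intermediate result to float32; the block size and the particular scheme are illustrative, not taken from the article:

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(xs):
    """Naive accumulation with every intermediate rounded to float32,
    mimicking a pure single-precision reduction."""
    s = 0.0
    for x in xs:
        s = to_f32(s + to_f32(x))
    return s

def sum_mixed(xs, block=256):
    """Hypothetical mixed-precision scheme: cheap float32 sums over
    small blocks, combined in a float64 accumulator."""
    total = 0.0  # double-precision accumulator
    for i in range(0, len(xs), block):
        total += sum_f32(xs[i:i + block])
    return total
```

Summing one large value followed by many tiny ones makes the point: the pure float32 sum absorbs the tiny terms entirely (they fall below the unit of last place of the running total), while the mixed scheme recovers almost all of them, at the cost of one double-precision add per block.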
This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA's GPGPU system, CUDA, into existing C++ applications. CuPP provides interfaces to recurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures. CuPP offers both a low-level interface, mostly consisting of smart pointers and memory allocation functions for GPU memory, and a high-level interface offering a C++ STL vector wrapper and the so-called type transformations. The wrapper can be used by both device and host to automatically keep data in sync. The type transformations allow developers to write their own data structures offering the same functionality as the CuPP vector, in case a vector does not meet the needs of the application. Furthermore, the type transformations offer a way to have two different representations of the same data on the host and the device, respectively. We demonstrate the benefits of using CuPP by integrating it into an example application, the open-source steering library OpenSteer. In particular, for this application we develop a uniform grid data structure for the k-nearest-neighbor problem that employs the type transformations. The paper finishes with a brief outline of another CUDA application, the Einstein@Home client, which also requires data structure redesign and thus may benefit from the type transformations and future work on CuPP.
(Jens Breitbart: CuPP – A framework for easy CUDA integration. HiPS 2009 workshop, held in conjunction with IPDPS 2009, Rome, Italy, May 2009)
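The "automatically keep data in sync" idea behind the CuPP vector wrapper can be sketched with a pair of lazily synchronized copies and validity flags. The class below is an illustration of the concept only, not CuPP's actual API; the "device" here is just a second Python list standing in for GPU memory:

```python
class SyncedVector:
    """One logical array with host and device copies, transferred
    lazily when the stale side is accessed. A write on one side
    invalidates the copy on the other side."""

    def __init__(self, data):
        self._mem = {'host': list(data), 'device': []}
        self._valid = {'host': True, 'device': False}
        self.transfers = 0  # counts host<->device copies performed

    def access(self, side, write=False):
        other = 'device' if side == 'host' else 'host'
        if not self._valid[side]:  # stale: copy from the other side
            self._mem[side] = list(self._mem[other])
            self._valid[side] = True
            self.transfers += 1
        if write:                  # writing invalidates the other copy
            self._valid[other] = False
        return self._mem[side]
```

A kernel writing through `access('device', write=True)` followed by a host read triggers exactly two transfers, and repeated host reads trigger none, which is the redundant-copy elimination a wrapper like this buys over manual `cudaMemcpy` calls.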
This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs. This paper describes the primitives included in the latest release of the CUDPP library.
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)
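For reference, the sequential semantics of the two primitives the report parallelizes fit in a few lines; the GPU versions compute the same results using the divide-and-conquer structure built on intra-warp scans described above. A plain Python sketch:

```python
def scan(op, xs):
    """Inclusive scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def segmented_scan(op, xs, flags):
    """Segmented inclusive scan: flags[i] == 1 marks the start of a
    new segment, so the running value resets at each segment head."""
    out, acc = [], None
    for x, f in zip(xs, flags):
        acc = x if (f or acc is None) else op(acc, x)
        out.append(acc)
    return out
```

With addition, `scan` of `[1, 2, 3, 4]` gives the running sums `[1, 3, 6, 10]`, and `segmented_scan` restarts those sums wherever a flag is set, which is what lets one primitive process many variable-length sequences (the basis of the flattening transform) in a single data-parallel pass.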