Designing Efficient Sorting Algorithms for Manycore GPUs

March 1st, 2009

This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. To appear.)

CUDPP 1.0a Adds Segmented Scan and Sparse Matrix-Vector Multiplication

April 20th, 2008

Version 1.0 alpha of CUDPP, the CUDA Data-Parallel Algorithms Library, has been released. This version adds the segmented scan algorithm and sparse matrix-vector multiplication to CUDPP’s repertoire. Other new features include an improved “plan”-based configuration interface, an improved scan algorithm for higher performance, support for more inclusive scans and more scan operators, an improved stream compaction interface. In addition, CUDPP 1.0a adds support for CUDA 2.0 and the Windows Vista and Mac OS X (10.5.2 and higher) operating systems. CUDPP works with NVIDIA CUDA versions 1.1 and higher.

CUDA Data-Parallel Primitives Library Released

November 5th, 2007

CUDPP is the CUDA Data Parallel Primitives Library for NVIDIA CUDA. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. The first beta release of CUDPP is now available, as is the searchable online documentation.

Graphics Hardware 2007 Papers

August 16th, 2007

On 4-5 August 2007, San Diego hosted the annual Graphics Hardware conference. GPGPU figured prominently in three papers:

  • As transistors get smaller, their transient failure rates increase. Future architectures must adapt to address the resulting reliability problems. Jeremy Sheaffer presented a paper demonstrating a hardware-based redundancy approach to ensure reliability on GPGPU applications. (“A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors”. Jeremy Sheaffer, University of Virginia; David Luebke, NVIDIA Research; Kevin Skadron, University of Virginia.)
  • Magnus Strengert presented a generic, minimally intrusive, and application-transparent GLSL debugger that operates transparently to the application. In it, shader debugging is performed on a per-draw call level; it allows singlestepping and the inspection of arbitrary variable content. Linux code is available and Windows code is expected by the end of the year. (“A Hardware-Aware Debugger for the OpenGL Shading Language”. Magnus Strengert, Thomas Klein, and Thomas Ertl, University of Stuttgart.)
  • One critical need for GPGPU developers is a library of general-purpose building blocks for GPU computation. Shubhabrata Sengupta presented a paper describing a GPU implementation of the “scan primitives” and their use in novel GPU implementations of quicksort, efficient sparse matrix-vector multiplication, and tridiagonal matrix systems. This paper won the Best Paper award and the authors are preparing an open-source release. (“Scan Primitives for GPU Computing”. Shubhabrata Sengupta, UC Davis; Mark Harris, NVIDIA Corporation; Yao Zhang, UC Davis; John D. Owens, UC Davis.)

All Graphics Hardware 2007 papers are available in the ACM digital library. In addition, the GH07 program page contains slides for all talks as well as two keynote talks (Chas. Boyd of the Microsoft DirectX team: “Mass Market Applications of Data-Parallel Computing” and Michael Jones, chief technologist of Google Earth: “GPUs for the true mass market”) and vendor talks from AMD and NVIDIA about their latest processors (AMD Radeon HD 2900 and NVIDIA’s Tesla).

Workshop: Data-Parallel Programming Models for Many-Core Architectures

March 7th, 2007

Data-parallel programming models are emerging as an extremely attractive model for parallel programming, driven by several factors. Through deterministic semantics and constrained synchronization mechanisms, they provide race-free parallel-programming semantics. Furthermore, data-parallel programming models free programmers from reasoning about the details of the underlying hardware and software mechanisms for achieving parallel execution and facilitate effective compilation. Finally, efforts in the GPGPU movement and elsewhere have matured implementation technologies for streaming and data-parallel programming models to the point where high performance can be reliably achieved.

This workshop gathers commercial and academic researchers, vendors, and users of data-parallel programming platforms to discuss implementation experience for a broad range of many-core architectures and to speculate on future programming-model directions. Participating institutions include AMD, Electronic Arts, Intel, Microsoft, NVIDIA, PeakStream, RapidMind, and The University of New South Wales. (Link to Call for Participation, Data-Parallel Programming Models for Many-Core Architectures)

NVIDIA Releases CUDA for GPU Computing

February 16th, 2007

A beta of NVIDIA’s CUDA development environment, NVIDIA’s new technology for computing with GPUs, is now posted on developer.nvidia.com. This beta release of CUDA contains a C compiler for the GPU and an SDK with examples to get you started coding for the GPU. From the press release:

GPU Computing with CUDA is a new approach to computing where hundreds of on-chip processors simultaneously communicate and cooperate to solve complex computing problems. Applications that require mathematically intensive computing on large amounts of data are ideal targets for GPU Computing. NVIDIA NVIDIA’s CUDA technology is available in GeForce 8800 graphics products and future NVIDIA Quadro Professional Graphics solutions based on 8-series (G8X) GPUs. Developers are invited to download the beta version of the CUDA Software Developers Kit (SDK) and C compiler for Windows XP and Linux (RedHat Release 4 Update 3) from the NVIDIA Developer Web site at developer.nvidia.com/cuda. GPU Computing Forums for news, discussion and programming tips are also available at forums.nvidia.com.

New GPGPU Tutorials added to GPGPU.org Developer Page

January 4th, 2007

We have added links to some great introductory GPGPU tutorials to the Developer Page. These tutorials, written by Dominik Göddeke from Dortmund University, cover basic GPGPU concepts, parallel reductions, and fast data transfers.

Page 2 of 212