Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid

March 3rd, 2010

Abstract:

We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.

(Dominik Göddeke and Robert Strzodka: “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid” , accepted in: IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, Mar. 2010. Link.)

CUDPP Users: Please Complete This Survey!

February 11th, 2010

The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey.  This survey will help the CUDPP Team prioritize new development and support for existing and new features.

Thrust 1.1 Released

September 11th, 2009

Thrust (v1.1) is  an open-source template library for developing CUDA applications.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:

To get started with Thrust, first download Thrust and then follow the online tutorial.  Refer to the online documentation for a complete list of features.  Many concrete examples and a set of introductory slides are also available. As the following code example shows, Thrust programs are concise and readable. Read the rest of this entry »

Efficient parallel scan algorithms for GPUs

June 24th, 2009

This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs.   This paper describes the primitives included in the latest release of the CUDPP library.

Abstract:

Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.

(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)

Fast and Scalable List Ranking on the GPU

April 28th, 2009

Abstract from the paper by Rehman et al.:

General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations of List Ranking—a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. In our best implementation, we introduce a GPU-optimized, recursive version of the Helman-JaJa algorithm. Our implementation can rank a random list of 8 million elements in just over 100 milliseconds, and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss some practical issues that come to the fore when working with massively multi-threaded architectures, especially for algorithms with highly irregular memory access patterns. (M. Suhail Rehman, K. Kothapalli, P.J. Narayanan. Fast and Scalable List Ranking on the GPU. 23rd International Conference on Supercomputing (ICS). New York, USA, June 2009. (To Appear))

Designing Efficient Sorting Algorithms for Manycore GPUs

March 1st, 2009

This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. To appear.)

Ph.D. Dissertation: Glift Generic GPU Data Structures, by Aaron Lefohn

January 18th, 2007

This Ph.D. dissertation by Aaron Lefohn at the University of California, Davis describes the Glift GPU data structure abstraction and its application to both GPU-based data-parallel and interactive rendering algorithms. The applications include octree 3D painting, adaptive shadow maps, resolution matched shadow maps, heat-diffusion depth-of-field, and a GPU-based direct solver for tridiagonal linear systems. While much of this work has been posted previously, this dissertation contains a more in-depth discussion of the Glift data structure library and introduces several GPGPU and rendering algorithms that are not yet published. This dissertation demonstrates that a data structure abstraction for GPUs can simplify the description of new and existing data structures, stimulate development of complex GPU algorithms, and perform equivalently to hand-coded implementations. The dissertation also presents a case that future interactive rendering solutions will be an inseparable mix of general-purpose, data-parallel algorithms and traditional graphics programming. (Aaron Lefohn, “Glift: Generic Data Structures for Graphics Hardware”, Ph.D. dissertation, Computer Science Department, University of California Davis, September 2006.)

Interactive Depth of Field Using Simulated Diffusion on a GPU

January 18th, 2007

This Pixar Animation Studios Technical Report by Kass, Lefohn, and Owens describes a GPU-based data-parallel direct tridiagonal linear solver. To the authors’ knowledge, this is the first reported direct, linear-time tridiagonal GPU solver. The solver is used to implement a new heat-diffusion-based depth-of-field preview algorithm; and the paper describes solving thousands of tridiagonal systems, each with hundreds of elements, on the GPU at interactive rendering rates. The alternating direction implicit solution gives rise to separable spatially varying recursive (infinite-impulse response, IIR) filters that can compute large-kernel convolutions in constant time per pixel while respecting the boundaries between in-focus and out-of-focus objects. Recursive filters have traditionally been viewed as problematic for GPUs, but using the well-established method of cyclic reduction of tridiagonal systems, the authors are able to parallelize the computation and implement an efficient solution in terms of GPGPU primitives. (Michael Kass, Aaron Lefohn, and John Owens. Interactive Depth of Field Using Simulated Diffusion on the GPU, Technical Report #06-01, Pixar Animation Studios, January 2006.)

GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures

August 14th, 2006

This paper presents a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). This approach is competitive with common sequential sorting algorithms not only from a theoretical viewpoint, it is also very fast from a practical viewpoint. The paper presents an implementation on modern programmable graphics hardware (GPUs). On recent GPUs this optimal parallel sorting approach has shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. (GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures Alexander Gress and Gabriel Zachmann. Proc. 20th IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS), 2006.)

A work-efficient step-efficient prefix-sum algorithm

June 22nd, 2006

This extended abstract by Sengupta et al. presents a work-efficient step-efficient prefix-sum algorithm. This algorithm achieves a three to four fold speedup over the step-efficient prefix-sum algorithm presented by Daniel Horn in GPU Gems 2. It can also be tuned to efficiently run on future hardware which would have a higher degree of parallelism. (A work-efficient step-efficient prefix-sum algorithm. Shubhabrata Sengupta, Aaron E. Lefohn, John D. Owens in in Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures.)