This publication describes efficient low-level algorithms for performing relational queries on parallel processors, such as NVIDIA Fermi or Kepler. Relations are stored in GPU memory as sorted arrays of tuples, and manipulated by relational operators that preserve the sorted property. Most significantly, this work introduces algorithms for JOIN and SET INTERSECTION/UNION/DIFFERENCE that can process data at over 50 GB/s.
Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations. Examples include general purpose graphics processing units (GPUs) such as NVIDIA’s Fermi, Intel’s Sandy Bridge, and AMD’s Fusion processors. This paper explores the mapping of primitive relational algebra operations onto GPUs. In particular, we focus on algorithms and data structure design, identifying a fundamental conflict between the structure of algorithms with good computational complexity and that of algorithms with memory access patterns and instruction schedules that achieve peak machine utilization. To reconcile this conflict, our design space exploration converges on a hybrid multi-stage algorithm that devotes a small amount of the total runtime to pruning input data sets using an irregular algorithm with good computational complexity. The partial results are then fed into a regular algorithm that achieves near peak machine utilization. The design process leading to the most efficient algorithm for each stage is described, detailing alternative implementations, their performance characteristics, and an explanation of why they were ultimately abandoned. The least efficient algorithm (JOIN) achieves 57% to 72% of peak machine performance depending on the density of the input. The most efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve 86% to 92% of peak machine performance across all input data sets. To the best of our knowledge, these represent the best known published results to date for any implementations.
This work lays the foundation for the development of a relational database system that achieves good scalability on a Multi-Bulk-Synchronous-Parallel (M-BSP) processor architecture. Additionally, the algorithm design may offer insights into the design of parallel and distributed relational database systems. It leaves the problems of query planning, operator→query synthesis, corner case optimization, and system/OS interaction as future work that would be necessary for commercial deployment.
(Gregory Diamos, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili: “Relational Algorithms for Multi-Bulk-Synchronous Processors”, NVIDIA Tech Report, June 2012. [WWW])
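To make the "sorted arrays of tuples" representation concrete, here is a minimal sequential sketch of a merge-style JOIN and SET INTERSECTION over sorted inputs. It is not the report's GPU algorithm: the function names `merge_join` and `sorted_intersection` and the `(key, value)` tuple layout are assumptions made for illustration. It only shows why keeping relations sorted lets both operators run in a single linear pass and emit output that is itself sorted.

```python
def merge_join(left, right):
    """Inner-join two relations given as lists of (key, value) tuples.

    Assumption: both inputs are sorted by key. The output is sorted by key too,
    so it can feed another sorted-array operator without re-sorting.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of tuples sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i_end, j_end
    return out


def sorted_intersection(a, b):
    """Intersect two sorted, duplicate-free sequences in one linear pass."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out
```

A parallel version would have to partition both inputs so that each thread group can run this kind of merge on an independent key range; that partitioning step is not shown here.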
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.
We present a BFS parallelization focused on fine-grained task management constructed from an efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
(Duane Merrill, Michael Garland and Andrew Grimshaw: “Scalable GPU graph traversal”, Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming (PPoPP’12), pp. 117-128, February 2012. [DOI])
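The role of the prefix sum in a frontier-based BFS can be sketched sequentially as follows. This is an illustration under stated assumptions, not the authors' GPU implementation: the graph is assumed to be in CSR form (`row_offsets`, `column_indices`), and the helper names `exclusive_scan` and `bfs` are hypothetical. Each frontier vertex reports its degree, a scan turns the degrees into write offsets, and every vertex then copies its neighbours into a private slice of the output, so no two vertices contend for the same slot. Each vertex and edge is touched once, which gives the O(|V|+|E|) work bound.

```python
def exclusive_scan(values):
    """Return (offsets, total) where offsets[i] is the sum of values[:i]."""
    offsets = [0] * len(values)
    total = 0
    for i, v in enumerate(values):
        offsets[i] = total
        total += v
    return offsets, total


def bfs(row_offsets, column_indices, source):
    """Level-synchronous BFS over a CSR graph; returns depth per vertex (-1 if unreached)."""
    n = len(row_offsets) - 1
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # 1. Each frontier vertex reports how many neighbours it will expand.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # 2. The prefix sum assigns each vertex a disjoint output range.
        offsets, total = exclusive_scan(degrees)
        # 3. Gather neighbours into the preallocated edge frontier.
        edge_frontier = [0] * total
        for v, off in zip(frontier, offsets):
            start = row_offsets[v]
            for k in range(row_offsets[v + 1] - start):
                edge_frontier[off + k] = column_indices[start + k]
        # 4. Drop already-visited vertices and label the rest.
        level += 1
        next_frontier = []
        for u in edge_frontier:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return depth
```

On a GPU, steps 1 to 3 would run in parallel across the frontier, and step 4 would need its own filtering strategy to handle duplicates discovered concurrently.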
clpp is an OpenCL library of data-parallel algorithm primitives such as parallel prefix sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information, visit http://code.google.com/p/clpp.
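As an illustration of how a scan primitive becomes a building block, the following sketch implements stream compaction with an exclusive prefix sum. It is plain sequential Python and does not use clpp's API; the names `exclusive_scan` and `compact` are assumptions chosen for this example.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[:i]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(items, keep):
    """Keep the items for which keep(item) is true, preserving order."""
    # 1. Mark survivors with 1 and everything else with 0.
    flags = [1 if keep(x) else 0 for x in items]
    # 2. The scan of the flags is each survivor's output position.
    positions = exclusive_scan(flags)
    count = positions[-1] + flags[-1] if items else 0
    # 3. Scatter survivors; every write targets a distinct slot.
    result = [None] * count
    for x, f, p in zip(items, flags, positions):
        if f:
            result[p] = x
    return result
```

Because the scatter in step 3 never writes the same slot twice, each element could be handled by its own work-item in a data-parallel implementation.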
Today NVIDIA announced the upcoming 4.0 release of CUDA. While most of the major CUDA releases accompanied a new GPU architecture, 4.0 is a software-only release, but that doesn’t mean there aren’t a lot of new features. With this release, NVIDIA is aiming to lower the barrier to entry to parallel programming on GPUs, with new features including easier multi-GPU programming, a unified virtual memory address space, the powerful Thrust C++ template library, and automatic performance analysis in the Visual Profiler tool. Full details follow in the quoted press release below.
Thrust v1.3, an open-source template library for CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing.
Version 1.3 adds several new features, including:
- a state-of-the-art sorting implementation, recently featured on Slashdot.
- performance improvements to stream compaction and reduction
- robust error reporting and failure detection
- support for CUDA 3.2 and GF104-based GPUs
- search algorithms
- and more!
Get started with Thrust today! First download Thrust v1.3 and then follow the online quick-start guide. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available.
We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.
(Dominik Göddeke and Robert Strzodka: “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid”, accepted in: IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, Mar. 2010. Link.)
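For readers unfamiliar with cyclic reduction, the following is a minimal sequential sketch of the textbook scheme for a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]. It is not the paper's GPU implementation; the function name `cyclic_reduction` is hypothetical, and the code assumes a system (for example a diagonally dominant one) that can be solved without pivoting. At each level, every second remaining equation eliminates its two neighbours, and the updates within one level are independent of each other, which is what makes the method attractive for parallel hardware.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by cyclic reduction.

    a: sub-diagonal (a[0] is ignored), b: diagonal, c: super-diagonal
    (c[-1] is ignored), d: right-hand side. Assumes no pivoting is needed.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0

    # Forward reduction: at stride s, equation i absorbs equations i-s and i+s.
    s = 1
    while 2 * s - 1 < n:
        for i in range(2 * s - 1, n, 2 * s):  # independent per i
            lo, hi = i - s, i + s
            alpha = -a[i] / b[lo]
            if hi < n:
                gamma = -c[i] / b[hi]
                b[i] += alpha * c[lo] + gamma * a[hi]
                d[i] += alpha * d[lo] + gamma * d[hi]
                c[i] = gamma * c[hi]
            else:
                b[i] += alpha * c[lo]
                d[i] += alpha * d[lo]
                c[i] = 0.0
            a[i] = alpha * a[lo]
        s *= 2

    # Back substitution: solve the coarsest equation first, then refine.
    x = [0.0] * n
    while s >= 1:
        for i in range(s - 1, n, 2 * s):  # independent per i
            left = x[i - s] if i - s >= 0 else 0.0
            right = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        s //= 2
    return x
```

The classic formulation assumes n = 2^k - 1 unknowns; the bounds checks above are an addition in this sketch so that other sizes also work.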
The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey. This survey will help the CUDPP Team prioritize new development and support for existing and new features.
Thrust (v1.1) is an open-source template library for developing CUDA applications. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:
To get started with Thrust, first download Thrust and then follow the online tutorial. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available. Thrust programs are concise and readable.
This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs. This paper describes the primitives included in the latest release of the CUDPP library.
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)
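A segmented scan can be expressed as an ordinary scan over (head flag, value) pairs with a modified associative operator. The sketch below is a sequential simulation in Python, not the report's CUDA code: the names `inclusive_scan`, `segmented_op` and `segmented_scan` are assumptions, and the step-doubling loop only mimics the kind of log-step pattern a small group of threads could execute in lockstep.

```python
def inclusive_scan(items, op):
    """Step-doubling inclusive scan; each pass reads only the previous pass."""
    data = list(items)
    offset = 1
    while offset < len(data):
        prev = data
        # Every position could be computed by its own thread in this pass.
        data = [prev[i] if i < offset else op(prev[i - offset], prev[i])
                for i in range(len(prev))]
        offset *= 2
    return data


def segmented_op(left, right):
    """Associative operator on (flag, value) pairs for a segmented sum."""
    lf, lv = left
    rf, rv = right
    # A head flag on the right operand blocks contributions from the left.
    return (lf or rf, rv if rf else lv + rv)


def segmented_scan(values, head_flags):
    """Inclusive sum-scan that restarts wherever head_flags[i] is true."""
    pairs = [(bool(f), v) for f, v in zip(head_flags, values)]
    return [v for _, v in inclusive_scan(pairs, segmented_op)]
```

For example, `segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1])` scans the three segments `[1, 2, 3]`, `[4, 5]` and `[6]` independently.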
Abstract from the paper by Rehman et al.:
General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. In our best implementation, we introduce a GPU-optimized, recursive version of the Helman-JaJa algorithm. Our implementation can rank a random list of 8 million elements in just over 100 milliseconds, and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss some practical issues that come to the fore when working with massively multi-threaded architectures, especially for algorithms with highly irregular memory access patterns.
(M. Suhail Rehman, K. Kothapalli, P.J. Narayanan. Fast and Scalable List Ranking on the GPU. 23rd International Conference on Supercomputing (ICS). New York, USA, June 2009. (To Appear))
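The general shape of a splitter-based list ranking in the style of Helman-JaJa can be sketched as below. This is a sequential illustration under assumptions, not the paper's recursive GPU version: the list is assumed to be a successor array `succ` with `-1` marking the tail, the function name `list_rank` and its parameters are hypothetical, and the reduced list of sublist heads is ranked with a simple sequential walk rather than recursively.

```python
import random


def list_rank(succ, head, num_splitters=4, seed=0):
    """Return rank[v] = distance of node v from head in a linked list.

    succ[v] is the successor of v, or -1 at the tail. Assumes a non-empty,
    well-formed list and num_splitters >= 1.
    """
    n = len(succ)
    rng = random.Random(seed)

    # 1. Pick splitters; the head is always one so every node is covered.
    others = [v for v in range(n) if v != head]
    extra = rng.sample(others, min(num_splitters - 1, len(others)))
    splitters = [head] + extra
    is_splitter = [False] * n
    for s in splitters:
        is_splitter[s] = True

    # 2. Walk each sublist on its own (one thread per splitter on a GPU).
    owner = [0] * n        # splitter that starts the node's sublist
    local_rank = [0] * n   # position inside that sublist
    length = {}
    next_splitter = {}
    for s in splitters:
        node, r = s, 0
        while True:
            owner[node] = s
            local_rank[node] = r
            r += 1
            node = succ[node]
            if node == -1 or is_splitter[node]:
                break
        length[s] = r
        next_splitter[s] = node

    # 3. Rank the short list of sublist heads (sequentially in this sketch).
    offset = {}
    s, total = head, 0
    while s != -1:
        offset[s] = total
        total += length[s]
        s = next_splitter[s]

    # 4. Final rank = offset of the owning sublist + local rank.
    return [offset[owner[v]] + local_rank[v] for v in range(n)]
```

The result does not depend on which splitters are chosen; the choice only affects how evenly the work in step 2 is divided.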