New CLOGS library with sort and scan primitives for OpenCL

February 5th, 2012

CLOGS is a library for higher-level operations on top of the OpenCL C++ API. It is designed to integrate with other OpenCL code, including synchronization using OpenCL events. Currently only two operations are supported: radix sorting and exclusive scan. Radix sort supports all the unsigned integral types as keys, and all the built-in scalar and vector types suitable for storage in buffers as values. Scan supports all the integral types. It also supports vector types, which allows for limited multi-scan capabilities.

Version 1.0 of the library has just been released. The home page is

OpenCL Parallel Primitives Library

June 3rd, 2011

clpp is an OpenCL library of data-parallel algorithm primitives such as parallel prefix sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information, visit

Back 40 Computing: High Performance GPU Building Blocks

August 22nd, 2010

The Back 40 Computing project aims at providing a collection of high performance GPU computing building blocks. It is maintained by Duane Merrill from the University of Virginia. Highlights of the current release include the fastest  Radix Sort implementation on GPUs to date, capable of sorting over 1 billion keys per second. For more details you can also see this (pre-Fermi) Techreport (direct PDF link).

Source code and documentation are available on Google Code.

CUDPP 1.1.1

April 29th, 2010

The CUDA Data Parallel Primitives Library (CUDPP) is a cross-platform, open-source library of data-parallel algorithm primitives such as parallel prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. CUDPP runs on processors that support CUDA.

CUDPP release 1.1.1 is a bugfix release with fixes for scan, segmented scan, stream compaction, and radix sort on the NVIDIA Fermi (sm_20) architecture, including GeForce 400 series and Tesla 20 series GPUs.  It also includes improvements and bugfixes for radix sorts on 64-bit OSes, and fixes for 64-bit builds on MS Windows OSes and Apple OS X 10.6 (Snow Leopard).  Change Log.

Thrust 1.1 Released

September 11th, 2009

Thrust (v1.1) is  an open-source template library for developing CUDA applications.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:

To get started with Thrust, first download Thrust and then follow the online tutorial.  Refer to the online documentation for a complete list of features.  Many concrete examples and a set of introductory slides are also available. As the following code example shows, Thrust programs are concise and readable. Read the rest of this entry »

Designing Efficient Sorting Algorithms for Manycore GPUs

March 1st, 2009

This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. To appear.)

A Memory Model for Scientic Algorithms on Graphics Processors

October 4th, 2006

This Supercomputing 2006 paper by Govindaraju et al. presents a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). The memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. It incorporates many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and uses the 3C’s model to analyze the cache misses. Moreover, the paper presents techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of the model, the paper highlights its performance on three memory-intensive scientific applications: sorting, Fast Fourier Transform and dense matrix multiplication. In practice, their cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. The paper also compares its results with prior GPU-based and CPU-based implementations on high-end processors. In practice, they are able to achieve 2x-5x performance improvement. (A Memory Model for Scientic Algorithms on Graphics Processors)

GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures

August 14th, 2006

This paper presents a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). This approach is competitive with common sequential sorting algorithms not only from a theoretical viewpoint, it is also very fast from a practical viewpoint. The paper presents an implementation on modern programmable graphics hardware (GPUs). On recent GPUs this optimal parallel sorting approach has shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. (GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures Alexander Gress and Gabriel Zachmann. Proc. 20th IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS), 2006.)

GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management

April 4th, 2006

GPUTeraSort sorts billion-record wide-key databases using the data and task parallelism on the graphics processing unit (GPU) to perform memory-intensive and compute-intensive tasks while the CPU performs I/O and resource management. It exploits both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface to achieve higher aggregate memory bandwidth than purely CPU-based algorithms. It also pipelines disk transfers to achieve near-peak I/O performance. GPUTera-Sort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk, and (2) read, merge, write. We tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort price-performance. These results suggest that a GPU co-processor can significantly improve performance on large data processing tasks. (GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management. Naga K. Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. Proceedings of ACM SIGMOD 2006.)

High Performance Sorting on a GPU

July 1st, 2005

This paper by Govindaraju et al. describes a cache-efficient bitonic sorting algorithm on GPUs. The algorithm uses the special purpose texture mapping and programmable hardware to sort IEEE 32-bit floating point data including pointers, and has been used to perform stream data mining and relational database queries. Their results indicate a significant performance improvement over prior CPU-based and GPU-based sorting algorithms. ( GPUSORT: A High Performance Sorting Library”. Also see this Tom’s Hardware article)