CUDPP 2.0: parallel hash tables, tridiagonal solver, parallel reductions, and double precision

August 8th, 2011

CUDPP release 2.0 is a major new release of the CUDA Data-Parallel Primitives Library, with exciting new features. The public interface has undergone a minor redesign to provide thread safety. Parallel reductions (cudppReduce) and a tridiagonal system solver (cudppTridiagonal) have been added, and a new component library, cudpp_hash, provides fast data-parallel hash table functionality. In addition, support for 64-bit data types (double as well as long long and unsigned long long) has been added to all CUDPP algorithms, and a variety of bugs have been fixed.  For a complete list of changes, see the change log. CUDPP 2.0 is available for download now.

Thrust v1.3 release

October 7th, 2010

Thrust v1.3, an open-source template library for CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing.

Version 1.3 adds several new features, including:

  • a state-of-the-art sorting implementation, recently featured on Slashdot.
  • performance improvements to stream compaction and reduction
  • robust error reporting and failure detection
  • support for CUDA 3.2 and gf104-based GPUs
  • search algorithms
  • and more!

Get started with Thrust today! First download Thrust v1.3 and then follow the online quick-start guide. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available. Read the rest of this entry »

CUDPP 1.1.1

April 29th, 2010

The CUDA Data Parallel Primitives Library (CUDPP) is a cross-platform, open-source library of data-parallel algorithm primitives such as parallel prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. CUDPP runs on processors that support CUDA.

CUDPP release 1.1.1 is a bugfix release with fixes for scan, segmented scan, stream compaction, and radix sort on the NVIDIA Fermi (sm_20) architecture, including GeForce 400 series and Tesla 20 series GPUs.  It also includes improvements and bugfixes for radix sorts on 64-bit OSes, and fixes for 64-bit builds on MS Windows OSes and Apple OS X 10.6 (Snow Leopard).  Change Log.

CUDPP Users: Please Complete This Survey!

February 11th, 2010

The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey.  This survey will help the CUDPP Team prioritize new development and support for existing and new features.

Some older publications worth reading

January 17th, 2010

Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted on  in a timely manner, please remember to use the news submission form.

  • Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/
  • Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware /compiler designer, and analyze CUDA as one current representative  (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
  • Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
  • William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008,  DOI 10.1145/1365490.1365501).
  • BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM Siggraph 2008, August 2008, DOI 10.1145/1399504.1360618).
  • Finally, Che et al. and Garland et al. survey the field of GPU computing and discuss many different application domains. These articles are, in addition to the ones we have collected on the developer pages, recommended to GPGPU newcomers.

Thrust 1.1 Released

September 11th, 2009

Thrust (v1.1) is  an open-source template library for developing CUDA applications.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:

To get started with Thrust, first download Thrust and then follow the online tutorial.  Refer to the online documentation for a complete list of features.  Many concrete examples and a set of introductory slides are also available. As the following code example shows, Thrust programs are concise and readable. Read the rest of this entry »

Intel acquires RapidMind

August 23rd, 2009

Intel has acquired RapidMind, the company behind the RapidMind (formerly Sh) programming environment targeting multicore CPUs, AMD and NVIDIA GPUs and the Cell processor. The RapidMind Platform continues to be available, including support. In the medium term RapidMind’s technology and products will be integrated with Intel’s data-parallel products, in particular Intel’s Ct technology.

This blog entry by James Reinders from Intel describes the acquisition and future plans in more detail.

CUDPP 1.1 Now Available

July 1st, 2009

Release 1.1 of the CUDA Data-Parallel Primitives Library (CUDPP) is now available for download.  The two major new features in CUDPP 1.1 are a very fast new radix sort implementation with support for sorting key-value pairs (with float or unsigned integer keys); and a new pseudorandom number generator, cudppRand. CUDPP 1.1 also replaces its former custom license with the standard BSD license. This greatly simplifies the CUDPP license details, and it also enables CUDPP to move into a public source repository such as Google Code in the near future. For more information, visit the CUDPP Website.

Efficient parallel scan algorithms for GPUs

June 24th, 2009

This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs.   This paper describes the primitives included in the latest release of the CUDPP library.


Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.

(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)

Fast and Scalable List Ranking on the GPU

April 28th, 2009

Abstract from the paper by Rehman et al.:

General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations of List Ranking—a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. In our best implementation, we introduce a GPU-optimized, recursive version of the Helman-JaJa algorithm. Our implementation can rank a random list of 8 million elements in just over 100 milliseconds, and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss some practical issues that come to the fore when working with massively multi-threaded architectures, especially for algorithms with highly irregular memory access patterns. (M. Suhail Rehman, K. Kothapalli, P.J. Narayanan. Fast and Scalable List Ranking on the GPU. 23rd International Conference on Supercomputing (ICS). New York, USA, June 2009. (To Appear))