CUDPP Documentation
1.1
CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.
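To make these terms concrete, here is a minimal serial sketch in plain C of what a scan computes and how stream compaction can be built on top of it. It is a CPU reference for illustration only, not CUDPP code: the names `exclusive_scan` and `compact` are hypothetical and are not part of the CUDPP API, and a GPU implementation would compute the same results in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
 * Returns the sum of all n inputs. (Hypothetical helper, not a CUDPP call.) */
int exclusive_scan(const int *in, int *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        int value = in[i];   /* read first so that in and out may alias */
        out[i] = running;
        running += value;
    }
    return running;
}

/* Stream compaction: copy every in[i] whose flags[i] is 1 to the front of
 * out, preserving order. flags must contain only 0 or 1. offsets is scratch
 * space of n ints. Returns the number of elements kept. */
size_t compact(const int *in, const int *flags, int *offsets, int *out,
               size_t n)
{
    assert(n == 0 || (in != NULL && flags != NULL &&
                      offsets != NULL && out != NULL));
    /* offsets[i] = number of kept elements that precede position i */
    size_t kept = (size_t)exclusive_scan(flags, offsets, n);
    for (size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = in[i];
        }
    }
    return kept;
}
```

The scan of the flags gives each kept element its destination index, so every element can be written independently of the others; that independence is what makes the pattern suitable for data-parallel hardware.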
Homepage for CUDPP:
http://www.gpgpu.org/developer/cudpp/
Announcements and discussion of CUDPP are hosted on the CUDPP Google Group.
You may want to start by browsing the CUDPP Public Interface. For information on building CUDPP, see Building CUDPP.
The "apps" subdirectory included with CUDPP has a few source code samples that use CUDPP:
- simpleCUDPP, a simple example of using cudppScan()
- satGL, an example of using cudppMultiScan() to generate a summed-area table (SAT) of a scene rendered in real time. The SAT is then used to simulate depth of field blur.
- cudpp_testrig, a comprehensive test application for all the functionality of CUDPP
We have also provided a code walkthrough of the simpleCUDPP example.
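For readers unfamiliar with summed-area tables, the following plain-C sketch shows one common way to build one from scans: an inclusive scan along every row, followed by an inclusive scan down every column. It is a serial CPU illustration under that assumption, not the satGL source; `summed_area_table` and `box_sum` are hypothetical names, and how satGL itself arranges its passes is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a row-major width x height image into its summed-area table in
 * place: afterwards img[y*width + x] holds the sum of all original pixels
 * in the rectangle from (0,0) to (x,y) inclusive. */
void summed_area_table(float *img, size_t width, size_t height)
{
    assert(img != NULL || width * height == 0);
    /* Pass 1: inclusive scan along each row (rows are independent). */
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 1; x < width; ++x) {
            img[y * width + x] += img[y * width + x - 1];
        }
    }
    /* Pass 2: inclusive scan down each column. */
    for (size_t y = 1; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            img[y * width + x] += img[(y - 1) * width + x];
        }
    }
}

/* Sum of the original pixels in the box [x0..x1] x [y0..y1] (inclusive),
 * computed from at most four table lookups. */
float box_sum(const float *sat, size_t width,
              size_t x0, size_t y0, size_t x1, size_t y1)
{
    assert(sat != NULL && x0 <= x1 && y0 <= y1 && x1 < width);
    float a = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * width + (x0 - 1)] : 0.0f;
    float b = (y0 > 0) ? sat[(y0 - 1) * width + x1] : 0.0f;
    float c = (x0 > 0) ? sat[y1 * width + (x0 - 1)] : 0.0f;
    float d = sat[y1 * width + x1];
    return d - b - c + a;
}
```

Because `box_sum` costs the same four lookups for any box size, a blur whose radius varies per pixel (as in a depth-of-field effect) can average over large regions at constant cost per pixel.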
To get help using CUDPP, please use the CUDPP Google Group.
To report CUDPP bugs or request features, you may either use the CUDPP Google Group above or file an issue directly using Google Code.
For specific release details, see the Change Log.
This release (1.1) has been thoroughly tested on the following operating systems:
- Windows XP (32-bit) (CUDA 2.2)
- Windows Vista (32-bit) (CUDA 2.2)
- Redhat Enterprise Linux 5 (64-bit) (CUDA 2.2)
- Ubuntu Linux 8.04 (32-bit and 64-bit) (CUDA 2.2)
- Mac OS X 10.5.7 (Leopard, 32-bit) (CUDA 2.2)
We expect CUDPP to build and run correctly on other flavors of Linux, but these are not actively tested by the developers at this time.
Note: CUDPP is not compatible with CUDA 2.1. A bug in the CUDA 2.1 compiler causes it to crash.
CUDPP is implemented in C for CUDA. It requires the CUDA Toolkit version 2.2 or later. Please see the NVIDIA CUDA homepage to download CUDA as well as the CUDA Programming Guide and CUDA SDK, which includes many CUDA code examples. Two of the samples in the CUDA SDK ("marchingCubes" and "lineOfSight") also use CUDPP.
Design goals for CUDPP include:
- Performance. We aim to provide best-in-class performance for our primitives. We welcome suggestions and contributions that will improve CUDPP performance. We also want to provide primitives that can be easily benchmarked and compared against other implementations on GPUs and other processors.
- Modularity. We want our primitives to be easily included in other applications. To that end we have made the following design decisions:
  - CUDPP is provided as a library that can link against other applications.
  - CUDPP calls run on the GPU on GPU data. Thus they can be used as standalone calls on the GPU (on GPU data initialized by the calling application) and, more importantly, as GPU components in larger CPU/GPU applications.
  - CUDPP is implemented as four layers:
    - The Public Interface is the external library interface, which is the intended entry point for most applications. The public interface calls into the Application-Level API.
    - The Application-Level API comprises functions callable from CPU code. These functions execute code jointly on the CPU (host) and the GPU by calling into the Kernel-Level API below them.
    - The Kernel-Level API comprises functions that run entirely on the GPU across an entire grid of thread blocks. These functions may call into the CTA-Level API below them.
    - The CTA-Level API comprises functions that run entirely on the GPU within a single Cooperative Thread Array (CTA, aka thread block). These are low-level functions that implement core data-parallel algorithms, typically by processing data within shared (CUDA __shared__) memory.
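As a rough analogy for how the lower layers divide the work, the plain-C sketch below splits an exclusive scan into a per-block routine (standing in for the CTA level), routines that sweep over all blocks (standing in for the kernel level), and a host-side driver (standing in for the application level). This is a serial CPU illustration of the general multi-block scan structure, not CUDPP source code; every name here, including `BLOCK_SIZE`, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 4  /* hypothetical; chosen small so the example is easy to trace */

/* "CTA level": in-place exclusive scan of one block. Returns the block total. */
int scan_block(int *data, size_t count)
{
    int running = 0;
    for (size_t i = 0; i < count; ++i) {
        int value = data[i];
        data[i] = running;
        running += value;
    }
    return running;
}

/* "Kernel level": scan every block independently and record each total. */
void scan_all_blocks(int *data, size_t n, int *block_sums)
{
    for (size_t start = 0, b = 0; start < n; start += BLOCK_SIZE, ++b) {
        size_t count = (n - start < BLOCK_SIZE) ? n - start : BLOCK_SIZE;
        block_sums[b] = scan_block(data + start, count);
    }
}

/* "Kernel level": add each block's starting offset to all of its elements. */
void add_block_offsets(int *data, size_t n, const int *block_offsets)
{
    for (size_t i = 0; i < n; ++i) {
        data[i] += block_offsets[i / BLOCK_SIZE];
    }
}

/* "Application level": host-side driver that sequences the passes.
 * block_sums is scratch space with one int per block. */
void blocked_exclusive_scan(int *data, size_t n, int *block_sums)
{
    assert(n == 0 || (data != NULL && block_sums != NULL));
    size_t num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scan_all_blocks(data, n, block_sums);
    /* Scanning the block totals turns them into per-block offsets. This
     * serial sketch does it in one call; a GPU version would typically
     * repeat the blocked scheme when there are many block totals. */
    scan_block(block_sums, num_blocks);
    add_block_offsets(data, n, block_sums);
}
```

The split mirrors the layering described above: the per-block routine touches only its own block's data, while the driver decides how many blocks there are and in which order the passes run.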
Programmers may use any of the lower three CUDPP layers in their own programs by building the source directly into their application. However, the typical usage of CUDPP is to link to the library and invoke functions in the CUDPP Public Interface, as in the simpleCUDPP, satGL, and cudpp_testrig application examples included in the CUDPP distribution.
In the future, if and when CUDA supports building device-level libraries, we hope to enhance CUDPP to ease the use of CUDPP internal algorithms at all levels.
We expect the normal use of CUDPP will be in one of two ways:
- Linking the CUDPP library against another application.
- Running our "test" application, cudpp_testrig, that exercises CUDPP functionality.
The following publications describe work incorporated in CUDPP.
- Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel Prefix Sum (Scan) with CUDA". In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007. http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=916
- Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. "Scan Primitives for GPU Computing". In Graphics Hardware 2007, pages 97–106, August 2007. http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=915
- Shubhabrata Sengupta, Mark Harris, and Michael Garland. "Efficient parallel scan algorithms for GPUs". NVIDIA Technical Report NVR-2008-003, December 2008. http://mgarland.org/papers.html#segscan-tr
- Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. http://mgarland.org/papers.html#gpusort
- Stanley Tzeng and Li-Yi Wei. "Parallel White Noise Generation on a GPU via Cryptographic Hash". In Proc. 2008 Symposium on Interactive 3D Graphics and Games, pages 79–87. http://research.microsoft.com/apps/pubs/default.aspx?id=70502
Many researchers are using CUDPP in their work, and there are many publications that have used it (references). If your work uses CUDPP, please let us know by sending us a BibTeX reference to your work.
If you make use of CUDPP primitives in your work and want to cite CUDPP (thanks!), we would prefer that you cite the appropriate papers above, since they form the core of CUDPP. To be more specific, the GPU Gems paper describes (unsegmented) scan and multi-scan for summed-area tables. The NVIDIA technical report describes the current scan and segmented scan algorithms used in the library, and the Graphics Hardware paper describes an earlier implementation of segmented scan, quicksort, and sparse matrix-vector multiply. The IPDPS paper describes the radix sort used in CUDPP, and the I3D paper describes the random number generation algorithm.
- Mark Harris, NVIDIA Corporation
- John D. Owens, University of California, Davis
- Shubho Sengupta, University of California, Davis
- Stanley Tzeng, University of California, Davis
- Yao Zhang, University of California, Davis
- Andrew Davidson, University of California, Davis (formerly Louisiana State University)
Thanks to Jim Ahrens, Timo Aila, Ian Buck, Guy Blelloch, Jeff Bolz, Michael Garland, Jeff Inman, Eric Lengyel, Samuli Laine, David Luebke, Pat McCormick, and Richard Vuduc for their contributions during the development of this library.
CUDPP Developers from UC Davis thank their funding agencies:
- Department of Energy Early Career Principal Investigator Award DE-FG02-04ER25609
- SciDAC Institute for Ultrascale Visualization (http://www.iusv.org/)
- Los Alamos National Laboratory
- National Science Foundation (grant 0541448)
- Generous hardware donations from NVIDIA
CUDPP is copyright The Regents of the University of California, Davis campus and NVIDIA Corporation. The library, examples, and all source code are released under the BSD license, designed to encourage reuse of this software in other projects, both commercial and non-commercial. For details, please see the CUDPP License page.
Note that prior to release 1.1 of CUDPP, the license used was a modified BSD license. With release 1.1, this license was replaced with the pure BSD license to facilitate the use of open source hosting of the code.