You are here: Home » Archives for Data-Parallel
April 29th, 2010
The CUDA Data Parallel Primitives Library (CUDPP) is a cross-platform, open-source library of data-parallel algorithm primitives such as parallel prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. CUDPP runs on processors that support CUDA.
CUDPP release 1.1.1 is a bugfix release with fixes for scan, segmented scan, stream compaction, and radix sort on the NVIDIA Fermi (sm_20) architecture, including GeForce 400 series and Tesla 20 series GPUs. It also includes improvements and bugfixes for radix sorts on 64-bit OSes, and fixes for 64-bit builds on MS Windows OSes and Apple OS X 10.6 (Snow Leopard). Change Log.
Posted in Developer Resources, Research | Tags: Data-Parallel, NVIDIA CUDA, Sorting | Write a comment
February 11th, 2010
The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey. This survey will help the CUDPP Team prioritize new development and support for existing and new features.
Posted in Developer Resources | Tags: Data-Parallel, Libraries, NVIDIA CUDA, Parallel Algorithms, Surveys | Write a comment
January 17th, 2010
Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted on GPGPU.org in a timely manner, please remember to use the news submission form.
- Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/j.jcp.2008.01.047).
- Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware /compiler designer, and analyze CUDA as one current representative (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
- Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
- William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008, DOI 10.1145/1365490.1365501).
- BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM Siggraph 2008, August 2008, DOI 10.1145/1399504.1360618).
- Finally, Che et al. and Garland et al. survey the field of GPU computing and discuss many different application domains. These articles are, in addition to the ones we have collected on the developer pages, recommended to GPGPU newcomers.
Posted in Research, Site News | Tags: Computer Architecture, Data-Parallel, Molecular Dynamics, NVIDIA CUDA, Papers, Programming Languages | Write a comment
September 11th, 2009
Thrust (v1.1) is an open-source template library for developing CUDA applications. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:
To get started with Thrust, first download Thrust and then follow the online tutorial. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available. As the following code example shows, Thrust programs are concise and readable. Read the rest of this entry »
Posted in Developer Resources | Tags: Data-Parallel, Libraries, NVIDIA CUDA, Open Source, Parallel Algorithms, Sorting | Write a comment
August 23rd, 2009
Intel has acquired RapidMind, the company behind the RapidMind (formerly Sh) programming environment targeting multicore CPUs, AMD and NVIDIA GPUs and the Cell processor. The RapidMind Platform continues to be available, including support. In the medium term RapidMind’s technology and products will be integrated with Intel’s data-parallel products, in particular Intel’s Ct technology.
This blog entry by James Reinders from Intel describes the acquisition and future plans in more detail.
Posted in Business, Developer Resources | Tags: APIs, Data-Parallel, Programming Languages | Write a comment
July 1st, 2009
Release 1.1 of the CUDA Data-Parallel Primitives Library (CUDPP) is now available for download. The two major new features in CUDPP 1.1 are a very fast new radix sort implementation with support for sorting key-value pairs (with float or unsigned integer keys); and a new pseudorandom number generator, cudppRand. CUDPP 1.1 also replaces its former custom license with the standard BSD license. This greatly simplifies the CUDPP license details, and it also enables CUDPP to move into a public source repository such as Google Code in the near future. For more information, visit the CUDPP Website.
Posted in Developer Resources, Research | Tags: CUDPP, Data-Parallel, Libraries, NVIDIA CUDA, Open Source | Write a comment
June 24th, 2009
This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs. This paper describes the primitives included in the latest release of the CUDPP library.
Abstract:
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)
Posted in Research | Tags: CUDPP, Data-Parallel, Libraries, NVIDIA CUDA, Papers, Parallel Algorithms | Write a comment
April 28th, 2009
Abstract from the paper by Rehman et al.:
General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations of List Ranking—a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. In our best implementation, we introduce a GPU-optimized, recursive version of the Helman-JaJa algorithm. Our implementation can rank a random list of 8 million elements in just over 100 milliseconds, and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss some practical issues that come to the fore when working with massively multi-threaded architectures, especially for algorithms with highly irregular memory access patterns. (M. Suhail Rehman, K. Kothapalli, P.J. Narayanan. Fast and Scalable List Ranking on the GPU. 23rd International Conference on Supercomputing (ICS). New York, USA, June 2009. (To Appear))
Posted in Research | Tags: Data-Parallel, List Ranking, Papers, Parallel Algorithms | Write a comment
March 1st, 2009
This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. To appear.)
Posted in Research | Tags: Data-Parallel, NVIDIA CUDA, Parallel Algorithms, Sorting | Write a comment
April 20th, 2008
Version 1.0 alpha of CUDPP, the CUDA Data-Parallel Algorithms Library, has been released. This version adds the segmented scan algorithm and sparse matrix-vector multiplication to CUDPP’s repertoire. Other new features include an improved “plan”-based configuration interface, an improved scan algorithm for higher performance, support for more inclusive scans and more scan operators, an improved stream compaction interface. In addition, CUDPP 1.0a adds support for CUDA 2.0 and the Windows Vista and Mac OS X (10.5.2 and higher) operating systems. CUDPP works with NVIDIA CUDA versions 1.1 and higher.
Posted in Developer Resources | Tags: Data-Parallel, Linear Algebra, NVIDIA CUDA | Write a comment