Release 1.1 of the CUDA Data-Parallel Primitives Library (CUDPP) is now available for download. The two major new features in CUDPP 1.1 are a very fast new radix sort implementation with support for sorting key-value pairs (with float or unsigned integer keys), and a new pseudorandom number generator, cudppRand. CUDPP 1.1 also replaces its former custom license with the standard BSD license. This greatly simplifies CUDPP's licensing and enables the project to move into a public source repository such as Google Code in the near future. For more information, visit the CUDPP website.
This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA’s GPGPU system, CUDA, into existing C++ applications. CuPP provides interfaces to recurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures. CuPP offers both a low-level interface — mostly consisting of smart pointers and memory allocation functions for GPU memory — and a high-level interface offering a C++ STL vector wrapper and the so-called type transformations. The wrapper can be used by both device and host to automatically keep data in sync. The type transformations allow developers to write their own data structures offering the same functionality as the CuPP vector, in case a vector does not meet the needs of the application. Furthermore, the type transformations offer a way to have two different representations for the same data at host and device, respectively. We demonstrate the benefits of using CuPP by integrating it into an example application, the open-source steering library OpenSteer. In particular, for this application we develop a uniform grid data structure that deploys the type transformations to solve the k-nearest-neighbor problem. The paper finishes with a brief outline of another CUDA application, the Einstein@Home client, which also requires data structure redesign and thus may benefit from the type transformations and future work on CuPP.
(Jens Breitbart: CuPP – A framework for easy CUDA integration, HiPS 2009 workshop with IPDPS 2009, Rome, Italy, May 2009)
A tutorial on High Performance Computing with CUDA was held at the International Conference on Supercomputing in Hamburg on Monday, June 22nd 2009. The tutorial included an introduction to the CUDA programming model and C for CUDA, along with details on the CUDA Toolkit, Libraries, and optimization. The tutorial also provided an introduction to OpenCL, and finished with a case study on Computational Fluid Dynamics by Dr. Graham Pullan from Cambridge University. Slides from the tutorial are now posted here on GPGPU.org.
(Massimiliano Fatica, Timo Stich, and Graham Pullan. High Performance Computing with CUDA. Tutorial. International Conference on Supercomputing 2009. Hamburg, Germany.)
This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs. This paper describes the primitives included in the latest release of the CUDPP library.
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
(S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, December 2008)
Yesterday the Portland Group and NVIDIA announced plans to develop new Fortran language support for CUDA GPUs. The pair will release the Fortran language specification for CUDA GPUs at the International Conference on Supercomputing in Hamburg, Germany this week. The CUDA Fortran compiler will be added to a production release of the PGI Fortran compilers scheduled for availability in November 2009.
From the press release:
The Portland Group®, a wholly-owned subsidiary of STMicroelectronics and leading supplier of compilers for high-performance computing (HPC), today announced an agreement with NVIDIA under which the two companies plan to develop new Fortran language support for CUDA GPUs.
The NVIDIA® CUDA™ architecture allows developers to offload computationally intensive kernels to the massively parallel GPU. Through function calls and language extensions, CUDA gives developers explicit control over the mapping of general-purpose computational kernels to GPUs as well as placement and movement of data between the x64 processor and the GPU. The NVIDIA CUDA C compiler already provides this capability to C programmers. The CUDA Fortran compiler will provide this same level of control and optimization in a native Fortran environment from PGI.
Yesterday The Portland Group announced the release of version 9.0 of its Fortran and C compilers with support for GPUs and x64 multi-core CPUs. An introduction to PGI Accelerator Fortran and C programming is available online, as is the PGI Accelerator v1.0 specification. Evaluation copies of the new PGI 9.0 compilers are available from The Portland Group web site. Registration is required.
From the press release:
The use of Graphics Processing Units (GPUs) as general purpose accelerators has been a growing trend in high-performance computing (HPC). Until now, use of GPUs from Fortran applications has been extremely limited. Developers targeting GPU accelerators have had to program in C at a detailed level using sequences of function calls to manage movement of data between the x64 host and GPU, and to offload computations from the host to the GPU. The PGI Accelerator Fortran and C compilers automatically analyze whole program structure and data, split portions of an application between a multi-core x64 CPU and a GPU as specified by user directives, and define and generate a mapping of loops to automatically use the parallel cores, hardware threading capabilities and SIMD vector capabilities of modern GPUs.
The paper Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster by Rached Abdelkhalek, Henri Calandra, Olivier Coulaud, Jean Roman and Guillaume Latu has earned the Best Paper Award at High Performance Computing and Simulation 2009, held June 21-24 in Leipzig, Germany.
This paper was presented in the Workshop on Architecture-Aware Simulation and Computing, organized by Michael Bader and Josef Weidendorfer (Technische Universität München). Three other GPGPU papers were part of this workshop:
- GPU Acceleration of an Unmodified Parallel Finite Element Navier–Stokes Solver by Dominik Göddeke, Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek. This contribution also received a Best Paper Award nomination.
- Comparing CUDA and OpenGL Implementations for a Jacobi Iteration by Ronan Amorim, Gundolf Haase, Manfred Liebmann and Rodrigo Weber dos Santos
- Data Structure Design for GPU Based Heterogeneous Systems by Jens Breitbart
CUDA GPU memtest is a memory test utility for NVIDIA GPU memory that uses well-established patterns from memtest86/memtest86+ as well as additional stress tests. The tests are designed to find hard and soft memory errors.
CUDA GPU memtest is available via anonymous SVN from SourceForge and is developed by Guochun Shi and Jeremy Enos.
R is a popular open source environment for statistical computing, widely used in many application domains. The ongoing R+GPU project is devoted to moving frequently used R functions, mostly functions used in biomedical research, to the GPU using CUDA. If a CUDA-compatible GPU and driver are present on a user’s machine, the user need only prefix “gpu” to the original function name to take advantage of the GPU implementation of the corresponding R function.
Speedup measurements of the current implementation range as high as 80x, and contributions to the code base are cordially invited. R+GPU is developed at the University of Michigan’s Molecular and Behavioral Neuroscience Institute.
NVIDIA is offering a series of free GPU computing webinars covering a range of topics from a basic introduction to the CUDA architecture to advanced topics such as data structure optimization and multi-GPU usage.
There are several webinars scheduled already; attendees are encouraged to pick the date and time that best suit their schedules. Visit the NVIDIA GPU Computing Online Seminars webpage for webinar registration and further information. Additional webinars will be scheduled throughout the next few months, so check for future alerts and visit the NVIDIA online seminar schedule page often.