CUDA is a parallel computing architecture and programming model developed by NVIDIA. The CUDA architecture includes an assembly language (PTX) and compilation technology that form the basis on which multiple parallel language and API interfaces are built on NVIDIA GPUs, including C (and C++) for CUDA, OpenCL, Fortran, and DirectX Compute Shaders. C for CUDA extends the standard C language and exposes hardware features that are not available through traditional OpenGL or Direct3D (other than Compute Shaders). The most important of these features are shared memory, which can greatly improve the performance of bandwidth-limited applications; double-precision floating-point arithmetic; and an arbitrary load/store memory model, which enables many algorithms that were previously difficult or impossible to implement on the GPU.
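To give a flavor of the shared memory feature mentioned above, here is a minimal illustrative kernel (written for this article, not taken from NVIDIA's samples) in which each thread block stages a tile of data in fast on-chip shared memory before writing it back in reversed order — a data movement pattern that would be awkward in a traditional shader model:

```cuda
// Illustrative sketch: reverse each block-sized tile of an array using
// shared memory as a staging area. Assumes n is a multiple of blockDim.x.
__global__ void reverseTile(float *d_out, const float *d_in, int n)
{
    extern __shared__ float tile[];      // dynamically sized shared memory
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    if (i < n)
        tile[t] = d_in[i];               // cooperative load into shared memory
    __syncthreads();                     // wait until the whole tile is loaded
    if (i < n)
        d_out[i] = tile[blockDim.x - 1 - t];  // read back in reversed order
}
```

The kernel would be launched with the shared memory size as the third launch parameter, e.g. `reverseTile<<<blocks, threads, threads * sizeof(float)>>>(d_out, d_in, n);`. Because the reversed read comes from shared memory rather than global memory, all global loads and stores remain contiguous.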

NVIDIA’s CUDA Zone provides a wealth of information on CUDA, including all of the following.

The CUDA Toolkit includes the CUBLAS and CUFFT libraries, which are optimized implementations of BLAS (Basic Linear Algebra Subprograms) and Fast Fourier Transforms for NVIDIA GPUs, implemented in C for CUDA.  See also CUDPP, below.
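As a sketch of how CUBLAS is used, the following host program scales and accumulates a vector entirely on the GPU via the library's SAXPY routine (this follows the original CUBLAS C interface shipped with the early toolkits; details such as error checking are omitted):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>   // legacy CUBLAS interface from the early CUDA toolkits

int main(void)
{
    const int n = 1024;
    float *h_x = (float*)malloc(n * sizeof(float));
    float *h_y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    cublasInit();
    float *d_x, *d_y;
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);   // y = 2*x + y, computed on the GPU

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // device -> host
    printf("y[0] = %f\n", h_y[0]);

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    free(h_x);
    free(h_y);
    return 0;
}
```

Note that no kernel code is written at all: the library supplies the GPU implementation, and the host program only manages data placement.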

Further Reading

In addition to the material available at CUDA ZONE, the following articles are highly recommended:

  • The book GPU Gems 3, edited by Hubert Nguyen, contains many chapters dedicated to CUDA and parallel programming techniques. It is now freely available on NVIDIA’s Developer pages.
  • The March/April 2008 edition of ACM Queue contains four highly recommended articles on GPU architecture, scalable parallel programming with CUDA, and parallel programming techniques.
  • Erik Lindholm, John Nickolls, Stuart Oberman and John Montrym: NVIDIA Tesla: A unified graphics and computing architecture, IEEE Micro, 28(2), 39–55, March 2008
  • This article by Tom R. Halfhill in the January 28, 2008 issue of Microprocessor Report discusses parallel computing with massive multiprocessing on GPUs using NVIDIA CUDA.


The NVIDIA CUDA Developer SDK contains many examples with source code to help you get started with CUDA. Examples include:

  • Parallel bitonic sort
  • Matrix multiplication
  • Matrix transpose
  • Performance profiling using timers
  • Parallel prefix sum (scan) of large arrays
  • Image convolution
  • 1D DWT using Haar wavelet
  • OpenGL and Direct3D graphics interoperation examples
  • CUDA BLAS and FFT library usage examples
  • CPU-GPU C- and C++-code integration
  • Binomial Option Pricing
  • Black-Scholes Option Pricing
  • Monte-Carlo Option Pricing
  • Parallel Mersenne Twister (random number generation)
  • Parallel Histogram
  • Image Denoising
  • Sobel Edge Detection Filter
  • MathWorks MATLAB Plug-in

Minimalistic CUDA Tutorial

Because the CUDA SDK samples share an infrastructure that simplifies common tasks, the sample code can be somewhat daunting for beginners searching for a minimalistic, self-contained example.  Therefore, GPGPU.org provides a single-file version of the first example presented in the ACM Queue article by John Nickolls, Ian Buck, Michael Garland and Kevin Skadron, a scaled vector-vector addition.  This example was written by Dominik Göddeke.

Download the Simple CUDA Tutorial
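For orientation, a scaled vector-vector addition (y = alpha*x + y, i.e. SAXPY) in C for CUDA looks roughly like the following single-file sketch. Names and configuration here are illustrative; they do not reproduce the tutorial code itself:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread computes one element of y = alpha * x + y.
__global__ void saxpy(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                         // guard against the partial last block
        y[i] = alpha * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_x = (float*)malloc(bytes);
    float *h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;  // round up
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);     // 2*1 + 2 = 4

    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return 0;
}
```

The structure — allocate on the device, copy in, launch a grid of thread blocks, copy out — is the same skeleton used by nearly every CUDA program.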

Parallel Reductions

Reductions are commonly used in parallel computing to compute scalar quantities from vector data, such as the maximum, minimum, dot product, and norms. Tutorial code for this building block is available in the CUDA SDK application “reduction”, and these slides by Mark Harris provide thorough documentation and explanation of the underlying ideas. Note that a copy of the slides is also provided in the projects/reduction/doc directory of the CUDA SDK code.
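The core idea is a tree-shaped reduction in shared memory: each block collapses its portion of the input to a single partial result, and the partial results are then reduced in a second pass (or on the CPU). The simplified sketch below illustrates the principle; the SDK kernels discussed in the slides apply several further optimizations (sequential addressing, loop unrolling, multiple elements per thread):

```cuda
// Each block reduces blockDim.x input elements to one partial sum.
// Assumes blockDim.x is a power of two; launch with
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void reduceSum(const float *d_in, float *d_out, int n)
{
    extern __shared__ float sdata[];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    sdata[t] = (i < n) ? d_in[i] : 0.0f;   // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            sdata[t] += sdata[t + s];
        __syncthreads();
    }
    if (t == 0)
        d_out[blockIdx.x] = sdata[0];      // one partial sum per block
}
```

Swapping `+=` for `max`/`min` operations, or multiplying pairwise inputs before the loop, yields maximum, minimum, and dot-product reductions from the same skeleton.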


CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum (“scan”), parallel sort, and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.
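To illustrate what the scan primitive computes, here is a deliberately naive per-block inclusive prefix sum in the Hillis/Steele style. This is only a conceptual sketch: CUDPP's actual implementation is a work-efficient, multi-block algorithm with shared-memory bank-conflict avoidance:

```cuda
// Illustrative inclusive scan of up to blockDim.x elements within one block.
// After the kernel, data[i] holds the sum of data[0..i].
// Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void inclusiveScanBlock(float *data, int n)
{
    extern __shared__ float tmp[];
    int t = threadIdx.x;
    tmp[t] = (t < n) ? data[t] : 0.0f;
    __syncthreads();

    // Each pass adds in the element 'offset' positions to the left.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        float v = (t >= offset) ? tmp[t - offset] : 0.0f;  // read phase
        __syncthreads();
        tmp[t] += v;                                       // write phase
        __syncthreads();
    }
    if (t < n)
        data[t] = tmp[t];
}
```

Scan is what makes operations like stream compaction possible on the GPU: scanning a 0/1 keep-flag array yields, for each surviving element, its output position in the compacted array.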