Thrust: A Productivity-Oriented Library for CUDA

September 12th, 2011

Abstract:

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

(Nathan Bell and Jared Hoberock: “Thrust: A Productivity-Oriented Library for CUDA”, GPU Computing Gems, Jade Edition, edited by Wen-mei W. Hwu, October 2011)
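
Since the abstract notes that Thrust is modeled on the C++ Standard Template Library, the programming style can be illustrated with the host-side STL alone. The sketch below is not Thrust code: it uses only `std::transform` and `std::accumulate`, and the helper names `saxpy` and `sum` are made up for illustration. Thrust offers analogous algorithms (`thrust::transform`, `thrust::reduce`) that operate on containers such as `thrust::device_vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: computes y <- a*x + y element-wise with std::transform.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}

// Hypothetical helper: sums all elements with std::accumulate.
float sum(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}
```

Calling `saxpy(2.0f, x, y)` and then `sum(y)` expresses the whole computation as two algorithm calls, with no explicit loops in user code. That is the style the chapter describes.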

Non-negative least squares on GPU/multicore architectures

September 4th, 2011

Abstract:

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and downdate the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.

(Yuancheng Luo and Ramani Duraiswami, “Efficient Parallel Non-Negative Least Squares on Multicore Architectures”, SIAM Journal on Scientific Computing, accepted, Sep. 2011.)

GPU Implementation of a Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method

September 2nd, 2011

Abstract:

A Helmholtz equation in two dimensions discretized by a second order finite difference scheme is considered. Krylov methods such as Bi-CGSTAB and IDR(s) have been chosen as solvers. Since the convergence of the Krylov solvers deteriorates with increasing wave number, a shifted Laplace multigrid preconditioner is used to improve the convergence. The implementation of the preconditioned solver on CPU (Central Processing Unit) is compared to an implementation on GPU (Graphics Processing Units or graphics card) using CUDA (Compute Unified Device Architecture). The results show that preconditioned Bi-CGSTAB on GPU as well as preconditioned IDR(s) on GPU is about 30 times faster than on CPU for the same stopping criterion.

(H. Knibbe, C.W. Oosterlee and C. Vuik, “GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method”, accepted for publication in the Journal of Computational and Applied Mathematics, 2011.)
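
To make the discretization concrete, here is a minimal sketch of applying a second-order (5-point) finite-difference operator of the form -Δu - s·k²·u on a square grid. Several things are assumptions rather than details from the paper: the sign convention, the zero Dirichlet boundary, the row-major layout, and the function name `apply_shifted_helmholtz`. With s = 1 this is a Helmholtz operator; a complex s gives a shifted Laplace operator of the kind used as a preconditioner.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Applies the 5-point stencil for  -Laplace(u) - shift * k^2 * u  on an
// n-by-n grid of interior points with spacing h. Values outside the grid
// are treated as zero (an assumed Dirichlet boundary condition).
std::vector<cplx> apply_shifted_helmholtz(const std::vector<cplx>& u,
                                          std::size_t n, double h, double k,
                                          cplx shift) {
    assert(u.size() == n * n);
    const long m = static_cast<long>(n);
    // Bounds-checked accessor: returns 0 outside the grid.
    auto at = [&](long i, long j) -> cplx {
        if (i < 0 || j < 0 || i >= m || j >= m) return cplx(0.0, 0.0);
        return u[static_cast<std::size_t>(i * m + j)];
    };
    std::vector<cplx> out(n * n);
    for (long i = 0; i < m; ++i) {
        for (long j = 0; j < m; ++j) {
            // Second-order approximation of the negative Laplacian.
            const cplx neg_lap =
                (4.0 * at(i, j) - at(i - 1, j) - at(i + 1, j)
                 - at(i, j - 1) - at(i, j + 1)) / (h * h);
            out[static_cast<std::size_t>(i * m + j)] =
                neg_lap - shift * (k * k) * at(i, j);
        }
    }
    return out;
}
```

Every output point depends only on its four neighbours, so each grid point can be computed independently. This is why stencil operators of this kind map well onto GPU threads.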

Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units

August 20th, 2011

Abstract:

Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton’s equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two different case studies are included to demonstrate the performance attained, patchy spheres and tethered nanorods. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously even possible on large clusters.

(Trung Dac Nguyen, Carolyn L. Phillips, Joshua A. Anderson, and Sharon C. Glotzer: “Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units”, Computer Physics Communications 182(11):2307–2313, November 2011.)

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

August 19th, 2011

Abstract:

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision brings up to a 7x speed up compared to the (latest) CUBLAS 4.0 NVIDIA library on the GTX 280 GPU. Our SYMV kernel tuned for Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

(R. Nath, S. Tomov, T. Dong, and J. Dongarra, “Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs”, accepted for SC’11.)
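
For readers unfamiliar with SYMV, the following host-side sketch shows the basic operation y = A·x when only the upper triangle of a symmetric matrix is read. It is a plain sequential reference, not the algorithm from the paper, and the function name `symv_upper` and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = A * x for a symmetric n-by-n matrix stored row-major in `a`,
// reading only the diagonal and upper triangle. Entries below the diagonal
// are never touched.
std::vector<double> symv_upper(const std::vector<double>& a,
                               const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(a.size() == n * n);
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a[i * n + i] * x[i];          // diagonal term
        for (std::size_t j = i + 1; j < n; ++j) {
            const double aij = a[i * n + j];
            y[i] += aij * x[j];               // contribution of A(i,j)
            y[j] += aij * x[i];               // contribution of A(j,i) = A(i,j)
        }
    }
    return y;
}
```

Each stored element is used twice, which halves the memory traffic for the matrix. The price is the scattered update to `y[j]`, an example of the irregular access pattern the abstract describes as hard to optimize on GPUs.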

CUDPP 2.0: parallel hash tables, tridiagonal solver, parallel reductions, and double precision

August 8th, 2011

CUDPP release 2.0 is a major new release of the CUDA Data-Parallel Primitives Library, with exciting new features. The public interface has undergone a minor redesign to provide thread safety. Parallel reductions (cudppReduce) and a tridiagonal system solver (cudppTridiagonal) have been added, and a new component library, cudpp_hash, provides fast data-parallel hash table functionality. In addition, support for 64-bit data types (double as well as long long and unsigned long long) has been added to all CUDPP algorithms, and a variety of bugs have been fixed.  For a complete list of changes, see the change log. CUDPP 2.0 is available for download now.
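
As background on what a parallel reduction does, the sketch below simulates the usual tree-shaped pattern sequentially on the host. It does not use the CUDPP API; `tree_reduce` is a hypothetical name, and on a GPU the iterations of the inner loop would run as concurrent threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the elements of `data` in a tree pattern: at each level, element i
// absorbs the partial sum held `stride` positions to its right. The inner
// loop iterations are independent of one another, which is what allows a
// data-parallel implementation.
template <typename T>
T tree_reduce(std::vector<T> data) {
    assert(!data.empty());
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < data.size(); i += 2 * stride) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

Because the function is a template, the same code works for 64-bit types such as `double` and `long long`, the kinds of types the release notes mention.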

Solving ordinary differential equations with CUDA

August 8th, 2011

Odeint is a high-level C++ library for solving ordinary differential equations. It is released under an open-source license and supports a variety of different methods for solving ODEs. As a special feature it supports different algebras which perform the basic mathematical operations. This allows the user to solve ordinary differential equations on modern graphics cards. A Thrust interface is implemented, so that the power of CUDA can easily be employed. Furthermore, arbitrary precision types can easily be supported.
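
To show the kind of method such a library provides, here is a hand-written classical Runge-Kutta (RK4) step in plain C++. It does not use the odeint API. The names `state_type` and `rk4_step` and the `sys(x, dxdt, t)` calling convention are assumptions chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using state_type = std::vector<double>;

// Advances x by one classical RK4 step of size dt for the system
// dx/dt = f(x, t), where sys(x, dxdt, t) writes f(x, t) into dxdt.
template <typename System>
void rk4_step(System&& sys, state_type& x, double t, double dt) {
    assert(!x.empty());
    const std::size_t n = x.size();
    state_type k1(n), k2(n), k3(n), k4(n), tmp(n);

    sys(x, k1, t);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k1[i];
    sys(tmp, k2, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k2[i];
    sys(tmp, k3, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + dt * k3[i];
    sys(tmp, k4, t + dt);

    // Weighted average of the four slope estimates.
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    }
}
```

The element-wise loops are the "basic mathematical operations" that an algebra abstraction would factor out. Swapping them for device-side equivalents is, in spirit, how the same stepper can run on a GPU.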

CentiLeo: interactive out-of-core GPU/CUDA ray tracer

August 4th, 2011

Implementing flexible software solutions, such as rendering and ray tracing, is still challenging for GPU programs. The amount of available memory on modern GPUs is relatively small. Scenes for feature film rendering and visualization have large geometric complexity and can easily contain millions of polygons and a large number of texture maps and other data attributes. CentiLeo presents an interactive out-of-core ray tracing engine running on a single desktop GPU. The system is built around a virtual memory manager. A novel ray intersection algorithm built around an acceleration structure, cached on the GPU, loads needed data on-demand using page swapping. The ray tracing engine is used to implement a variety of rendering and light transport algorithms. The system is implemented using CUDA and runs on a single NVIDIA GTX 480.
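
The idea of on-demand loading with page swapping can be illustrated with a small fixed-capacity page cache. This is only a host-side sketch: the post does not say which replacement policy CentiLeo uses, so the least-recently-used (LRU) eviction here is an assumption, as are the class name `PageCache` and its interface.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Tracks which pages are resident in a cache holding at most `capacity`
// pages, evicting the least recently used page when a new one is needed.
class PageCache {
public:
    explicit PageCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Touches a page. Returns true on a hit; returns false when the page
    // was not resident and had to be brought in (possibly evicting another).
    bool access(int page_id) {
        auto it = index_.find(page_id);
        if (it != index_.end()) {
            // Hit: move the page to the most-recently-used position.
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == capacity_) {
            // Cache full: evict the least recently used page.
            index_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(page_id);
        index_[page_id] = lru_.begin();
        return false;
    }

    bool resident(int page_id) const { return index_.count(page_id) != 0; }

private:
    std::size_t capacity_;
    std::list<int> lru_;  // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> index_;
};
```

In a real out-of-core renderer a miss would trigger a transfer of the page's data from host memory to the GPU; here the cache only records residency.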


GPU.NET v2.0 released

July 29th, 2011

TidePowerd has released Version 2 of their GPU computing solution for the .NET framework, GPU.NET. Their platform allows developers to quickly and easily write GPU-accelerated applications completely in .NET-based languages. Some key benefits include:

  • Stay in C# and treat kernel methods like any regular method
  • “Boilerplate” GPU programming tasks such as memory transfer and GPU scheduling are abstracted from the developer
  • Cross-platform and cross-hardware with a single binary
  • Systems seamlessly adapt to new hardware without rewriting code
  • Speed on par with native code

New version 2 features:

  • Visual Studio Error list and IntelliSense integration
  • On-device random number generation
  • Double precision support

A free 30-day evaluation license is available, as well as in-depth examples and tutorials.

Jacket v1.8 and LibJacket v1.1 released

July 24th, 2011

Jacket 1.8 and LibJacket 1.1 have been released by AccelerEyes, enabling GPU support for MATLAB and easier CUDA development with C/C++/Fortran and Python. New features include:

  • Expanded support for the Signal Processing, Image Processing, and Statistics Libraries included with both Jacket and LibJacket
  • Faster linear algebra for special systems (e.g. symmetric, positive definite, triangular, etc.)
  • Enhanced visualizations
  • New and updated examples: FDTD, Mandelbrot fractals, maximum-likelihood neural segmentation, MDS for genomics
  • Built with CUDA 4.0 for peak performance

Visit http://www.accelereyes.com/ for details, downloads, whitepapers and tutorials.
