March 9th, 2010
Yellow Dog Enterprise Linux for CUDA (YDEL for CUDA) is an open source Linux operating system built for faster, easier, and more reliable GPU computing. YDEL for CUDA, released and supported by Fixstars, goes beyond the basic Linux OS and integrates support for GPUs, NVIDIA CUDA, and GPU development tools.
From the YDEL for CUDA website:
Key benefits of Yellow Dog Enterprise Linux for CUDA:
- YDEL for CUDA users can experience up to a 9% performance improvement in some applications.
- Comprehensive support is offered to paid subscribers, with a skilled team able to assist you with both Linux and CUDA.
- YDEL's unparalleled integration means everything you need to write and run CUDA applications is included and configured.
- YDEL includes multiple versions of CUDA and lets you switch between them easily via a setting in a configuration file or an environment variable.
- Never worry about updates affecting your system: Fixstars offers YDEL users greater reliability with rigorous test procedures that validate GPU computing functionality and performance.
For more information, visit the YDEL for CUDA website.
March 3rd, 2010
Swan is a small tool that aids the reversible conversion of existing CUDA codebases to OpenCL. Its main features are the translation of CUDA kernel source-code to OpenCL, and a common API that abstracts both CUDA and OpenCL runtimes. Swan preserves the convenience of the CUDA <<< grid, block >>> kernel launch syntax by generating C source-code for kernel entry-point functions. Possible uses include:
- Evaluating OpenCL performance of an existing CUDA code
- Maintaining a dual-target OpenCL and CUDA code
- Reducing dependence on NVCC when compiling host code
- Supporting multiple CUDA compute capabilities in a single binary
Swan is developed by the MultiscaleLab, Barcelona, and is available under the GPL2 license.
March 1st, 2010
We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.
(Dominik Göddeke and Robert Strzodka: “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid”, accepted in: IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, Mar. 2010. Link.)
February 28th, 2010
GMAC (Global Memory for ACcelerators) is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model allows CPU code to access data hosted in accelerator (GPU) memory. In this model, a single pointer is used for data structures accessed by both the CPU and the GPU, and the coherency of the data is transparently handled by the library. Moreover, the data allocated with GMAC can be accessed by all the host threads of the program. That makes your code simpler and cleaner. GMAC currently supports programs written with CUDA, but OpenCL support is planned.
A paper describing the Asymmetric Distributed Shared Memory model and its implementation in GMAC has been accepted in the ASPLOS XV conference. GMAC is being developed by the Operating System Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois. Binary pre-compiled packages, the source code, documentation and examples are available at the project website.
(Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu, “An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems”, accepted in: Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), March 2010.)
February 14th, 2010
We present an efficient method for the simulation of laminar fluid flows with free surfaces including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We show that our code scales well on an MPI-based cluster; that an eightfold speedup can be achieved using modern GPUs in contrast to multithreaded CPU code; and, finally, that it is possible to solve fluid-structure interaction scenarios with high resolution at interactive rates.
(Markus Geveler, Dirk Ribbrock, Dominik Göddeke and Stefan Turek: “Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors”, accepted in: Facing the Multicore Challenge, Heidelberg, Germany, Mar. 2010. Link.)
February 11th, 2010
OpenNL (Open Numerical Library) is a library for solving sparse linear systems, especially designed for the Computer Graphics community. The goal of OpenNL is to be as small as possible, while offering the subset of functionalities required by this application field. The Makefiles of OpenNL can generate a single .c and .h file, making it very easy to integrate into other projects. The distribution includes an implementation of a Least Squares Conformal Maps parameterization method. The new version 3.0 of OpenNL includes support for CUDA (with Concurrent Number Cruncher and CUSP ELL formats).
February 10th, 2010
The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey. This survey will help the CUDPP Team prioritize new development and support for existing and new features.
February 9th, 2010
In large vocabulary continuous speech recognition (LVCSR) the acoustic model computations often account for the largest processing overhead. Our weighted finite state transducer (WFST) based decoding engine can utilize a commodity graphics processing unit (GPU) to perform the acoustic computations to move this burden off the main processor. In this paper we describe our new GPU scheme that can achieve a very substantial improvement in recognition speed whilst incurring no reduction in recognition accuracy. We evaluate the GPU technique on a large vocabulary spontaneous speech recognition task using a set of acoustic models with varying complexity, and the results consistently show that, by using the GPU, it is possible to reduce the recognition time, with the largest improvements occurring in systems with large numbers of Gaussians. For the systems which achieve the best accuracy we obtained speed-ups of between 2.5 and 3 times. The faster decoding times translate to reductions in space, power and hardware costs by only requiring standard hardware that is already widely installed.
(Paul R. Dixon, Tasuku Oonishi, Sadaoki Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition”, Computer Speech & Language, Volume 23, Issue 4, October 2009, Pages 510-526, ISSN 0885-2308, DOI: 10.1016/j.csl.2009.03.005)
February 6th, 2010
The first textbook of its kind, Programming Massively Parallel Processors: A Hands-on Approach launches today, authored by Dr. David B. Kirk, NVIDIA Fellow and former chief scientist, and Dr. Wen-mei Hwu, who serves at the University of Illinois at Urbana-Champaign as Chair of Electrical and Computer Engineering in the Coordinated Science Laboratory, co-director of the Universal Parallel Computing Research Center and principal investigator of the CUDA Center of Excellence. The textbook, which is 256 pages, is the first aimed at teaching advanced students and professionals the basic concepts of parallel programming and GPU architectures. Published by Morgan Kaufmann, it explores various techniques for constructing parallel programs and reviews numerous case studies.
With conventional CPU-based computing no longer scaling in performance and the world’s computational challenges increasing in complexity, the need for massively parallel processing has never been greater. GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors.
Among the book’s key features:
- First and only text that teaches how to program within a massively parallel environment
- Portions of the NVIDIA-provided content have been part of the curriculum at 300 universities worldwide
- Drafts of sections of the book have been tested and taught by Kirk at the University of Illinois
- Book utilizes CUDA C, NVIDIA's parallel computing language developed specifically for massively parallel environments, as well as OpenCL
Programming Massively Parallel Processors: A Hands-on Approach is available to purchase from Amazon or directly from Elsevier.
Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.
(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)