February 10th, 2010
February 9th, 2010
In large vocabulary continuous speech recognition (LVCSR) the acoustic model computations often account for the largest processing overhead. Our weighted finite state transducer (WFST) based decoding engine can utilize a commodity graphics processing unit (GPU) to perform the acoustic computations to move this burden off the main processor. In this paper we describe our new GPU scheme that can achieve a very substantial improvement in recognition speed whilst incurring no reduction in recognition accuracy. We evaluate the GPU technique on a large vocabulary spontaneous speech recognition task using a set of acoustic models with varying complexity and the results consistently show by using the GPU it is possible to reduce the recognition time with largest improvements occurring in systems with large numbers of Gaussians. For the systems which achieve the best accuracy we obtained between 2.5 and 3 times speed-ups. The faster decoding times translate to reductions in space, power and hardware costs by only requiring standard hardware that is already widely installed.
(Paul R. Dixon, Tasuku Oonishi, Sadaoki Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition”, Computer Speech & Language, Volume 23, Issue 4, October 2009, Pages 510-526, ISSN 0885-2308, DOI: 10.1016/j.csl.2009.03.005)
February 6th, 2010
The first textbook of its kind, Programming Massively Parallel Processors: A Hands-on Approach launches today, authored by Dr. David B. Kirk, NVIDIA Fellow and former chief scientist, and Dr. Wen-mei Hwu, who serves at the University of Illinois at Urbana-Champaign as Chair of Electrical and Computer Engineering in the Coordinated Science Laboratory, co-director of the Universal Parallel Computing Research Center and principal investigator of the CUDA Center of Excellence. The textbook, which is 256 pages, is the first aimed at teaching advanced students and professionals the basic concepts of parallel programming and GPU architectures. Published by Morgan-Kauffman, it explores various techniques for constructing parallel programs and reviews numerous case studies.
With conventional CPU-based computing no longer scaling in performance and the world’s computational challenges increasing in complexity, the need for massively parallel processing has never been greater. GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors.
Among the book’s key features:
- First and only text that teaches how to program within a massively parallel environment
- Portions of the NVIDIA-provided content have been part of the curriculum at 300 universities worldwide
- Drafts of sections of the book have been tested and taught by Kirk at the University of Illinois
- Book utilizes OpenCL and CUDA C, the NVIDIA parallel computing language developed specifically for massively parallel environments
Programming Massively Parallel Processors: A Hands-on Approach is available to purchase from Amazon or directly from Elsevier.
February 2nd, 2010
Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.
(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)
February 2nd, 2010
We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI’s libraries, we achieve a two-fold speedup over straight forward C++ code using HONEI’s SSE backend, and additional 3–4 and 4–16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.
(Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock, Dominik Göddeke and Carsten Gutwenger: HONEI: A collection of libraries for numerical computations targeting multiple processor architectures. Computer Physics Communications 180(12), pp. 2534-2543, December 2009. DOI 10.1016/j.cpc.2009.04.018)
January 24th, 2010
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: 10.1109/SASP.2009.5226333)
January 17th, 2010
From the abstract:
We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.
Nitin Arora, Aashay Shringarpure, and Richard Vuduc. “Direct n-body kernels for multicore platforms.” In Proc. Int’l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009 (direct link to PDF).
December 8th, 2009
Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted on GPGPU.org in a timely manner, please remember to use the news submission form.
- Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/j.jcp.2008.01.047).
- Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware /compiler designer, and analyze CUDA as one current representative (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
- Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
- William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008, DOI 10.1145/1365490.1365501).
- BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM Siggraph 2008, August 2008, DOI 10.1145/1399504.1360618).
- Finally, Che et al. and Garland et al. survey the field of GPU computing and discuss many different application domains. These articles are, in addition to the ones we have collected on the developer pages, recommended to GPGPU newcomers.
December 2nd, 2009
This paper presents, to the author’s knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA’s Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available under the GNU General Public License.
The program is freely available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/
(Jani Sainio. “CUDAEASY – a GPU Accelerated Cosmological Lattice Program”. submitted to Computer Physics Communications (under review). November 2009.)
November 30th, 2009
High throughput architectures for HPC seem likely to emphasize many cores with deep multithreading, wide SIMD, and sophisticated memory hierarchies. GPUs present one example, and their high throughput has led a number of researchers to port computationally intensive applications to NVIDIA’s CUDA architecture.
This session explored the art of performance tuning for CUDA using several case studies. Topics included profiling to identify bottlenecks, effective use of the GPU’s memory hierarchy and DRAM interface to maximize bandwidth, data versus task parallelism, and avoiding SIMD divergence. Many of the lessons learned in the context of CUDA are likely to apply to other many-core architectures used in HPC applications.
The presentation slides from the Supercomputing 2009 full-day tutorial “High-Performance Computing with CUDA” are now available at http://gpgpu.org/sc2009.
NVIDIA’s CUDA is a general-purpose architecture for writing highly parallel applications. CUDA provides several key abstractions—a hierarchy of thread blocks, shared memory, and barrier synchronization—for scalable high-performance parallel computing. Scientists throughout industry and academia use CUDA to achieve dramatic speedups on production and research codes. The CUDA architecture supports many languages, programming environments, and libraries including C, Fortran, OpenCL, DirectX Compute, Python, Matlab, FFT, LAPACK, etc.
In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming, motivate its use with many brief examples from different HPC domains, and discuss tools and programming environments. The afternoon will discuss advanced issues such as optimization and sophisticated algorithms/data structures, closing with real-world case studies from domain scientists using CUDA for computational biophysics, fluid dynamics, seismic imaging, and theoretical physics.