Sure, you know GPUs, but have you heard of GPGPUs? The concept is simple: Use the massively parallel architecture of the graphics processor for general-purpose computing tasks. Because of that parallelism, ordinary calculations can be dramatically sped up. To create the Tesla, its powerful new entry into this market, NVIDIA has bundled multiple GPUs (without video connectors!) into either a board or a desk-side box that offers near-supercomputer levels of single-precision floating-point operations. The general-purpose GPU (thus the acronym GPGPU) is being used as a high-performance coprocessor for climate modeling, oil and gas exploration, and other applications, and it’s much cheaper than a supercomputer. The Tesla even comes complete with its own C compiler and tools.
Neoptica, a computer graphics and parallel programming model startup founded by Matt Pharr and Craig Kolb, was acquired by Intel on October 19th. Beyond3D has posted a short writeup about the acquisition. Several of Neoptica’s employees have in the past been involved in GPGPU development.
On November 14th, 2002, Mark Harris created a web page on his personal site at the University of North Carolina to track the nascent research area of general-purpose computation on GPUs, naming it “GPGPU”. A year later that web page became GPGPU.org. GPGPU became an exciting research area, and GPUs are now being used in the “real world” of science, engineering, and business. You can see the original GPGPU web page (November 20, 2002) here, and an early version after it became GPGPU.org (August 6, 2003).
We’d like to thank everyone who has contributed news, forum posts, and other content for GPGPU.org; this site would not exist without you. We encourage everyone to submit any and all GPGPU-related news using the “submit news” link in the sidebar. GPGPU.org depends on user-submitted news for its continued success!
The first part of this paper by Göddeke et al. surveys co-processor approaches for commodity based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity based cluster we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth that is the decisive performance factor in this scenario. Thus, even the addition of low-end, out of date GPUs leads to improvements in both performance- and power-related metrics. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)
This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
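The mixed precision, iterative refinement idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' multigrid code: it assumes a tiny dense system, emulates single precision by rounding through `struct`, and uses plain Gaussian elimination as a stand-in for the low-precision GPU solver. The names `to_f32`, `solve_f32`, and `refine` are hypothetical. The point is only the structure: the residual and the solution update are kept in double precision, while the expensive solve runs in low precision.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_f32(A, b):
    """Stand-in for the low-precision solver: Gaussian elimination with
    partial pivoting, rounding every intermediate result to float32."""
    n = len(b)
    M = [[to_f32(A[i][j]) for j in range(n)] + [to_f32(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = to_f32(M[row][col] / M[col][col])
            for k in range(col, n + 1):
                M[row][k] = to_f32(M[row][k] - to_f32(factor * M[col][k]))
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        acc = M[row][n]
        for k in range(row + 1, n):
            acc = to_f32(acc - to_f32(M[row][k] * x[k]))
        x[row] = to_f32(acc / M[row][row])
    return x


def refine(A, b, tol=1e-12, max_iters=20):
    """Iterative refinement: double-precision residual, float32 correction.

    Returns the solution and the number of low-precision solves performed.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(max_iters):
        # Residual r = b - A x, computed in double precision.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol:
            return x, iteration
        c = solve_f32(A, r)                  # cheap low-precision solve
        x = [x[i] + c[i] for i in range(n)]  # update accumulated in double
    return x, max_iters


# Illustrative system (an assumption, not data from the paper).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [0.6, 1.0, 0.8]
solution, solves = refine(A, b)
print(solution, solves)
```

On a well-conditioned system like this one, each low-precision solve recovers roughly single-precision accuracy relative to the current residual, so a handful of corrections is enough to reach double-precision accuracy.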
AMD has announced the AMD FireStream 9170 Stream Processor and an accompanying Software Development Kit (SDK) designed to harness the massive parallel processing power of the graphics processing unit (GPU). The AMD FireStream 9170 will support double-precision floating point technology tailored for scientific and engineering calculations. The AMD FireStream SDK is designed to deliver the tools developers need to create and optimize applications on AMD Stream processors. Built using an open platforms approach, the AMD FireStream SDK allows developers to access key Application Programming Interfaces (APIs) and specifications, enabling performance tuning at the lowest level and development of third party tools. Building on AMD’s Close to the Metal (CTM) interface introduced in 2006, the Compute Abstraction Layer (CAL) provides low-level access to the GPU for development and performance tuning along with forward compatibility to future GPUs. For high-level development, AMD is announcing Brook+, a tool providing C extensions for stream computing based on the Brook project from Stanford University. In addition, AMD also plans to support the AMD Core Math Library (ACML), which provides GPU-accelerated math functions, and the COBRA video library, which accelerates video transcoding. Also available are third-party tools from top industry partners including RapidMind and Microsoft. (Press Release)
CUDPP is the CUDA Data Parallel Primitives Library for NVIDIA CUDA. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. The first beta release of CUDPP is now available, as is the searchable online documentation.
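To show why a primitive like scan is such a useful building block, here is a minimal sketch of a work-efficient exclusive prefix sum and of stream compaction built on top of it. This is plain Python that runs the steps serially; it is not CUDPP's API, and the function names `exclusive_scan` and `compact` are hypothetical. The inner loops are written so that every iteration at a given level is independent, which is the part a GPU would execute in parallel.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using an up-sweep / down-sweep tree.

    Iterations of each inner loop touch disjoint elements, so on a GPU
    they could all run at once; here they simply run one after another.
    """
    n = len(values)
    size = 1
    while size < n:          # pad to the next power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in place, level by level.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]


def compact(values, keep):
    """Stream compaction: keep the elements for which keep(v) is true.

    The scan of the 0/1 flags gives each surviving element its output
    index, so the final scatter needs no coordination between elements.
    """
    if not values:
        return []
    flags = [1 if keep(v) else 0 for v in values]
    offsets = exclusive_scan(flags)
    total = offsets[-1] + flags[-1]
    out = [None] * total
    for i, v in enumerate(values):
        if flags[i]:
            out[offsets[i]] = v
    return out


print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
print(compact([5, -2, 7, 0, -1, 3], lambda v: v > 0))
```

The same pattern (flag, scan, scatter) underlies several of the applications listed above, such as parts of sorting and building summed-area tables.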
The CIGPU-2008 special session on computational intelligence using consumer games and graphics hardware (November 5th, 2007)
The CIGPU-2008 special session on computational intelligence using consumer games and graphics hardware invites submissions of novel scientific and engineering applications of GPUs. Papers submitted for special sessions will be peer-reviewed with the same criteria used for the contributed papers. Submission deadline is 7 January 2008. (WCCI-2008 Special Session Computational Intelligence on Consumer Games and Graphics Hardware CIGPU-2008)
From the introduction: “Processors architecture is evolving towards more software-exposed parallelism through two features: more cores and wider SIMD ISA. At the same time, graphics processors (GPUs) are gradually adding more general purpose programming features. Several software development challenges arise from these trends. First, how do we mitigate the increased software development complexity that comes with exposing parallelism to the developer? Second, how do we provide portability across (increasing) core counts and SIMD ISA? Ct is a deterministic parallel programming model intended to leverage the best features of emerging general-purpose GPU (GPGPU) programming models while fully exploiting CPU flexibility. A key distinction of Ct is that it comprises a top-down design of a complete data parallel programming model, rather than being driven bottom-up by architectural limitations, a flaw in many GPGPU programming models.” (Flexible Parallel Programming for Terascale Architectures with Ct)
This paper by Moss et al. shows an implementation of multi-precision arithmetic running on a 7800-GTX. The paper shows how to compute the modular exponentiation of large integers (a central operation in the RSA cryptosystem) using the restricted control flow available on a DX9 card. Both the background number theory used to express the problem in a suitable way for a streaming architecture, and the program transformation techniques used to generate the GLSL code are described in detail. Surprisingly (given the unusual nature of the problem for GPGPU) the GPU is capable of out-performing the CPU over a large enough dataset by a factor of 2x-3x depending on the CPU implementation. Unfortunately the immature state of the GLSL compiler prevents a further 2x improvement by allocating too many registers, and the large latency for setting the problem up means that over 800 exponentiations need to be performed to break even against the CPU. (Andrew Moss, Dan Page and Nigel Smart. Toward Acceleration of RSA Using 3D Graphics Hardware. In: LNCS 4887, pages 369–388. Springer, December 2007.)
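For readers unfamiliar with the operation being accelerated, the sketch below shows modular exponentiation by the textbook square-and-multiply method, applied to a batch of inputs. It is not the paper's GLSL implementation or its number-theoretic reformulation; it relies on Python's built-in big integers, and the names `mod_exp` and `batch_mod_exp` as well as the toy key values are assumptions chosen for illustration. The batch function hints at why large datasets matter: each exponentiation is independent, so many can be processed side by side.

```python
def mod_exp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by binary square-and-multiply.

    The loop runs once per bit of the exponent, so the work grows with
    the bit length of the exponent rather than its magnitude.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus          # handles modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current bit set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result


def batch_mod_exp(bases, exponent, modulus):
    """Apply the same exponent and modulus to many independent inputs.

    Each element is independent of the others, which is the data-parallel
    shape a streaming processor needs; here it is a plain serial loop.
    """
    return [mod_exp(b, exponent, modulus) for b in bases]


# Toy RSA-style round trip with deliberately tiny, insecure parameters.
n, e, d = 3233, 17, 2753          # n = 61 * 53
messages = [65, 123, 2024]
ciphertexts = batch_mod_exp(messages, e, n)
recovered = batch_mod_exp(ciphertexts, d, n)
print(recovered == messages)
```

Real RSA moduli are hundreds of digits long, which is where the multi-precision arithmetic described in the paper comes in.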