GPU++: An Embedded GPU Development System for General-Purpose Computations

January 14th, 2008

This Ph.D. thesis by Jansen describes a GPGPU development system that is embedded in the C++ programming language using ad-hoc polymorphism (i.e. operator overloading). While this technique is already known from the Sh library and the RapidMind Development Platform, GPU++ uses a more generic class interface and requires no knowledge of GPU programming at all. Furthermore, there is no separation between the different computation units of the CPU and GPU; the appropriate computation frequency is automatically chosen by the GPU++ system using several optimization algorithms. (“GPU++: An Embedded GPU Development System for General-Purpose Computations”. Thomas Jansen. Ph.D. Thesis, University of Munich, Germany.)
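
As a rough illustration of the embedding idea, the sketch below uses operator overloading to record an arithmetic expression as a tree instead of evaluating it immediately; an embedded system can later translate such a tree into device code. It is written in Python rather than C++ for brevity, and the `Expr` class and `var` helper are hypothetical names invented for this example. They are not part of GPU++, Sh, or RapidMind.

```python
class Expr:
    """Hypothetical expression node; overloaded operators build a tree."""

    def __init__(self, op, args=(), value=None):
        self.op = op          # "const", "var", "add" or "mul"
        self.args = args      # child expressions for "add" / "mul"
        self.value = value    # constant value or variable name

    @staticmethod
    def wrap(x):
        # Plain numbers become constant nodes so they can join the tree.
        return x if isinstance(x, Expr) else Expr("const", value=x)

    def __add__(self, other):
        return Expr("add", (self, Expr.wrap(other)))

    def __radd__(self, other):
        return Expr("add", (Expr.wrap(other), self))

    def __mul__(self, other):
        return Expr("mul", (self, Expr.wrap(other)))

    def __rmul__(self, other):
        return Expr("mul", (Expr.wrap(other), self))

    def evaluate(self, env):
        # Reference interpreter; a real system would emit device code instead.
        if self.op == "const":
            return self.value
        if self.op == "var":
            return env[self.value]
        a, b = (arg.evaluate(env) for arg in self.args)
        return a + b if self.op == "add" else a * b

    def __str__(self):
        if self.op in ("const", "var"):
            return str(self.value)
        symbol = "+" if self.op == "add" else "*"
        return f"({self.args[0]} {symbol} {self.args[1]})"


def var(name):
    return Expr("var", value=name)


x, y = var("x"), var("y")
kernel = 2 * x + y * y            # looks like ordinary arithmetic, builds a tree
print(kernel)
print(kernel.evaluate({"x": 3, "y": 4}))
```

The user writes ordinary-looking arithmetic, and the library decides afterwards how and where to run it.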

Exploring weak scalability for FEM calculations on a GPU-enhanced cluster

November 15th, 2007

The first part of this paper by Göddeke et al. surveys co-processor approaches for commodity based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity based cluster we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth that is the decisive performance factor in this scenario. Thus, even the addition of low-end, out of date GPUs leads to improvements in both performance- and power-related metrics. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)

Using GPUs to Improve Multigrid Solver Performance on a Cluster

November 15th, 2007

This article by Göddeke et al. explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes. (Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek. Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
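
The mixed precision, iterative refinement idea mentioned above can be sketched as follows. This is a minimal CPU-only illustration, not the authors' code: a small dense Gaussian elimination stands in for the GPU multigrid solver, single precision is simulated by rounding through `struct`, and the function names and test matrix are assumptions made for the example.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(a, b):
    """Gaussian elimination with partial pivoting, rounding each result to single."""
    n = len(b)
    m = [[to_single(v) for v in row] + [to_single(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = to_single(m[row][col] / m[col][col])
            for k in range(col, n + 1):
                m[row][k] = to_single(m[row][k] - to_single(factor * m[col][k]))
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = m[row][n]
        for k in range(row + 1, n):
            acc = to_single(acc - to_single(m[row][k] * x[k]))
        x[row] = to_single(acc / m[row][row])
    return x


def refine(a, b, iterations=5):
    """Outer loop in double precision, inner correction solves in single."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Residual r = b - A x, computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_low_precision(a, r)             # cheap low-precision correction
        x = [xi + di for xi, di in zip(x, d)]     # update accumulated in double
    return x


a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
exact = [2.0 / 9.0, 1.0 / 9.0, 13.0 / 9.0]

low = solve_low_precision(a, b)
refined = refine(a, b)
low_err = max(abs(u - v) for u, v in zip(low, exact))
refined_err = max(abs(u - v) for u, v in zip(refined, exact))
print(low_err, refined_err)
```

The low-precision solver only ever has to reduce the current residual, so its limited accuracy does not cap the accuracy of the final result.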

Intel Ct Tera-Scale White paper

November 5th, 2007

From the introduction: “Processors architecture is evolving towards more software-exposed parallelism through two features: more cores and wider SIMD ISA. At the same time, graphics processors (GPUs) are gradually adding more general purpose programming features. Several software development challenges arise from these trends. First, how do we mitigate the increased software development complexity that comes with exposing parallelism to the developer? Second, how do we provide portability across (increasing) core counts and SIMD ISA? Ct is a deterministic parallel programming model intended to leverage the best features of emerging general-purpose GPU (GPGPU) programming models while fully exploiting CPU flexibility. A key distinction of Ct is that it comprises a top-down design of a complete data parallel programming model, rather than being driven bottom-up by architectural limitations, a flaw in many GPGPU programming models.” (Flexible Parallel Programming for Terascale Architectures with Ct)

Toward Acceleration of RSA Using 3D Graphics Hardware

November 5th, 2007

This paper by Moss et al. shows an implementation of multi-precision arithmetic running on a 7800-GTX. The paper shows how to compute the modular exponentiation of large integers (a central operation in the RSA cryptosystem) using the restricted control flow available on a DX9 card. Both the background number theory used to express the problem in a suitable way for a streaming architecture, and the program transformation techniques used to generate the GLSL code are described in detail. Surprisingly (given the unusual nature of the problem for GPGPU) the GPU is capable of out-performing the CPU over a large enough dataset by a factor of 2x-3x depending on the CPU implementation. Unfortunately the immature state of the GLSL compiler prevents a further 2x improvement by allocating too many registers, and the large latency for setting the problem up means that over 800 exponentiations need to be performed to break even against the CPU. (Andrew Moss, Dan Page and Nigel Smart. Toward Acceleration of RSA Using 3D Graphics Hardware. In: LNCS 4887, pages 369–388. Springer, December 2007.)
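
For readers unfamiliar with the operation, modular exponentiation can be sketched with the textbook square-and-multiply method. This is not the streaming formulation developed in the paper, which is not reproduced here; the function name `mod_exp` and the toy RSA parameters are assumptions chosen for illustration.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right binary (square-and-multiply) modular exponentiation."""
    if exponent < 0 or modulus <= 0:
        raise ValueError("expects exponent >= 0 and modulus > 0")
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:                  # most significant bit first
        result = (result * result) % modulus       # always square
        if bit == "1":
            result = (result * base) % modulus     # multiply when the bit is set
    return result


# Toy RSA round trip with deliberately tiny, insecure parameters.
n, e, d = 3233, 17, 2753          # n = 61 * 53, e * d = 1 (mod 3120)
message = 65
cipher = mod_exp(message, e, n)
print(cipher, mod_exp(cipher, d, n))
```

The loop length depends only on the bit length of the exponent, but the branch on each bit is the kind of data-dependent control flow that is awkward on hardware with restricted branching.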

Graphics-based Acoustic Simulations

November 5th, 2007

Physically correct acoustic simulations for complex and dynamic environments remain a difficult and computationally intensive task. Graphics hardware is here used for the simulation of sound wave propagation. Two different methods have been implemented, of which one uses ray tracing techniques, while the other is based on difference equations and waveguide meshes. Both techniques can efficiently be implemented within a real-time environment by concentrating on the similarities for sound and light wave propagation, and by exploiting the possibilities of using graphics hardware for non-graphics computations. Applications are discussed for real-time room acoustics, virtual reality as well as for virtual HRIR measurements based on polygonal meshes.

(Ray Acoustics using Computer Graphics Technology. Niklas Röber, Ulrich Kaminski, and Maic Masuch. Proceedings of DAFx 2007.)
(Waveguide-based Room Acoustics through Graphics Hardware. Niklas Röber, Martin Spindler, and Maic Masuch. Proceedings of ICMC 2006.)

Quantum Monte Carlo on GPUs

September 10th, 2007

This paper by Anderson et al. at Caltech describes a method for accelerating Quantum Monte Carlo (QMC) on GPUs. QMC is among the most accurate (and expensive) methods in the quantum chemistry zoo. The work primarily investigates tricks available to this algorithm for speeding up matrix multiplication: because QMC is a statistical algorithm, the authors studied the performance gains available when multiplying many matrices simultaneously. Additionally, the paper explores the Kahan Summation Formula to improve the accuracy of GPU matrix multiplication. (Quantum Monte Carlo on Graphical Processing Units. Amos G. Anderson, William A Goddard III, Peter Schroder. Computer Physics Communications)
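
The Kahan summation formula mentioned above carries a running compensation term that recovers the low-order bits lost in each addition. The sketch below shows the idea on the CPU in double precision; it is not the paper's GPU matrix-multiplication code, and the function names and test data are assumptions for illustration. The naive loop is written out explicitly because the built-in `sum` in recent Python versions already applies a compensated scheme to floats.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: track the rounding error of each addition."""
    total = 0.0
    compensation = 0.0                     # running estimate of lost low-order bits
    for v in values:
        y = v - compensation               # apply the correction to the next term
        t = total + y                      # low-order bits of y may be lost here
        compensation = (t - total) - y     # recover what was lost
        total = t
    return total


# One large term followed by many terms too small to register individually.
values = [1.0] + [1e-16] * 10000
reference = math.fsum(values)              # correctly rounded reference sum
print(abs(naive_sum(values) - reference))
print(abs(kahan_sum(values) - reference))
```

In a matrix product the same compensation would be applied to the inner-product accumulation of each output element.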

Graphic processors to speed-up simulations for the design of high performance solar receptors

September 4th, 2007

This paper by Collange et al. at Université de Perpignan, France, describes a prototype to be integrated into simulation codes that estimate temperature, velocity and pressure to design next generation solar receptors. Such codes delegate to GPUs the computation of heat transfer due to radiation. The authors use Monte-Carlo line-by-line ray-tracing through finite volumes. This means data-parallel arithmetic transformations on large data structures. The performance on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) shows speedups higher than 400 compared to CPU implementations, leaving most of the CPU computing resources available. As there were some questions pending about the accuracy of the operators implemented in GPUs, the authors start this report with a survey and some contributed tests on the various floating point units available on GPUs. (Graphic processors to speed-up simulations for the design of high performance solar receptors. S. Collange, M. Daumas, D. Defour. Proceedings of the IEEE 18th International Conference on Application-specific Systems, Architectures and Processors.)

Graphics Hardware 2007 Papers

August 16th, 2007

On 4-5 August 2007, San Diego hosted the annual Graphics Hardware conference. GPGPU figured prominently in three papers:

  • As transistors get smaller, their transient failure rates increase. Future architectures must adapt to address the resulting reliability problems. Jeremy Sheaffer presented a paper demonstrating a hardware-based redundancy approach to ensure reliability on GPGPU applications. (“A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors”. Jeremy Sheaffer, University of Virginia; David Luebke, NVIDIA Research; Kevin Skadron, University of Virginia.)
  • Magnus Strengert presented a generic, minimally intrusive GLSL debugger that operates transparently to the application. Shader debugging is performed at the per-draw-call level, and the debugger allows single-stepping and the inspection of arbitrary variable content. Linux code is available and Windows code is expected by the end of the year. (“A Hardware-Aware Debugger for the OpenGL Shading Language”. Magnus Strengert, Thomas Klein, and Thomas Ertl, University of Stuttgart.)
  • One critical need for GPGPU developers is a library of general-purpose building blocks for GPU computation. Shubhabrata Sengupta presented a paper describing a GPU implementation of the “scan primitives” and their use in novel GPU implementations of quicksort, efficient sparse matrix-vector multiplication, and tridiagonal matrix systems. This paper won the Best Paper award and the authors are preparing an open-source release. (“Scan Primitives for GPU Computing”. Shubhabrata Sengupta, UC Davis; Mark Harris, NVIDIA Corporation; Yao Zhang, UC Davis; John D. Owens, UC Davis.)
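
To make the scan primitive from the last item concrete, here is a sequential Python simulation of the classic work-efficient exclusive scan (an up-sweep followed by a down-sweep over an implicit binary tree). It is a sketch of the general technique only, not code from the paper; the function name and the zero-padding to a power of two are assumptions made for the example. On a GPU, each inner `for` loop would run as one parallel step.

```python
def exclusive_scan(values):
    """Exclusive prefix sum via up-sweep and down-sweep phases."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:                        # pad to the next power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep (reduce): build partial sums at the internal tree nodes.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]     # left child receives the parent's prefix
            data[i] += left                # right child adds the left subtree's sum
        stride //= 2

    return data[:n]


print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

Each output element is the sum of all inputs before it, which is what makes scan useful for tasks such as computing output offsets in compaction and sorting.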

All Graphics Hardware 2007 papers are available in the ACM digital library. In addition, the GH07 program page contains slides for all talks as well as two keynote talks (Chas. Boyd of the Microsoft DirectX team: “Mass Market Applications of Data-Parallel Computing” and Michael Jones, chief technologist of Google Earth: “GPUs for the true mass market”) and vendor talks from AMD and NVIDIA about their latest processors (AMD Radeon HD 2900 and NVIDIA’s Tesla).

Two-electron Integral Evaluation on the Graphics Processor Unit

August 16th, 2007

Abstract: We propose the algorithm to evaluate the Coulomb potential in the ab initio density functional calculation on the graphics processor unit (GPU). The numerical accuracy required for the algorithm is investigated in detail. It is shown that GPU, which supports only the single-precision floating number natively, can take part in the major computational tasks. Because of the limited size of the working memory, the Gauss-Rys quadrature to evaluate the electron repulsion integrals (ERIs) is investigated in detail. The error analysis of the quadrature is performed. New interpolation formula of the roots and weights is presented, which is suitable for the processor of the single-instruction multiple-data type. It is proposed to calculate only small ERIs on GPU. ERIs can be classified efficiently with the upper-bound formula. The algorithm is implemented on NVIDIA GeForce 8800 GTX and the Gaussian 03 program suite. It is applied to the test molecules Taxol and Valinomycin. The total energies calculated are essentially the same as the reference ones. The preliminary results show the considerable speedup over the commodity microprocessor. (Two-electron integral evaluation on the graphics processor unit. Koji Yasuda. Journal of Computational Chemistry. July 5, 2007.)
