GPGPU
General-Purpose Computation Using Graphics Hardware

Introduction

GPGPU stands for General-Purpose computation on GPUs. With the increasing programmability of commodity graphics processing units (GPUs), these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. The goal of this page is to catalog the current and historical use of GPUs for general-purpose computation.

Case studies on GPU usage and data structure design

Abstract: Big improvements in the performance of graphics processing units (GPUs) turned them into a compelling platform for high performance computing. In this thesis, we discuss the usage of NVIDIA's CUDA in two applications -- Einstein@Home, a distributed computing software, and OpenSteer, a game-like application. Our work on Einstein@Home demonstrates that CUDA can be integrated into existing applications with minimal changes, even in programs designed without considering GPU usage. However, the existing data structure of Einstein@Home performs poorly when used on the GPU. We demonstrate that using a redesigned data structure improves the performance to about three times as fast as the original CPU version, even though the code executed on the device is not optimized. We further discuss the design of a novel spatial data structure called "dynamic grid" that is optimized for CUDA usage. We measure its performance by integrating it into the Boids scenario of OpenSteer. Our new concept outperforms a uniform grid by a margin of up to 15%, even though the dynamic grid still provides optimization potential. (Case studies on GPU usage and data structure design. J. Breitbart, Master's thesis, Universität Kassel, 2008)
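For readers unfamiliar with the baseline, here is a minimal sketch of a uniform grid used for neighbor queries in a Boids-style simulation. It is a CPU-side Python illustration only: it is not the thesis's CUDA code and it does not implement the dynamic grid. The function names, the 2-D setting, and the assumption that the query radius is no larger than the cell size are all choices made for this example.

```python
import math
from collections import defaultdict


def build_uniform_grid(positions, cell_size):
    """Bucket 2-D points into square cells of side `cell_size`."""
    grid = defaultdict(list)
    for index, (x, y) in enumerate(positions):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append(index)
    return grid


def neighbors_within(positions, grid, cell_size, query_index, radius):
    """Return sorted indices of points within `radius` of the query point.

    Assumes radius <= cell_size, so only the 3x3 block of cells around
    the query cell has to be inspected.
    """
    qx, qy = positions[query_index]
    cx, cy = math.floor(qx / cell_size), math.floor(qy / cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get() avoids creating empty cells in the defaultdict.
            for j in grid.get((cx + dx, cy + dy), ()):
                if j == query_index:
                    continue
                px, py = positions[j]
                if (px - qx) ** 2 + (py - qy) ** 2 <= radius * radius:
                    found.append(j)
    return sorted(found)


# Hypothetical flock of five boids.
boids = [(0.5, 0.5), (1.2, 0.6), (3.9, 3.9), (0.4, 1.4), (2.6, 0.5)]
grid = build_uniform_grid(boids, cell_size=1.0)
print(neighbors_within(boids, grid, 1.0, query_index=0, radius=1.0))
```

The point of the grid is that each boid inspects a constant number of cells instead of the whole flock; how the cells are laid out in memory is the kind of design decision the abstract is concerned with.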

Posted: 11 Aug 2008 [GPGPU /Data Parallel Algorithms] #

High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy

This paper describes an implementation of fast deformable image registration using GPUs and CUDA in radiation therapy. Using lung and prostate volumetric imaging, the GPU implementation is 40-66 times faster than a single-threaded CPU implementation and 25-41 times faster than a multithreaded implementation. The paradigm of GPU-based near-real-time deformable image registration opens up a host of clinical applications for medical imaging. (High performance computing for deformable image registration: Towards a new paradigm in adaptive radiotherapy. Sanjiv S. Samant, Junyi Xia, Pinar Muyan-Özçelik, John D. Owens. Medical Physics, 2008.)

Posted: 11 Aug 2008 [GPGPU /Image And Volume Processing] #

Larrabee: A Many-Core x86 Architecture for Visual Computing

Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee’s potential for a broad range of parallel computation. (Larrabee: A Many-Core x86 Architecture for Visual Computing. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. Proceedings of SIGGRAPH 2008.)
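The binning step mentioned in the abstract can be pictured with a small sketch: each triangle is assigned to every screen tile that its bounding box overlaps, so that a tile can later be rendered using only its own list. This is a generic illustration in Python, not Larrabee's renderer; the function name, the 64-pixel tile size, and the conservative bounding-box test are assumptions made for the example.

```python
def bin_triangles(triangles, tile_size, width, height):
    """Map each screen tile (tx, ty) to the ids of triangles whose
    screen-space bounding box overlaps that tile."""
    bins = {}
    for tri_id, vertices in enumerate(triangles):
        xs = [x for x, _ in vertices]
        ys = [y for _, y in vertices]
        # Clamp the bounding box to the screen, then convert to tile indices.
        x0 = max(0, int(min(xs))) // tile_size
        x1 = min(width - 1, int(max(xs))) // tile_size
        y0 = max(0, int(min(ys))) // tile_size
        y1 = min(height - 1, int(max(ys))) // tile_size
        # Off-screen triangles produce an empty range and land in no bin.
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(tri_id)
    return bins


# Two hypothetical triangles on a 256x128 screen with 64-pixel tiles.
triangles = [
    [(10, 10), (50, 20), (30, 60)],
    [(60, 30), (130, 40), (100, 100)],
]
bins = bin_triangles(triangles, tile_size=64, width=256, height=128)
for tile in sorted(bins):
    print(tile, bins[tile])
```

Because every tile owns its list, tiles can be processed independently, which is one way to read the abstract's claim about reduced lock contention and more parallelism.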

Posted: 04 Aug 2008 [GPGPU /GPUs] #

Faogen 2.0: Ambient occlusion calculation on the GPU

Faogen is a Fast Ambient Occlusion Generator. It uses a GPU to accelerate computation of ambient occlusion and bent normals both as per-vertex data and in texture images. Faogen 2.0 provides updated ambient aperture and bent normal shaders customizable by editing two simple GLSL functions. Other features include improved precision on large scale models, adjustable background for AO texture images, lighting animation control and bugfixes. (Faogen)

Posted: 04 Aug 2008 [GPGPU /Advanced Rendering] #

Semi-uniform Adaptive Patch Tessellation

This paper by Dyken, Reimers, and Seland of the University of Oslo and SINTEF ICT presents an adaptive tessellation scheme for parametric patches producing consistent and watertight tessellations. The scheme uses only a few base tessellations and is particularly well suited for use with instancing. In addition, a novel GPGPU bucket sort approach based on HistoPyramid is presented. The paper gives implementation details and performance benchmarks. (Semi-uniform Adaptive Patch Tessellation. C. Dyken, M. Reimers, and J. Seland. Computer Graphics Forum, to appear.)

Posted: 04 Aug 2008 [GPGPU /Advanced Rendering] #

Real-time Visual Tracker by Stream Processing

This work describes the implementation of a real-time visual tracker that targets the position and 3D pose of objects (specifically faces) in video sequences. The use of GPUs for the computation and efficient sparse-template-based particle filtering allows real-time processing even when tracking multiple faces simultaneously in high-resolution video frames. Using a GPU and the NVIDIA CUDA technology, performance improvements as large as ten times compared to a similar CPU-only tracker are achieved. (Real-time Visual Tracker by Stream Processing. Oscar Mateo Lozano and Kazuhiro Otsuka. Journal of Signal Processing Systems.)

Posted: 15 Jul 2008 [GPGPU /Image And Volume Processing/Computer Vision] #

Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs)

Abstract:

In a previous publication, we have examined the fundamental difference between computational precision and result accuracy in the context of the iterative solution of linear systems as they typically arise in the Finite Element discretization of Partial Differential Equations (PDEs). In particular, we evaluated mixed- and emulated-precision schemes on commodity graphics processors (GPUs), which at that time only supported computations in single precision. With the advent of graphics cards that natively provide double precision, this report updates our previous results.

We demonstrate that with new co-processor hardware supporting native double precision, such as NVIDIA's G200 and T10 architectures, the situation does not change qualitatively for PDEs, and the previously introduced mixed precision schemes are still preferable to double precision alone. But the schemes achieve significant quantitative performance improvements with the more powerful hardware. In particular, we demonstrate that a Multigrid scheme can accurately solve a common test problem in Finite Element settings with one million unknowns in less than 0.1 seconds, which is truly outstanding performance. We support these conclusions by exploring the algorithmic design space enlarged by the availability of double precision directly in the hardware.

(Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs). Dominik Göddeke and Robert Strzodka. Technical Report, 2008.)
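The general idea behind a mixed-precision scheme can be illustrated with classic iterative refinement: residuals and solution updates are kept in double precision, while the (expensive) correction solve runs in single precision. The sketch below is a toy under stated assumptions, not the authors' solver: it uses a few Jacobi sweeps instead of multigrid, a hypothetical 3x3 system, and it emulates single precision on the CPU by rounding through `struct`. All function names are invented for the example.

```python
import struct


def to_single(value):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def jacobi_single(matrix, rhs, sweeps):
    """Approximately solve matrix * x = rhs, rounding every operation to single."""
    n = len(rhs)
    x = [0.0] * n
    for _ in range(sweeps):
        new_x = []
        for i in range(n):
            acc = to_single(rhs[i])
            for j in range(n):
                if j != i:
                    acc = to_single(acc - to_single(to_single(matrix[i][j]) * x[j]))
            new_x.append(to_single(acc / to_single(matrix[i][i])))
        x = new_x
    return x


def residual_double(matrix, rhs, x):
    """Compute rhs - matrix * x in double precision."""
    n = len(x)
    return [rhs[i] - sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]


def mixed_precision_solve(matrix, rhs, outer_iterations=10, inner_sweeps=20):
    """Iterative refinement: double-precision residual, single-precision correction."""
    x = [0.0] * len(rhs)
    for _ in range(outer_iterations):
        r = residual_double(matrix, rhs, x)
        correction = jacobi_single(matrix, r, inner_sweeps)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


# Hypothetical diagonally dominant test system with exact solution (0.1, 0.2, 0.3).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [0.2, 0.4, 1.0]
exact = [0.1, 0.2, 0.3]

single_only = jacobi_single(A, b, 40)
mixed = mixed_precision_solve(A, b)
single_error = max(abs(s - e) for s, e in zip(single_only, exact))
mixed_error = max(abs(m - e) for m, e in zip(mixed, exact))
print("single only:", single_error)
print("mixed:      ", mixed_error)
```

The single-precision solve stalls at single-precision accuracy, while the refinement loop recovers a result accurate to roughly double precision even though all the inner work is done in single.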

Posted: 14 Jul 2008 [GPGPU /Scientific Computing] #

CUDA.NET

CUDA.NET is an effort by GASS to provide access to NVIDIA CUDA functionality through .NET applications. The library currently provides .NET bindings for CUDA functions, so programmers can use existing .NET applications as hosts for CUDA-enabled devices and treat the GPU as a powerful co-processor from .NET. The current distribution contains a .NET library that can be used from any .NET application and language, along with examples in C# and Python showing how to use the library. The API is straightforward and compatible with the NVIDIA CUDA API available for C applications, with a few modifications to ease development and align with .NET standards. See the CUDA.NET home page for more details.

Posted: 10 Jul 2008 [GPGPU /Tools] #

NVIDIA appoints first CUDA center of excellence

From the press release:

SANTA CLARA, CA & URBANA, IL—JUNE 30, 2008—NVIDIA Corporation (Nasdaq: NVDA), the worldwide leader in visual computing technologies, and the University of Illinois at Urbana-Champaign (UIUC) today announced that UIUC has been named as the world’s first CUDA Center of Excellence. In addition to the appointment, NVIDIA has donated $500,000 to UIUC for the development of parallel computing facilities and the continuation of its research programs.

“The CUDA Center of Excellence program rewards schools that truly embrace the concept of parallel processing as the future of computing,” said Dr. David Kirk, chief scientist at NVIDIA. “Schools receiving this accreditation integrate the CUDA software environment into their curriculum to help their students harness the capabilities of these new parallel processing architectures. As one of the country’s leading schools in this field, I am personally delighted to appoint UIUC as our first CUDA Center of Excellence.”

The Theoretical and Computational Biophysics Group at UIUC was one of the first research groups to leverage the parallel architecture of the GPU to accelerate their research in the field of computational biophysics. They have successfully accelerated NAMD/VMD – a popular parallel molecular dynamics application that analyzes large biomolecular systems. It is hoped that this donation will aid this group, and others at the university, to further their work and speed them down the path to great discovery.

(Complete Press Release)

Posted: 04 Jul 2008 [GPGPU /Miscellaneous/Research Groups] #

PRACE award presented to young scientist at ISC’08 for GPGPU work

From this article: "PRACE, Partnership for Advanced Computing in Europe, awarded a prize for the best scientific paper submitted to ISC’08 by a European student or young scientist on petascaling. The authors of the award winning paper are Stefan Turek, Dominik Göddeke, Christian Becker, Sven H.M. Buijssen and Hilmar Wobker from the Institute of Applied Mathematics, Dortmund University of Technology, Germany. Their work, UCHPC – UnConventional High Performance Computing for Finite Element Simulations, was selected by the ISC’08 Award Committee, headed by Michael Resch, High Performance Computing Center Stuttgart. Achim Bachem, Chairman of the Board Forschungszentrum Jülich and PRACE coordinator presented the PRACE Award at the ISC’08 opening ceremony in Dresden on Wednesday, 18 June. Dominik Göddeke, Ph.D. student in the team of Professor Stefan Turek will receive a sponsorship for the participation in a conference relevant to Petascale computing." Dominik has been an active GPGPU researcher for several years, and is one of the most active and helpful contributors to the GPGPU.org forums. (PRACE award presented to young scientist at ISC’08)

Posted: 20 Jun 2008 [GPGPU /Scientific Computing] #

Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU

FEAST is a hardware-oriented MPI-based Finite Element solver toolkit. With the extension FEASTGPU the authors have previously demonstrated that significant speed-ups in the solution of the scalar Poisson problem can be achieved by the addition of GPUs as scientific co-processors to a commodity based cluster. In this paper the authors put the more general claim to the test: Applications based on FEAST, that ran only on CPUs so far, can be successfully accelerated on a co-processor enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/co-processor interaction than the Poisson example, and is thus better suited to assess the practicability of the acceleration approach. The paper presents accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, it demonstrates in detail that the single precision execution of the co-processor does not affect the final accuracy. The paper establishes how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speed-up. Subsequent analysis reveals which measures will increase these factors further. (Dominik Göddeke, Hilmar Wobker, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Stefan Turek. Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU. International Journal of Computational Science and Engineering (to appear).)
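One way to see why a local gain of 5.5 to 9.0 shrinks to a total speed-up of 1.6 to 2.6 is Amdahl's law: only the accelerated part of the runtime gets faster. The sketch below inverts the law to ask what fraction of the original runtime would have to be accelerated to produce the reported totals. Pairing 5.5 with 1.6 and 9.0 with 2.6 is an assumption made purely for illustration, and the resulting fractions are not figures from the paper; the function names are likewise hypothetical.

```python
def total_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: overall speed-up when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)


def implied_fraction(local_speedup, observed_total):
    """Invert Amdahl's law for the fraction of runtime that was accelerated."""
    return (1.0 - 1.0 / observed_total) / (1.0 - 1.0 / local_speedup)


# Assumed pairings of the endpoints quoted in the summary above.
for local, total in [(5.5, 1.6), (9.0, 2.6)]:
    fraction = implied_fraction(local, total)
    print(f"local {local}x, total {total}x: accelerated fraction ~ {fraction:.2f}")
```

Under this simple model, raising the total factor means either accelerating a larger share of the code or shrinking the part that stays on the CPU.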

Posted: 06 Jun 2008 [GPGPU /Scientific Computing] #

ISC 2008 Tutorial: High Performance Computing with CUDA

In this tutorial, NVIDIA engineers and academic and industrial researchers will present CUDA and discuss its advanced use for science and engineering. The tutorial will demonstrate CUDA with traditional HPC examples including BLAS, FFT, and integration with Fortran and high-level languages (MATLAB, Mathematica, Python) and describe in detail the programming model at the heart of it all. It will then turn to advanced topics including optimizing CUDA programs, CUDA floating point performance and accuracy, and CUDA programming strategies and tips. Finally the tutorial will present detailed case studies in which domain scientists will describe their experience using CUDA to accelerate mature, deployed, real-world science codes. Scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes (see http://www.nvidia.com/cuda for a list of codes, academic papers and commercial packages based on CUDA). Presenters include Massimiliano Fatica (NVIDIA), Mark Harris (NVIDIA), Patrick LeGresley (NVIDIA), and Jim Phillips (UIUC). Follow this link to register.

Posted: 03 Jun 2008 [GPGPU /Conferences] #

1st Annual UMD GPGPU Programming Contest

The University of Maryland is sponsoring a GPGPU programming contest. All entries will be released under version 3 of the GPL at the conclusion of the contest. Contestants are asked to submit code for sparse matrix multiplication. UMD will be evaluating entries on both vector/sparse matrix and sparse matrix/sparse matrix multiplications, using a variety of different inputs. As the contest progresses, UMD will update the LeaderBoard regularly, so contestants will have some idea of where they stand. Contestants are welcome to make as many entries as they want, so submit early and then tweak your designs. Entries can be written in either GLSL or CUDA. Prizes include NVIDIA Quadro FX 5600 GPUs, sponsored by NVIDIA. (http://scriptroute.cs.umd.edu/gpucompete/)
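As a reference point for the sparse matrix/vector case, here is a minimal CPU sketch in Python using the compressed sparse row (CSR) layout. The contest announcement does not prescribe a storage format, so CSR, the function names, and the sample matrix are assumptions for illustration; an actual entry would be written in GLSL or CUDA.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix to CSR (values, column indices, row pointers)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(col)
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, vector):
    """Compute y = A * x for a CSR matrix A."""
    result = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * vector[col_indices[k]]
        result.append(acc)
    return result


# Hypothetical 4x4 matrix with six nonzeros.
dense = [
    [4, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [5, 0, 0, 6],
]
values, col_indices, row_ptr = dense_to_csr(dense)
print(csr_matvec(values, col_indices, row_ptr, [1, 2, 3, 4]))
```

Each output row is independent of the others, which is why one common GPU mapping assigns a row (or a group of rows) to each thread; rows with very different nonzero counts then become a load-balancing problem.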

Posted: 28 May 2008 [GPGPU /Contests] #

A Fast Similarity Join Algorithm Using Graphics Processing Units

This paper by Lieberman et al. at the University of Maryland describes an application of GPU processing to the similarity join, a common operation in spatial databases with many applications. A similarity join takes two sets of points A and B and returns all pairs (p, q) with p ∈ A and q ∈ B where the distance D(p,q) ≤ ε. An algorithm named LSS is presented that executes on a GPU, taking advantage of the GPU's parallelism and large data throughput. To achieve peak efficiency, LSS relies only on simple primitive operations that execute quickly on the GPU, such as the sorting and searching of arrays. It recasts the similarity join as a sort-and-search problem by mapping its input datasets onto a set of space-filling curves, generated by a parallel sort routine on the GPU. It then searches small intervals of these curves that are guaranteed to contain all pairs of the correct result. LSS offers a balance between time and work efficiencies and is shown to perform well when compared against existing prominent high-dimensional similarity join methods. (M. D. Lieberman, J. Sankaranarayanan, and H. Samet. A fast similarity join algorithm using graphics processing units. In Proceedings of the 24th IEEE International Conference on Data Engineering, pages 1111-1120, Cancun, Mexico, April 2008.)
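To make the definition and the sort-and-search idea concrete, here is a small Python sketch. It is not LSS: instead of space-filling curves it sorts B along a single coordinate, which is a deliberately simplified stand-in, and it runs sequentially on the CPU. The function names and sample points are assumptions for the example. The brute-force version is included as the reference definition of the join.

```python
import bisect
import math


def similarity_join_bruteforce(a_points, b_points, eps):
    """Reference definition: all index pairs (i, j) with D(A[i], B[j]) <= eps."""
    return sorted(
        (i, j)
        for i, p in enumerate(a_points)
        for j, q in enumerate(b_points)
        if math.dist(p, q) <= eps
    )


def similarity_join_sort_search(a_points, b_points, eps):
    """Sort B once by its first coordinate, then binary-search a small interval per query."""
    order = sorted(range(len(b_points)), key=lambda j: b_points[j][0])
    keys = [b_points[j][0] for j in order]
    pairs = []
    for i, p in enumerate(a_points):
        # Any q within eps of p must have a first coordinate in [p0 - eps, p0 + eps].
        lo = bisect.bisect_left(keys, p[0] - eps)
        hi = bisect.bisect_right(keys, p[0] + eps)
        for j in order[lo:hi]:
            if math.dist(p, b_points[j]) <= eps:  # filter the candidates exactly
                pairs.append((i, j))
    return sorted(pairs)


# Hypothetical 2-D inputs.
A = [(0.0, 0.0), (5.0, 5.0)]
B = [(0.5, 0.5), (0.9, 3.0), (5.2, 4.9), (9.0, 9.0)]
print(similarity_join_sort_search(A, B, eps=1.0))
```

The structure mirrors the description above: one sort up front, then a search over a short interval that is guaranteed to contain every true match, followed by an exact distance check.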

Posted: 25 May 2008 [GPGPU /Data Parallel Algorithms] #

Multiscale and local search methods for real time region tracking with particle filters: local search driven by adaptive scale estimation on GPUs

This paper by Cabido et al. presents a real-time object tracking algorithm, based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, for both CPU and GPU architectures. The developed system provides successful results in precise tracking of single and multiple targets in monocular video, operating in real-time at 70 frames per second for 640 × 480 video resolutions on the GPU, up to 1,100% faster than the CPU version of the algorithm. (Multiscale and local search methods for real time region tracking with particle filters: local search driven by adaptive scale estimation on GPUs. Raul Cabido, Antonio S. Montemayor, Juan Jose Pantrigo, and Bryson R. Payne. Machine Vision and Applications, Springer, 2008.)

Posted: 25 May 2008 [GPGPU /Image And Volume Processing/Computer Vision] #

GPU acceleration of cutoff pair potentials for molecular modeling applications

The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. The paper presents algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low computational efficiency, a new strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition. (C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W. W. Hwu. GPU acceleration of cutoff pair potentials for molecular modeling applications. Proceedings of the 2008 Conference On Computing Frontiers, pp.273-282, 2008.) (http://www.ks.uiuc.edu/Research/gpu/)
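For orientation, the quantity being computed can be sketched as a direct summation: at every grid point, add up charge / distance over the atoms that lie within the cutoff. This Python sketch is only a definition-level illustration. It uses a hard truncation, omits physical constants and units, and does none of the spatial decomposition that the paper is actually about; the function name and sample data are assumptions.

```python
import math


def cutoff_potential_map(atoms, grid_points, cutoff):
    """Sum charge / r at each grid point over atoms with 0 < r < cutoff.

    `atoms` is a list of (x, y, z, charge) tuples and `grid_points` a list
    of (x, y, z) tuples. Constants and units are omitted.
    """
    potentials = []
    for gx, gy, gz in grid_points:
        total = 0.0
        for ax, ay, az, charge in atoms:
            r = math.sqrt((ax - gx) ** 2 + (ay - gy) ** 2 + (az - gz) ** 2)
            if 0.0 < r < cutoff:  # atoms beyond the cutoff contribute nothing
                total += charge / r
        potentials.append(total)
    return potentials


# Hypothetical data: two nearby opposite charges and one distant atom.
atoms = [(0.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, -1.0), (10.0, 0.0, 0.0, 5.0)]
grid_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
potential = cutoff_potential_map(atoms, grid_points, cutoff=3.0)
print(potential)
```

This loop tests every atom against every grid point even though most pairs fall outside the cutoff; binning atoms spatially so that each grid point only visits nearby atoms is the kind of work-efficiency gain the summary describes.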

Posted: 25 May 2008 [GPGPU /Scientific Computing] #

GPU Computing

Abstract: "The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications." (J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, "GPU Computing", Proceedings of the IEEE, vol.96, no.5, pp.879-899, May 2008)

Posted: 25 May 2008 [GPGPU /Scientific Computing] #

CIGPU 5 June 2008 Hong Kong additional technical discussion

In addition to the papers already announced, Dr. Simon Harding (Memorial University, Newfoundland) and Dr. Tien-Tsin Wong (The Chinese University of Hong Kong) will lead a discussion on the practicalities of running evolution on modern graphics cards. They will contrast the current leading GPGPU tools considering ease of use, and support for debugging and performance monitoring. CIGPU will close with a short session considering the future of computational intelligence on GPUs.

Posted: 25 May 2008 [GPGPU /Conferences] #

Graph Layout on the GPU

A graph is an ordered pair G=(V,E) where V is a set of nodes and E is a set of edges connecting nodes. Graph drawing addresses the problem of creating geometric representations of graphs. Unlike matrices or images, graphs are unstructured and hence graph layout does not seem to be suitable for acceleration on the GPU. We present two GPU-accelerated graph drawing algorithms which are able to quickly compute aesthetic layouts of large graphs. One is for the layout of a single graph and the other is for computing stable layouts of a sequence of graphs. Speedups of 5.5x to 17x relative to a CPU implementation are demonstrated. (Yaniv Frishman and Ayellet Tal, Multi-Level Graph Layout on the GPU, IEEE Transactions on Visualization and Computer Graphics (Proceedings Information Visualization 2007), 13(6):1310-1317, 2007)
(Yaniv Frishman and Ayellet Tal, Online Dynamic Graph Drawing, accepted to IEEE Transactions on Visualization and Computer Graphics)

Posted: 25 May 2008 [GPGPU /Data Parallel Algorithms] #

gDEBugger V4.1 Adds Geometry Shaders Support and new ATI Performance Metrics Integration

The new gDEBugger V4.1 adds Geometry Shader Support and enables developers to view allocated geometry shader objects, shader source code and properties. It also allows the developer to Edit and Continue shaders “on the fly”. Support for the new ATI (AMD) driver performance metrics infrastructure has been added. This integration enables users to view ATI performance metrics such as hardware utilization, vertex wait for pixel, pixel wait for vertex, overdraw and more. These performance metrics together with gDEBugger’s Performance Analysis Toolbar provide a powerful solution for locating graphics system performance bottlenecks. gDEBugger, an OpenGL and OpenGL ES debugger and profiler, traces application activity on top of the OpenGL API, letting programmers see what is happening within the graphics system implementation to find bugs and optimize OpenGL application performance. gDEBugger runs on Microsoft Windows and Linux operating systems. (http://www.gremedy.com)

Posted: 25 May 2008 [GPGPU /Tools] #
