Supercomputing '06 Workshop: "General-Purpose GPU Computing: Practice And Experience"

Tampa Convention Center
Tampa, Florida

SC’06 is proud to announce the “General-Purpose GPU Computing: Practice and Experience” workshop.  Advances in graphics processing unit (GPU) capabilities and functionality have expanded the GPU's problem space beyond traditional graphics to general-purpose computing, with the GPU acting as a parallel vector/matrix coprocessor.  Examples include game physics, image processing, scientific computing, sorting, and database query processing.

This workshop features invited speakers and poster presenters who provide insights into current GPGPU practice and experience, and who chart future directions in heterogeneous and homogeneous multi-core processor architectures and in data-parallel processor architectures such as GPUs.  Developing and adapting software to exploit the GPU’s highly parallel capabilities presents numerous implementation challenges, which have motivated a variety of approaches for integrating GPUs and CPUs more fully to achieve high computational throughput.  Similar software and integration challenges face the coming multi-core, stream, and data-parallel processors, and their solutions will undoubtedly build upon the foundation of GPGPU practice and experience.

The workshop speaker list, in alphabetical order:

  • Ian Buck, NVIDIA Corporation
  • Frederica Darema, National Science Foundation
  • Dominik Goeddeke, University of Dortmund, and Robert Strzodka, Stanford University
  • Mary Hall, USC/ISI
  • Los Alamos National Laboratory “Roadrunner” Supercomputer Team
  • Dinesh Manocha, UNC Chapel Hill
  • Michael Paolini, IBM
  • Matthew Papakipos, CTO, PeakStream
  • Ryan N. Schneider, CTO, Acceleware
  • Mark Segal, ATI
  • Burton Smith, Microsoft Corporation
  • Marc Tremblay, Sun Microsystems

The topics addressed by the speakers range from current GPGPU practice and experience to future issues and research areas in parallel computing that are being driven by GPGPU innovations and lessons learned, such as the IBM Cell Broadband Engine and Sun Microsystems’ Niagara/Sun4v processor.

Plan on attending the SC’06 “General-Purpose GPU Computing: Practice and Experience” workshop for an informative and engaging look at this exciting area!

Workshop Schedule and Program

Time      Speaker, Affiliation, and Talk

8:00am    B. Scott Michel, The Aerospace Corporation
          Morning Workshop Introduction
8:10am    Frederica Darema, National Science Foundation
          Advances in Systems Software for Emerging Computer Systems
9:00am    Break
9:10am    Burton Smith, Microsoft Corporation
          Reinventing Computing
10:00am   Break
10:10am   Marc Tremblay, Sun Microsystems
          Multithreaded Multicores: An Update From Sun
10:50am   Dinesh Manocha, UNC Chapel Hill
11:20am   Mary Hall, USC/ISI
          Strategies for High-Performance Heterogeneous Applications: A Compiler Perspective
12:00pm   Lunch
1:00pm    B. Scott Michel, The Aerospace Corporation
          Afternoon Workshop Introduction
1:10pm    Dominik Goeddeke, University of Dortmund, and Robert Strzodka, Stanford University
          Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations
1:50pm    Ian Buck, NVIDIA Corporation
          GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU
2:30pm    Ryan N. Schneider, Acceleware, Inc.
          “Video Games” People Play at Work
2:50pm    Break
3:00pm    Mark Segal, AMD/ATI
          Graphics Hardware, Graphics APIs and Computation on GPUs
3:30pm    Matt Papakipos, PeakStream, Inc.
          Stream Programming on the PeakStream Platform
4:00pm    Break
4:10pm    Mike Paolini, IBM
          Cell Broadband Engine Processor
4:50pm    Allen McPherson and John Turner, Los Alamos National Laboratory
          The LANL “Roadrunner” Supercomputer
5:30pm    End of Workshop

Posters

Amos Anderson
William A. Goddard, III
Peter Schröder
California Institute of Technology
Matrix Multiplication and Quantum Monte Carlo (QMC) on Graphical Processing Units (GPUs)

In recent years, GPUs have widened their applicability from multimedia acceleration to general-purpose computation. With enhanced hardware functionality and improved software interfaces, the GPU has become a processor worth considering for scientific calculations. Although the GPU is not suitable for arbitrary types of computations, the range of candidate applications is growing. We have studied our QMC software for compatibility with the GPU; this involves matrix multiplication as well as a couple of quantum-chemistry-specific kernels.  more…
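The abstract above does not include code, but the dense matrix multiplication it mentions is the canonical GPU-friendly kernel. Purely as an illustration (not the authors' implementation), a naive single-precision matrix multiply written in CUDA, the programming model introduced elsewhere in this program, might look like the sketch below; all names and the 16x16 block size are assumptions.

    // Illustrative sketch only: naive single-precision matrix multiply C = A * B.
    // Matrices are N x N, stored row-major; one thread computes one element of C.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Host-side launch (error checking omitted); round the grid size up if N
    // is not a multiple of 16 -- the bounds check above handles the remainder:
    //   dim3 block(16, 16);
    //   dim3 grid(N / 16, N / 16);
    //   matmul_naive<<<grid, block>>>(dA, dB, dC, N);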

David A. Bader
Virat Agarwal
Kamesh Madduri
College of Computing, Georgia Institute of Technology
Efficient Implementation of Irregular Algorithms on Cell Multi-core Architecture

The Cell Broadband Engine is a novel architectural design by Sony, Toshiba, and IBM, primarily targeting high-performance multimedia and gaming applications. Recent results show that the Cell architecture is well suited for scientific applications that exhibit predictable memory access patterns, and where communication and computation can be overlapped more effectively than on conventional cache-based architectures. In this work, we consider memory-intensive applications that have a low degree of locality. We design and implement an efficient algorithm for list ranking, a representative problem from the class of combinatorial and graph-theoretic applications. Due to its highly irregular memory access patterns, list ranking is a particularly challenging problem to solve efficiently on current cache-based architectures. We describe a generic work-partitioning technique on the Cell to hide memory access latency, and apply it to implement list ranking efficiently. Our simulation results corroborate the latency-hiding algorithmic technique, and we demonstrate a substantial speedup for list ranking on the Cell in comparison to traditional cache-based microprocessors.
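For readers unfamiliar with the problem, list ranking is typically parallelized by pointer jumping (Wyllie's algorithm): every element repeatedly adds its successor's partial rank and then links past it, finishing in roughly log2(n) rounds. The poster targets the Cell processor; the sketch below is not the authors' implementation but a generic data-parallel round of pointer jumping written in CUDA, included only to make the irregular, data-dependent memory accesses concrete. All names are placeholders.

    // Illustrative sketch only: one round of Wyllie's pointer jumping for list
    // ranking. Reads come from the *_in arrays and writes go to the *_out
    // arrays, so each round is race-free; the host swaps the buffers and
    // repeats roughly log2(n) times. NIL (-1) marks the end of the list.
    #define NIL (-1)

    __global__ void pointer_jump(const int *succ_in, const int *rank_in,
                                 int *succ_out, int *rank_out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int s = succ_in[i];
        if (s != NIL) {
            rank_out[i] = rank_in[i] + rank_in[s];  // data-dependent (irregular) read
            succ_out[i] = succ_in[s];               // jump over the successor
        } else {
            rank_out[i] = rank_in[i];
            succ_out[i] = NIL;
        }
    }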

Raúl Cabido (1)
Antonio S. Montemayor (1)
Juan José Pantrigo (1)
Bryson R. Payne (2)
(1) Universidad Rey Juan Carlos (Madrid, Spain)
(2) North Georgia College & State University (USA)
Scalable Particle Filter Framework for Visual Tracking

In this paper, we present work in progress toward multiple-object tracking that exploits the GPU as the main processor. This work builds on new Shader Model 3.0 capabilities and recent research on GPU tracking, such as [Montemayor et al. 2006], extending the previous scope to a scalable and reusable framework.  more…
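As background, the per-particle work in a particle filter (state propagation and weight update) is embarrassingly parallel, which is what makes the GPU attractive here. The fragment below is not the authors' shader-based framework; it is only a hedged CUDA-style sketch of the weight-update step, assuming a measurement residual has already been computed for each particle. All names and the Gaussian likelihood are placeholders.

    // Illustrative sketch only: particle-filter weight update, one thread per
    // particle. residual[p] is assumed to hold the measurement error of
    // particle p against the current video frame.
    __global__ void update_weights(const float *residual, float *weight,
                                   float inv_two_sigma_sq, int num_particles)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < num_particles) {
            float r = residual[p];
            // Unnormalized Gaussian likelihood; weight normalization and
            // resampling are separate reduction / prefix-sum passes.
            weight[p] = expf(-r * r * inv_two_sigma_sq);
        }
    }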

Matthew Curry
Anthony Skjellum
University of Alabama at Birmingham
Improved LU Decomposition on Graphics Hardware

In November 2005, the Gamma group at the University of North Carolina at Chapel Hill presented LUGPU, a software package that performs LU decomposition on a graphics processing unit (GPU). This software enables a significant speed increase over traditional CPU-targeted implementations, including those provided by ATLAS and LAPACK. However, inefficiencies prevent LUGPU from obtaining higher performance. One major limiting factor is its use of communication between the CPU and GPU during the pivoting phase of computation.  more…
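For orientation, the bulk of the arithmetic in right-looking LU decomposition is the trailing-submatrix (rank-1) update, which maps naturally onto the GPU; the pivot search is the step that forces the CPU-GPU communication mentioned above. The kernel below is a generic illustration of that update (without pivoting), not LUGPU's code or the poster's improvement; all names are assumptions.

    // Illustrative sketch only: rank-1 trailing-submatrix update of
    // right-looking LU decomposition, without pivoting. A is n x n, row-major,
    // and k is the current pivot column. The column of multipliers
    // l(i,k) = a(i,k) / a(k,k) is assumed to have been computed by an earlier
    // kernel and stored in place in column k.
    __global__ void lu_rank1_update(float *A, int n, int k)
    {
        int i = k + 1 + blockIdx.y * blockDim.y + threadIdx.y;  // row index
        int j = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;  // column index
        if (i < n && j < n)
            A[i * n + j] -= A[i * n + k] * A[k * n + j];
    }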

William R. Dieter
Henry G. Dietz
B. Dalton Young
Kungyen Chango
University of Kentucky
Maintaining Accuracy In Large-Scale Computations

Each increase in computational ability enables computers to conquer more difficult problems. Such problems typically increase not only the amount of data, but also the number of operations used to produce each result. Thus, small inaccuracies in individual operations are increasingly likely to compound, yielding results with unacceptably poor accuracy. Over the past few years, our GPU research has centered on methods that allow the high speed and modest precision of GPUs to be utilized without compromising accuracy.  more…
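One family of techniques in this area is error-free transformation, in which the rounding error of each operation is captured and carried along in a second low-order float (so-called double-single or paired-float arithmetic). The snippet below is only a generic illustration of the basic building block, Knuth's two-sum, not the specific method of this poster.

    // Illustrative sketch only: Knuth's error-free two-sum. s = fl(a + b) and
    // *err is the exact rounding error, so (s, *err) together represent a + b
    // exactly. Accumulating long sums as such (hi, lo) pairs keeps compounded
    // rounding error from growing with the number of operations.
    __host__ __device__ float two_sum(float a, float b, float *err)
    {
        float s  = a + b;
        float bb = s - a;                    // part of b actually absorbed into s
        *err = (a - (s - bb)) + (b - bb);    // what was lost to rounding
        return s;
    }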

Zhe Fan
Feng Qiu
Arie Kaufman
Computer Science Department and Center for Visual Computing (CVC)
Stony Brook University
ZippyGPU: Programming Toolkit for General-Purpose Computation on GPU Clusters

A GPU cluster is a distributed-memory architecture, and programming it requires substantial experience in both parallel programming and GPU programming. In particular, the programmer must deal with network communication and CPU-GPU data transfers explicitly. Implementing complex applications that run efficiently on GPU clusters can therefore be tedious and difficult. In this work, we present ZippyGPU, a toolkit that facilitates programming of general-purpose computation on GPU clusters.  more…
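To make the "explicit" data movement concrete, the fragment below shows the kind of hand-written GPU-to-host-to-network hand-off that a toolkit like ZippyGPU aims to hide. It uses only standard CUDA and MPI calls; it is not ZippyGPU's API, and the buffer and function names are placeholders.

    // Illustrative sketch only: exchanging a GPU-resident buffer with a
    // neighboring cluster node by hand, using plain CUDA + MPI.
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_buffer(float *d_buf, float *h_send, float *h_recv,
                         int count, int neighbor, MPI_Comm comm)
    {
        // GPU -> CPU
        cudaMemcpy(h_send, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);

        // CPU <-> CPU across the interconnect
        MPI_Sendrecv(h_send, count, MPI_FLOAT, neighbor, 0,
                     h_recv, count, MPI_FLOAT, neighbor, 0,
                     comm, MPI_STATUS_IGNORE);

        // CPU -> GPU
        cudaMemcpy(d_buf, h_recv, count * sizeof(float), cudaMemcpyHostToDevice);
    }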

Dan Fay (1)
Ali Sazegari (1)
Dan Connors (2)
(1) Apple Computer, Inc.
(2) University of Colorado
A Detailed Study of the Numerical Accuracy of GPU-Implemented Math Functions

Modern programmable GPUs have demonstrated their ability to significantly accelerate important classes of non-graphics applications; however, GPUs’ substandard support for floating-point arithmetic can severely limit their usefulness for general-purpose computing. Current GPUs do not support double-precision computation and their single-precision support glosses over important aspects of the IEEE-754 floating-point standard, such as correctly rounded results and proper closure of the number system. Additionally, numerical consistency needs to exist between different GPU vendors, different GPU software platforms (shader language compiler, driver and operating system), and vendors’ GPU families.  more…
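Accuracy studies of this kind are usually reported in ULPs (units in the last place): the GPU result is compared against a higher-precision CPU reference and the distance is counted in representable single-precision values. The helper below is a generic sketch of such a measurement, not the authors' methodology; it assumes finite inputs, and the function names are placeholders.

    // Illustrative sketch only: counting the ULP distance between a GPU result
    // and a higher-precision CPU reference (e.g., expf() on the GPU vs. exp()
    // in double on the CPU, with the reference rounded to float).
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static long long float_bits_ordered(float f)
    {
        int32_t i;
        memcpy(&i, &f, sizeof i);                     // reinterpret the bits
        // Map the sign-magnitude float ordering onto a monotonic integer line.
        return (i < 0) ? -(long long)(i & 0x7fffffff) : (long long)i;
    }

    long long ulp_distance(float gpu_result, double reference)
    {
        float ref = (float)reference;                 // round reference to float
        return llabs(float_bits_ordered(gpu_result) - float_bits_ordered(ref));
    }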

Takeshi Hakamata
Thomas Caudell
Edward Angel
University of New Mexico
Force-Directed Graph Layout using the GPU

Graph layout has had important applications in many areas of computer science. When dealing with machine-generated data, we often want to visualize the data to better understand its structure or organization, and many such data sets are represented by graphs. By laying out a graph, we can untangle the information and intuitively show the relations among objects.

We present a force-directed layout of a 3D graph using the GPU and show the performance advantages of the GPU over the CPU. The main challenge in implementing the algorithm is representing the graph with texture maps so that the computation is both possible and efficient on the GPU.  more…
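For concreteness, the dominant cost in a force-directed layout is the all-pairs repulsive-force accumulation, which parallelizes one-thread-per-vertex on the GPU. The kernel below is only a plain-array sketch of that step, not the authors' texture-map formulation; spring forces and the integration step would be separate passes, and all names and constants are placeholders.

    // Illustrative sketch only: all-pairs repulsive forces for a 3D
    // force-directed layout, one thread per vertex.
    __global__ void repulsive_forces(const float4 *pos, float4 *force,
                                     int num_vertices, float repulsion)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices) return;

        float4 p = pos[v];
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        for (int u = 0; u < num_vertices; ++u) {
            if (u == v) continue;
            float dx = p.x - pos[u].x;
            float dy = p.y - pos[u].y;
            float dz = p.z - pos[u].z;
            float d2 = dx * dx + dy * dy + dz * dz + 1e-6f;  // avoid division by zero
            float f  = repulsion / d2;                       // Coulomb-style repulsion
            float inv_d = rsqrtf(d2);
            fx += f * dx * inv_d;
            fy += f * dy * inv_d;
            fz += f * dz * inv_d;
        }
        force[v] = make_float4(fx, fy, fz, 0.0f);
    }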

Eric Petit
Sebastien Matz
François Bodin
IRISA-INRIA, University of Rennes
Partitioning Programs for Automatically Exploiting GPU

Because of their high potential computing power, graphics processing units look very attractive for speeding up programs. However, because of their idiosyncrasies they are difficult to program, and data transfers between main memory and the GPU can strongly impact the resulting performance. Until recently, most previous work has considered porting algorithms to the GPU by hand, while other studies have focused on providing better programming tools for GPUs. To our knowledge, however, no work has addressed partitioning C programs for the GPU. A first step toward automatically exploiting the GPU in general-purpose programming is the ability to focus effort on the pieces of code that fit GPU constraints.  more…

Thorolf Tonjum
Nith Hansaparken Bergen (Norway)
A Generic Modelling Framework for Complex Systems Synthesis

This paper aims to show how a unified modelling framework for rapid prototyping and simulation, based on Tunable Excitable Media, can empower a wide range of interdisciplinary fields with the tools needed to perform complex-systems research.

Excitable media is a collection of nodes in which each node is linked to other nodes by a certain linkage scheme. Each node transmits signals, in the form of objects, to its connections. The signals arrive at their destination node, where they serve as the input to the node’s kernel function, and the results of the kernel function are then redistributed back into the media. This creates the basis for iterative systems with complex inter-promulgations, and as such it is an ideal architecture on which to run complex systems.   more…
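As one concrete (and entirely hypothetical) reading of this node-and-signal model, a synchronous update of such a network on a data-parallel processor could look like the sketch below: each node gathers the signals on its incoming links, applies a placeholder threshold "kernel function", and emits the result on the next iteration. This is not the author's framework; the CSR link layout and all names are assumptions.

    // Illustrative sketch only: one synchronous update of a linked node
    // network. Incoming links are stored in compressed-sparse-row form:
    // node n receives signals from in_sources[in_offsets[n] .. in_offsets[n+1]).
    __global__ void update_nodes(const int *in_offsets, const int *in_sources,
                                 const float *signal_in, float *signal_out,
                                 float threshold, int num_nodes)
    {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= num_nodes) return;

        float input = 0.0f;
        for (int k = in_offsets[n]; k < in_offsets[n + 1]; ++k)
            input += signal_in[in_sources[k]];   // gather from linked nodes

        // Placeholder "kernel function": fire on crossing a threshold,
        // otherwise decay toward rest; the result is redistributed next round.
        signal_out[n] = (input > threshold) ? 1.0f : 0.5f * signal_in[n];
    }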

Feng Qiu (1)
Haik Lorenz (2)
Jin Zhou (1)
Zhe Fan (1)
Ye Zhao (1)
Arie Kaufman (1)
Klaus Mueller (1)
(1) Stony Brook University
(2) University of Potsdam, Germany
GPU-based Visual Simulation of Dispersion in Urban Environments

Expeditious response to airborne hazardous releases in urban environments requires knowledge of the dispersion distribution. We have developed an interactive system for simulating and visualizing the dispersion of contaminants in dense urban environments on a GPU and on a GPU cluster.  more…