The complete course notes from the “Parallel Computing for Graphics: Beyond Programmable Shading” SIGGRAPH Asia 2008 course are available online. The course gives an introduction to parallel programming architectures and environments for interactive graphics and explores case studies of combining traditional rendering API usage with advanced parallel computation from game developers, researchers, and graphics hardware vendors. There are strong indications that the future of interactive graphics involves a programming model more flexible than today’s OpenGL and Direct3D pipelines. As such, graphics developers need a basic understanding of how to combine emerging parallel programming techniques with the traditional interactive rendering pipeline. This course gives an introduction to several parallel graphics architectures and programming environments, and introduces the new types of graphics algorithms that will be possible. The case studies in the course discuss the mix of parallel programming constructs used, details of the graphics algorithms, and how the rendering pipeline and computation interact to achieve the technical goals. The course speakers are Jason Yang and Justin Hensley (AMD), Tim Foley (Intel), Mark Harris (NVIDIA), Kun Zhou (Zhejiang University), Anjul Patney (UC Davis), Pedro Sander (HKUST), and Christopher Oat (AMD). (Complete course notes)
DECEMBER 19, 2008 — NVIDIA has announced the availability of version 2.1 beta of its CUDA toolkit and SDK. This is the latest version of the C compiler and software development tools for accessing the massively parallel CUDA compute architecture of NVIDIA GPUs. In response to overwhelming demand from the developer community, this latest version of the CUDA software suite includes support for NVIDIA® Tesla™ GPUs on Windows Vista and 32-bit debugger support for CUDA on Red Hat Enterprise Linux 5.x (separate download).
The CUDA Toolkit and SDK 2.1 beta includes support for Visual Studio 2008 on Windows XP and Vista, and Just-In-Time (JIT) compilation for applications that dynamically generate CUDA kernels. Several new interoperability APIs have been added for Direct3D 9 and Direct3D 10 that accelerate communication with DirectX applications, along with a series of improvements to OpenGL interoperability.
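The JIT path lets an application build a kernel from source generated at run time, for example with a constant specialized into the code. A minimal sketch of that pattern, using Python string templating and `compile()`/`exec()` in place of the CUDA driver API’s JIT compiler (the function names here are hypothetical, purely for illustration):

```python
# Sketch: generate specialized "kernel" source at run time and compile
# it on the fly -- the pattern CUDA 2.1's JIT support enables, simulated
# here with Python's compile()/exec() instead of the CUDA driver API.

KERNEL_TEMPLATE = """
def saxpy(x, y):
    # alpha is baked into the generated source, like a JIT-specialized constant
    return [{alpha} * xi + yi for xi, yi in zip(x, y)]
"""

def build_kernel(alpha):
    source = KERNEL_TEMPLATE.format(alpha=alpha)
    namespace = {}
    exec(compile(source, "<generated-kernel>", "exec"), namespace)
    return namespace["saxpy"]

saxpy2 = build_kernel(2.0)
print(saxpy2([1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
```

Specializing constants into generated source at run time is exactly the kind of use case that benefits from JIT compilation, since the compiler can fold them directly into the kernel.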
CUDA Toolkit and SDK 2.1 beta also features support for using a GPU that is not driving a display on Vista, a beta of Linux Profiler 1.1 (separate download), as well as support for recent releases of Linux including Fedora 9, OpenSUSE 11, and Ubuntu 8.04.
CUDA Toolkit and SDK 2.1 beta is available today for free download from www.nvidia.com/object/cuda_get.
Equalizer Graphics has announced the release of Equalizer 0.6, a major advance in parallel OpenGL rendering. Equalizer is middleware for creating parallel OpenGL-based applications, including GPGPU applications. It enables applications to benefit from multiple graphics cards, processors, and computers to scale rendering performance, visual quality, and display size. Equalizer 0.6 adds support for automatic load balancing for 2D and DB decompositions, DPlex (time-multiplex) compounds, and a Paracomp compositing backend. See the release notes on the Equalizer website for a comprehensive list of new features, enhancements, optimizations, and bug fixes.
This paper aims to bridge the gap between the lack of synchronization mechanisms in recent graphics processor (GPU) architectures and the need for synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, the authors construct strong synchronization objects, such as wait-free and t-resilient read-modify-write objects, for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the new wait-free objects have time complexity O(N), where N is the number of concurrent processes. The wait-free objects have space complexity O(N^2), which is optimal. Their result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need for strong synchronization primitives in hardware, and that wait-free programming is possible for GPUs.
(Wait-free programming for general purpose computations on graphics processors. Phuong Hoai Ha, Philippas Tsigas, and Otto J. Anshus. ACM Symposium on Principles of Distributed Computing, 2008.)
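To give a flavor of CAS-free wait-free synchronization (this is a much simpler construction than the paper’s, shown only as an illustration): an increment-only counter can be built from single-writer registers, where each of N processes writes only its own slot, so no operation ever blocks on another, and a read collects all N slots in O(N) steps.

```python
# Illustrative only: a wait-free increment counter built from
# single-writer registers, requiring no test-and-set or
# compare-and-swap. Each of N processes writes only its own slot,
# so no operation ever waits on another; a read is an O(N) collect.

class WaitFreeCounter:
    def __init__(self, n_processes):
        self.slots = [0] * n_processes  # slot i is written only by process i

    def increment(self, pid):
        # Single-writer update: no hardware synchronization primitive needed.
        self.slots[pid] += 1

    def read(self):
        # O(N) collect over all per-process slots.
        return sum(self.slots)

c = WaitFreeCounter(4)
c.increment(0); c.increment(0); c.increment(3)
print(c.read())  # 3
```

The paper’s read-modify-write objects are far more general than this counter, but the same principle applies: per-process single-writer state sidesteps the need for strong atomic primitives.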
The complete course notes from the “Beyond Programmable Shading” SIGGRAPH 2008 course are available online. The course gives an introduction to parallel programming architectures and environments for interactive graphics and explores case studies of combining traditional rendering API usage with advanced parallel computation from game developers, researchers, and graphics hardware vendors. There are strong indications that the future of interactive graphics involves a programming model more flexible than today’s OpenGL and Direct3D pipelines. As such, graphics developers need a basic understanding of how to combine emerging parallel programming techniques with the traditional interactive rendering pipeline. This course gives an introduction to several parallel graphics architectures and programming environments, and introduces the new types of graphics algorithms that will be possible. The case studies in the course discuss the mix of parallel programming constructs used, details of the graphics algorithms, and how the rendering pipeline and computation interact to achieve the technical goals. The course organizers are Aaron Lefohn (Intel) and Mike Houston (AMD). Additional course speakers include Kayvon Fatahalian (Stanford), David Luebke (NVIDIA), Tom Forsyth (Intel), John Owens (UC Davis), Chas Boyd (Microsoft), Aaftab Munshi (Apple), Fabio Pellacini (Dartmouth), Jon Olick (Id Software), Matt Pharr (Intel), and Jeremy Shopf (AMD). (Complete course notes)
As the computing power of various platforms intended for games and similar applications increases rapidly, they attract the interest of professionals in the HPC community. As an example, modern graphics processing units (GPUs) are often used for HPC in GPGPU. Another example is the Cell Broadband Engine of the PlayStation 3 (PS3), which has a multicore architecture that lends itself to HPC. These platforms are not conventional HPC platforms; nonetheless they are used for HPC purposes, and clusters of such computing resources are being built with great success. Both the computing power and the low cost compared to conventional HPC resources make them very interesting. The aim of this workshop is to focus on such unconventional resources for HPC. Only imagination sets the limit for the kinds of devices that can be used for HPC and even be combined to form clusters. (UCHPC ’09 Website, Call for Papers)
The Khronos™ Group today announced the ratification and public release of the OpenCL™ 1.0 specification, the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software. Proposed six months ago as a draft specification by Apple, OpenCL has been developed and ratified by industry-leading companies including 3DLABS, Activision Blizzard, AMD, Apple, ARM, Barco, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, HI, IBM, Intel Corporation, Imagination Technologies, Kestrel Institute, Motorola, Movidia, Nokia, NVIDIA, QNX, RapidMind, Samsung, Seaweed, TAKUMI, Texas Instruments and Umeå University. The OpenCL 1.0 specification and more details are available at http://www.khronos.org/opencl/
At Khronos “Developer University” today at SIGGRAPH Asia in Singapore, Khronos members publicly launched OpenCL 1.0 with a presentation of the specification and source code examples.
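At the heart of OpenCL is a data-parallel execution model: a kernel is run once per work-item over an index space (NDRange), with each work-item identified by its global ID. A minimal sketch of that model, with a plain Python loop standing in for the runtime’s parallel dispatch (function names are illustrative, not the OpenCL API):

```python
# Sketch of OpenCL's data-parallel execution model: a "kernel" runs
# once per work-item, each identified by its global id. A sequential
# loop stands in for the runtime's parallel NDRange dispatch but
# preserves the same semantics.

def vec_add_kernel(gid, a, b, out):
    # Corresponds to an OpenCL C kernel body:
    #   out[get_global_id(0)] = a[get_global_id(0)] + b[get_global_id(0)];
    out[gid] = a[gid] + b[gid]

def enqueue_nd_range(kernel, global_size, *args):
    # The real runtime distributes work-items across compute units;
    # here they simply execute in index order.
    for gid in range(global_size):
        kernel(gid, *args)

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
out = [0] * 4
enqueue_nd_range(vec_add_kernel, 4, a, b, out)
print(out)  # [11, 22, 33, 44]
```

Because each work-item touches only its own index, the runtime is free to execute them in any order or fully in parallel, which is what lets the same kernel scale across CPUs, GPUs, and other processors.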
The new gDEBugger V4.4 adds in-depth analysis of OpenGL memory usage by tracking graphics memory allocated objects, their memory consumption and allocation call stacks. Also new in this version are graphics memory leak detection and the ability to break on them.
Using these new features will enable OpenGL and OpenGL ES developers to optimize their applications’ memory consumption and improve overall application performance.
gDEBugger, an OpenGL and OpenGL ES debugger and profiler, traces application activity on top of the OpenGL API, lets programmers see what is happening within the graphics system implementation to find bugs and optimize OpenGL application performance. gDEBugger runs on Windows and Linux operating systems. (Graphic Remedy Website)
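The bookkeeping behind this kind of memory analysis is straightforward to sketch: record each live object together with the call stack that allocated it, so a leak report can point back at the allocating code path. This is only a conceptual sketch in Python with hypothetical names, not gDEBugger’s implementation:

```python
# Sketch of allocation tracking: record each live object together with
# the call stack that allocated it, so a leak report can point at the
# allocating code path. (Conceptual only; names are hypothetical.)

import traceback

class AllocTracker:
    def __init__(self):
        self.live = {}  # object id -> (size, allocation call stack)

    def on_alloc(self, obj_id, size):
        stack = traceback.format_stack()[:-1]  # drop this frame
        self.live[obj_id] = (size, stack)

    def on_free(self, obj_id):
        self.live.pop(obj_id, None)

    def leak_report(self):
        # Anything still live at teardown is a candidate leak.
        return {oid: size for oid, (size, _) in self.live.items()}

t = AllocTracker()
t.on_alloc("tex1", 1024)
t.on_alloc("buf1", 4096)
t.on_free("tex1")
print(t.leak_report())  # {'buf1': 4096}
```

Breaking on a leak then amounts to checking the report at a chosen point and halting in the debugger with the stored allocation stack in hand.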
At SC08, Aggregate.Org/University of Kentucky demonstrated open source technology for running arbitrary MIMD programs directly on GPUs. There are two environments for MOG: a simulator that interprets the MIMD code, and a “Meta-State Converter” compilation system that performs state-space transformation of MIMD code into pure (SIMD) native GPU code. With the current version of either environment, MIMD C code using shared-memory communication can perform recursion and similar operations while running on a CUDA GPU. Support for both C and Fortran, with both shared memory and MPI for communication, and support for both NVIDIA CUDA and ATI CAL targets, is planned. The work is very new, but detailed publications, performance benchmarks, and code releases are expected to start appearing by early next year. (MOG at SC08)
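The core trick in running MIMD code on SIMD hardware is to give every “thread” its own program counter and, on each lockstep pass, execute a given state only for the threads currently at that state. A toy interpreter sketching that idea (MOG’s Meta-State Converter compiles this machinery away; this simulation is illustrative only):

```python
# Sketch of MIMD-on-SIMD interpretation: every "thread" carries its
# own program counter (pc), and each lockstep pass executes a state
# only for the threads currently at that state. Threads that have
# diverged to different states simply sit out non-matching passes.

def simd_step(threads, state, action):
    # Lockstep: only threads whose pc matches `state` execute.
    for t in threads:
        if t["pc"] == state:
            action(t)

def run(threads, program):
    n_states = len(program)
    # Sweep the states until every thread has halted (pc past the end).
    while any(t["pc"] < n_states for t in threads):
        for state, action in enumerate(program):
            simd_step(threads, state, action)

program = [
    lambda t: t.update(x=t["x"] * 2, pc=1),  # state 0: double, go to state 1
    lambda t: t.update(x=t["x"] + 1, pc=2),  # state 1: add one, then halt
]
threads = [{"pc": 0, "x": 3}, {"pc": 1, "x": 10}]  # threads start diverged
run(threads, program)
print([t["x"] for t in threads])  # [7, 11]
```

In the toy run above, the two threads begin at different states; each state sweep advances only the matching threads, which is the essence of emulating MIMD control flow on SIMD hardware.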
This is a GPGPU event a long time in the making. Since the advent of general-purpose APIs and compilers for GPUs, it has been predicted that GPUs would one day be used to help boost the performance of supercomputers. With the latest release of the Top500 supercomputer list, that prediction has become a reality.
More details from an NVIDIA press release:
NVIDIA Tesla Powers 29th Most Powerful Supercomputer in the World
SC08—AUSTIN, TX—NOVEMBER 17, 2008—The Tokyo Institute of Technology (Tokyo Tech) today announced a collaboration with NVIDIA to use NVIDIA® Tesla™ GPUs to boost the computational horsepower of its TSUBAME supercomputer. Through the addition of 170 Tesla S1070 1U systems, the TSUBAME supercomputer now delivers nearly 170 TFLOPS of theoretical peak performance, as well as 77.48 TFLOPS of measured Linpack performance, placing it, again, amongst the top ranks in the world’s Top 500 Supercomputers.
“Tokyo Tech is constantly investigating future computing platforms and it had become clear to us that to make the next major leap in performance, TSUBAME had to adopt GPU computing technologies,” said Satoshi Matsuoka, division director of the Global Scientific Information and Computing Center at Tokyo Tech. “In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude – a tremendous competitive boost for our scientists and engineers in reducing their time to solution.”
Speaking to the ease of implementation, Matsuoka continued,
“The entire upgrade was carried out in 1 week, and the TSUBAME supercomputer remained live throughout. This is an unprecedented feat in top-level supercomputing.”