In this ClusterMonkey article, Andrew Humber, Senior PR Manager for Tesla and CUDA Technologies at NVIDIA Corporation, summarizes the events that made 2008 a truly exciting year for GPU Computing. (A Year in Review from the NVIDIA Tesla Team, ClusterMonkey)
In this paper, the authors present a library, named Sapporo, that closely emulates the GRAPE-6 API. The library is written in CUDA and implements the most common functions used in N-body codes that support GRAPE-6; as a result, such codes can use Sapporo without modification to their source code. The library also supports the use of multiple GPUs per host. The authors carried out a series of systematic tests to assess the library's performance, accuracy, and ability to handle a realistic N-body problem. They found that the performance of the library with a single G80/G92 GPU is a factor of two higher than that of GRAPE-6A(BLX) PCI(X) cards, and that the sustained performance with two GeForce 9800GX2 cards is on par with a 32-chip GRAPE-6 system (about 800 GFlop/s). The accuracy of the library is comparable to that of GRAPE-6 hardware, and its ability to correctly solve a realistic N-body problem makes it a viable alternative to GRAPE-6 special-purpose hardware.
(Evghenii Gaburov, Stefan Harfst and Simon Portegies Zwart, SAPPORO: A way to turn your graphics cards into a GRAPE-6, Submitted to New Astronomy)
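The workload that GRAPE-6 hardware, and Sapporo on a GPU, accelerates is the all-pairs gravitational force evaluation at the heart of direct N-body codes. The sketch below is a plain-Python illustration of that O(N²) kernel, with Plummer softening; it is not Sapporo's API, just the computation being offloaded (function and parameter names are illustrative):

```python
import math

def accelerations(pos, mass, eps=1.0e-4):
    """Direct-summation gravitational accelerations (units with G = 1).

    A sequential sketch of the O(N^2) pairwise force sum that GRAPE-6
    hardware -- and Sapporo on a GPU -- evaluates in parallel.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dz = pos[j][2] - pos[i][2]
            # Plummer softening eps avoids the singularity as r -> 0.
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            acc[i][0] += mass[j] * dx * inv_r3
            acc[i][1] += mass[j] * dy * inv_r3
            acc[i][2] += mass[j] * dz * inv_r3
    return acc
```

Every particle's sum over all other particles is independent, which is why the kernel maps so naturally onto both special-purpose pipelines and data-parallel GPUs.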
This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the “DCGN” API on NVIDIA GPUs, which is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. To facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate where this overhead accumulates. We conclude that, with innovations in chipsets and drivers, this overhead can be mitigated, yielding performance similar to typical CPU-based MPI implementations while providing fully dynamic communication.
(Jeff A. Stuart and John D. Owens, Message Passing on Data-Parallel Architectures, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium)
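The sleep-based polling idea can be pictured with a toy mailbox: one side posts messages into shared storage, and the receiving side repeatedly checks for them, sleeping between checks rather than blocking. This is a hypothetical CPU-side illustration of the polling pattern, not DCGN's actual API (class and method names are invented):

```python
import threading
import time

class PollingMailbox:
    """Toy mailbox illustrating sleep-based polling: a receiver
    repeatedly checks a shared slot for messages addressed to its
    rank, sleeping between polls instead of blocking on a signal."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}  # destination rank -> list of pending messages

    def send(self, dest, msg):
        # Post a message for the given rank.
        with self._lock:
            self._slots.setdefault(dest, []).append(msg)

    def recv(self, rank, poll_interval=0.001, timeout=1.0):
        # Poll until a message arrives or the timeout expires.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                queue = self._slots.get(rank)
                if queue:
                    return queue.pop(0)
            time.sleep(poll_interval)  # sleep between polls
        raise TimeoutError("no message for rank %d" % rank)
```

The sleep interval is the knob behind the paper's overhead discussion: polling too often burns cycles, polling too rarely adds latency to every message.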
This article by Jeff Layton at ClusterMonkey summarizes the history of GPU Computing in terms of high-level programming languages and abstractions, from the early days of GPGPU programming using graphics APIs, to Stream, CUDA and OpenCL. The second half of the article provides an introduction to the PGI 8.0 Technology Preview, which allows the use of pragmas to automatically parallelize and run compute-intensive kernels in standard C and Fortran code on accelerators like GPUs. (GPU Programming For the Rest Of Us, Jeff Layton, ClusterMonkey.net)
This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009. To appear.)
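The scan-centric structure of the radix sort can be seen even in a sequential sketch: each pass counts keys per digit, takes an exclusive prefix sum (scan) of the counts to find each bucket's starting offset, and scatters keys stably into place. The GPU version performs the count and scan in parallel; this plain-Python version only illustrates the algorithmic skeleton, not the authors' CUDA code:

```python
def radix_sort(keys, bits=4):
    """LSD radix sort over non-negative integers, bits per pass.

    Each pass buckets keys by one digit using a count followed by an
    exclusive prefix sum (scan) -- the primitive the GPU implementation
    parallelizes -- then scatters keys stably into their buckets.
    """
    radix = 1 << bits
    max_key = max(keys, default=0)
    shift = 0
    while (max_key >> shift) > 0:
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & (radix - 1)] += 1
        # Exclusive prefix sum: starting offset of each digit's bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        out = [0] * len(keys)
        for k in keys:  # stable scatter into bucket positions
            d = (k >> shift) & (radix - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
        shift += bits
    return keys
```

Stability of the scatter is what lets successive digit passes compose into a full sort; on the GPU the same property must be preserved across thousands of threads, which is where the careful decomposition described in the paper comes in.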
The new High-Performance Graphics Conference is the synthesis of two highly-successful conference series:
- Graphics Hardware, an annual conference focusing on graphics hardware, architecture, and systems since 1986, and
- Interactive Ray Tracing, an innovative conference series focusing on the emerging field of interactive ray tracing since 2006.
By combining these two conferences, High-Performance Graphics aims to bring to authors and attendees the best of both, while extending the scope of the new conference to cover the overarching field of performance-oriented graphics systems, spanning innovative algorithms, efficient implementations, and hardware architecture. This broader focus offers a common forum that brings together researchers, engineers, and architects to discuss the complex interactions of massively parallel hardware, novel programming models, efficient graphics algorithms, and innovative applications.
Paper submissions are due April 30th. For more information see the High-Performance Graphics Website.
Alexander Heusel of the University of Frankfurt has released open source Java bindings for CUDA. The current project state is alpha, with support for the CUDA driver API; support for the CUBLAS and CUFFT libraries is pending. Contributions are welcome. For more information see the project website: http://jacuzzi.sourceforge.net
To be held March 30-31, 2009 in Berkeley, California, HotPar ’09 will bring together researchers and practitioners doing innovative work in the area of parallel computing. HotPar recognizes the broad impact of multicore computing and seeks relevant contributions from all fields, including application design, languages and compilers, systems, and architecture. (http://www.usenix.org/events/hotpar09/)
The new gDEBugger V4.5 adds the ability to view texture MIP-map levels. Each MIP-map level’s parameters and data (as an image or raw data) can be displayed in the gDEBugger Textures and Buffers viewer; browse the different MIP-map levels using the Texture MIP-map Level slider. gDEBugger V4.5 also introduces support for 1D and 2D texture arrays. The new Textures and Buffers viewer Texture Layer slider enables viewing the contents of different texture layers. This version also introduces notable performance and stability improvements.
gDEBugger, an OpenGL and OpenGL ES debugger and profiler, traces application activity on top of the OpenGL API and lets programmers see what is happening within the graphics system implementation to find bugs and optimize OpenGL application performance. gDEBugger runs on Windows and Linux operating systems, and is currently in Beta phase on Mac OS X.
OpenMM is a freely downloadable, high performance, extensible library that allows molecular dynamics (MD) simulations to run on high performance computer architectures, such as graphics processing units (GPUs). In some cases, speedups of 100 times over CPU execution were achieved by running OpenMM on GPUs in desktop PCs. The new release includes a version of the widely used MD package GROMACS that integrates the OpenMM library, enabling acceleration on high-end NVIDIA and AMD/ATI GPUs. OpenMM is a collaborative project between Vijay Pande’s lab at Stanford University and Simbios, the National Center for Physics-based Simulation of Biological Structures at Stanford, which is supported by the National Institutes of Health. For more information on OpenMM, go to http://simtk.org/home/openmm. (Full press release.)