Brook+, AMD’s extension of the BrookGPU programming environment, has been released in full source code on SourceForge. Brook+ provides ATI CAL and x86 CPU backends, and allows developers to program GPUs in a C-like stream computing language.
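The core idea of stream computing is that a kernel is a pure function applied independently to each element of its input streams, which is what lets the runtime parallelize it. The following is a minimal Python analogue of that model (Brook+ itself is a C-like language, and all names here are illustrative, not Brook+ syntax):

```python
def stream_kernel(fn):
    """A 'kernel' is applied independently to each element of its input
    streams; that independence is what a stream runtime parallelizes."""
    def run(*streams):
        return [fn(*elems) for elems in zip(*streams)]
    return run

@stream_kernel
def saxpy(a, x, y):
    # Each output element depends only on the corresponding inputs.
    return a * x + y

result = saxpy([2.0] * 4, [1.0, 2.0, 3.0, 4.0], [0.5] * 4)
```

In Brook+ the same kernel would be written in the C-like kernel language and the runtime would map the elementwise applications onto the GPU’s parallel units.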
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multicore architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. StreamIt graphs expose task, data and pipeline parallelism, all of which can be exploited on modern Graphics Processing Units (GPUs), whose hardware supports abundant parallelism.
In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem, covering both the scheduling and the assignment of filters to processors, as an Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available on GPUs. The proposed schedule uses the scalar units of the GPU to exploit data parallelism, and its multiprocessors to exploit task and pipeline parallelism. Further, it takes the synchronization and bandwidth limitations of GPUs into consideration, yielding speedups between 1.87x and 36.83x over a single-threaded CPU.
(Abhishek Udupa, R. Govindarajan, Matthew J. Thazhuthaveetil: Software Pipelined Execution of Stream Programs on GPUs, International Symposium on Code Generation and Optimization 2009 (CGO 2009), pages 200–209. DOI 10.1109/CGO.2009.20)
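The StreamIt model described above can be sketched in miniature: filters pop a fixed number of items per firing and push results downstream. This Python sketch uses hypothetical names (it is not StreamIt syntax or the paper’s scheduler) purely to illustrate the filter/channel structure:

```python
def filter_stage(work, pop=1):
    """A StreamIt-style filter: pops `pop` items per firing, pushes results.
    Firings within one stage are independent (data parallelism); distinct
    stages can run concurrently on different cores (pipeline parallelism)."""
    def stage(stream):
        out = []
        for i in range(0, len(stream) - len(stream) % pop, pop):
            out.extend(work(stream[i:i + pop]))
        return out
    return stage

# A tiny two-filter pipeline: scale each sample, then sum adjacent pairs.
scale = filter_stage(lambda xs: [2 * xs[0]], pop=1)
pair_sum = filter_stage(lambda xs: [xs[0] + xs[1]], pop=2)

def pipeline(stream, stages):
    for stage in stages:
        stream = stage(stream)
    return stream

out = pipeline([1, 2, 3, 4], [scale, pair_sum])
```

Here the pipeline runs sequentially; the paper’s contribution is precisely the ILP-based assignment and software-pipelined schedule that overlaps such stages across a GPU’s multiprocessors.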
R is a popular open-source environment for statistical computing, widely used in many application domains. The ongoing R+GPU project is devoted to moving frequently used R functions, mostly functions used in biomedical research, to the GPU using CUDA. If a CUDA-compatible GPU and driver are present on a user’s machine, the user need only prefix “gpu” to the original function name to take advantage of the GPU implementation of the corresponding R function.
Speedup measurements of the current implementation range as high as 80x, and contributions to the code base are cordially invited. R+GPU is developed at the University of Michigan’s Molecular and Behavioral Neuroscience Institute.
F2C-ACC is a language translator to convert codes from Fortran into C and C for CUDA. The goal of this project is to reduce the time to convert and adapt existing large-scale Fortran applications to run on CUDA-accelerated clusters, and to reduce the effort to maintain both Fortran and CUDA implementations. Both translations are useful: C can be used for testing and as a base code for running on the IBM Cell processor, and the generated C for CUDA code serves as a basis for running on the GPU. The current implementation does not support all language constructs yet, but the generated human-readable code can be used as a starting point for further manual adaptations and optimizations.
F2C-ACC is developed by Mark Govett et al. at the NOAA Earth System Research Laboratory, and has been presented at the Path to Petascale NCSA/UIUC workshop on applications for accelerators and accelerator clusters.
This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the “DCGN” API on NVIDIA GPUs; it is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. To facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, between one and five percent depending on the application, and we indicate where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated, yielding performance similar to typical CPU-based MPI implementations while providing fully dynamic communication.
(Jeff A. Stuart and John D. Owens, Message Passing on Data-Parallel Architectures, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009))
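The sleep-based polling system mentioned in the abstract can be illustrated with a small sketch: a shared message store that a receiver polls, sleeping between checks, rather than blocking on a condition variable. This is a loose CPU-side analogue using hypothetical names, not the DCGN API:

```python
import threading
import time
from collections import defaultdict, deque

class Mailbox:
    """Sleep-based polling message store, loosely in the spirit of DCGN's
    communication scheme (all names here are illustrative)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.queues = defaultdict(deque)  # one queue per destination rank

    def send(self, dest_rank, msg):
        with self.lock:
            self.queues[dest_rank].append(msg)

    def recv(self, rank, poll_interval=0.001):
        # Poll with a short sleep instead of blocking: on a GPU, kernels
        # cannot block on host-side synchronization primitives.
        while True:
            with self.lock:
                if self.queues[rank]:
                    return self.queues[rank].popleft()
            time.sleep(poll_interval)

box = Mailbox()
received = []
worker = threading.Thread(target=lambda: received.append(box.recv(rank=1)))
worker.start()
box.send(dest_rank=1, msg="hello from rank 0")
worker.join()
```

The polling interval is the knob behind the one-to-five-percent overhead the paper measures: shorter intervals reduce latency but burn cycles checking empty queues.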
This article by Jeff Layton at ClusterMonkey summarizes the history of GPU Computing in terms of high-level programming languages and abstractions, from the early days of GPGPU programming using graphics APIs, to Stream, CUDA and OpenCL. The second half of the article provides an introduction to the PGI 8.0 Technology Preview, which allows the use of pragmas to automatically parallelize and run compute-intensive kernels in standard C and Fortran code on accelerators like GPUs. (GPU Programming For the Rest Of Us, Jeff Layton, ClusterMonkey.net)
December 19, 2008: NVIDIA has announced the availability of version 2.1 beta of its CUDA toolkit and SDK. This is the latest version of the C compiler and software development tools for accessing the massively parallel CUDA compute architecture of NVIDIA GPUs. In response to overwhelming demand from the developer community, this latest version of the CUDA software suite includes support for NVIDIA® Tesla™ GPUs on Windows Vista and 32-bit debugger support for CUDA on Red Hat Enterprise Linux 5.x (separate download).
The CUDA Toolkit and SDK 2.1 beta includes support for Visual Studio 2008 on Windows XP and Vista, and Just-In-Time (JIT) compilation for applications that dynamically generate CUDA kernels. Several new interoperability APIs have been added for Direct3D 9 and Direct3D 10 that accelerate communication with DirectX applications, as well as a series of improvements to OpenGL interoperability.
CUDA Toolkit and SDK 2.1 beta also features support for using a GPU that is not driving a display on Vista, a beta of the Linux Profiler 1.1 (separate download), as well as support for recent Linux releases including Fedora 9, openSUSE 11 and Ubuntu 8.04.
CUDA Toolkit and SDK 2.1 beta is available today for free download from www.nvidia.com/object/cuda_get.
The Khronos™ Group today announced the ratification and public release of the OpenCL™ 1.0 specification, the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software. Proposed six months ago as a draft specification by Apple, OpenCL has been developed and ratified by industry-leading companies including 3DLABS, Activision Blizzard, AMD, Apple, ARM, Barco, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, HI, IBM, Intel Corporation, Imagination Technologies, Kestrel Institute, Motorola, Movidia, Nokia, NVIDIA, QNX, RapidMind, Samsung, Seaweed, TAKUMI, Texas Instruments and Umeå University. The OpenCL 1.0 specification and more details are available at http://www.khronos.org/opencl/
At Khronos “Developer University” today at SIGGRAPH Asia in Singapore, Khronos members publicly launched OpenCL 1.0 with a presentation of the specification and source code examples.
This Ph.D. thesis by Jansen describes a GPGPU development system that is embedded in the C++ programming language using ad-hoc polymorphism (i.e., operator overloading). While this technique is already known from the Sh library and the RapidMind Development Platform, GPU++ uses a more generic class interface and requires no knowledge of GPU programming at all. Furthermore, there is no separation between the different computation units of the CPU and GPU – the appropriate computation unit is automatically chosen by the GPU++ system using several optimization algorithms. (“GPU++: An Embedded GPU Development System for General-Purpose Computations”. Thomas Jansen. Ph.D. Thesis, University of Munich, Germany.)
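The embedding trick behind GPU++, Sh and RapidMind is that overloaded operators do not compute anything; they record an expression tree that a backend can later compile for the GPU. A minimal sketch of that idea, in Python rather than C++ and with invented names:

```python
class Expr:
    """Expression captured via operator overloading (ad-hoc polymorphism).
    Overloaded operators build a tree instead of computing; a GPU backend
    could traverse this tree and emit device code. Names are illustrative,
    not the GPU++ class interface."""
    def __init__(self, op=None, args=(), value=None):
        self.op, self.args, self.value = op, args, value

    def __add__(self, other):
        return Expr("+", (self, other))

    def __mul__(self, other):
        return Expr("*", (self, other))

    def evaluate(self, env):
        """A trivial CPU 'backend': interpret the captured tree."""
        if self.op is None:                      # leaf: variable or constant
            return env.get(self.value, self.value)
        a, b = (arg.evaluate(env) for arg in self.args)
        return a + b if self.op == "+" else a * b

x, y = Expr(value="x"), Expr(value="y")
expr = x * y + x            # builds a tree; nothing is computed yet
result = expr.evaluate({"x": 3, "y": 4})
```

Because the tree is available before execution, the system can analyze the whole computation and decide where to run it, which is how GPU++ can choose the computation unit automatically.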
From the introduction: “Processor architecture is evolving towards more software-exposed parallelism through two features: more cores and wider SIMD ISAs. At the same time, graphics processors (GPUs) are gradually adding more general-purpose programming features. Several software development challenges arise from these trends. First, how do we mitigate the increased software development complexity that comes with exposing parallelism to the developer? Second, how do we provide portability across (increasing) core counts and SIMD ISAs? Ct is a deterministic parallel programming model intended to leverage the best features of emerging general-purpose GPU (GPGPU) programming models while fully exploiting CPU flexibility. A key distinction of Ct is that it comprises a top-down design of a complete data-parallel programming model, rather than being driven bottom-up by architectural limitations, a flaw in many GPGPU programming models.” (Flexible Parallel Programming for Terascale Architectures with Ct)
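A data-parallel model like Ct’s is deterministic because its collection operators are defined by their results, not by any execution order. The toy sketch below illustrates that style in Python; the class and method names are invented for illustration (Ct’s actual vector type, TVEC, has a much richer operator set):

```python
class Vec:
    """A toy deterministic data-parallel vector, loosely in the spirit of
    Ct-style collection operators (names are illustrative, not the Ct API)."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Elementwise application: each result is independent of the
        # others, so any parallel execution order gives the same answer.
        return Vec(fn(x) for x in self.data)

    def zip_with(self, other, fn):
        return Vec(fn(a, b) for a, b in zip(self.data, other.data))

    def reduce(self, fn, init):
        # Defined as a left-to-right fold; a backend may parallelize it
        # when fn is associative without changing the specified result.
        acc = init
        for x in self.data:
            acc = fn(acc, x)
        return acc

a = Vec([1, 2, 3, 4])
b = Vec([10, 20, 30, 40])
dot = a.zip_with(b, lambda x, y: x * y).reduce(lambda s, x: s + x, 0)
```

Specifying programs in terms of such operators, rather than threads and locks, is what lets the same source retarget different core counts and SIMD widths, which is the portability goal the introduction raises.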