ATI Stream SDK v2.2 w/ OpenCL 1.1 Support released

August 22nd, 2010

Version 2.2 of the ATI Stream SDK has been released. Features include:

  • Support for OpenCL™ 1.1 specification.
  • Support for Ubuntu® 10.04 and Red Hat® Enterprise Linux® 5.5.
  • Support for X86 CPUs with SSE2.x or later (adds to existing support for X86 CPUs with SSE3.x or later).
  • Support for Microsoft® Visual Studio® 2010 Professional Edition and Minimalist GNU for Windows (MinGW) [GCC 4.4].
  • Support for GNU Compiler Collection (GCC) 4.1 or later on Linux® systems (adds to existing support for GCC 4.3 or later).
  • Support for single-channel OpenCL™ image format.
  • Support for OpenCL™ / DirectX® 10 interoperability.
  • Support for additional double-precision floating point routines in OpenCL™ C kernels.
  • Support for generating and loading binary OpenCL™ kernels.
  • Support for native OpenCL™ kernels.
  • Preview Feature: Support for accessing additional physical memory on the GPU from OpenCL™ applications.
  • Preview Feature: Support for printf() in OpenCL™ C kernels.
  • Extension: Support for additional event states when registering event callbacks in OpenCL™ 1.1.
  • Additional OpenCL™ samples.
  • Package Update: ATI Stream Profiler 1.4.
  • Various OpenCL™ compiler and runtime fixes and enhancements.
  • Expanded OpenCL™ performance optimization guidelines in the ATI Stream SDK OpenCL™ Programming Guide.

The SDK and all documentation are available for download.

CUVI Lib – CUDA for Vision and Imaging Library Launched

August 1st, 2010

TunaCode has announced the release of CUVI Lib v0.3 (beta) for 32- and 64-bit Windows systems. A copy is available for download.

CUVI Lib (CUDA for Vision and Imaging Lib) is an add-on library for NPP (NVIDIA Performance Primitives) and includes several advanced computer vision and image processing functions presently not available in NPP. This version of CUVI Lib supports, among others:

  • Optical Flow (Horn & Schunck)
  • Optical Flow (Lucas & Kanade)
  • Discrete Wavelet Transform (Forward and Inverse)
  • Hough Transform
  • Hough Lines (Lines Detector)
  • Color Conversion (RGB-to-gray and RGBA-to-gray)

Several more advanced features will be added to CUVI Lib in upcoming releases. A detailed function reference is available for download, and forums are open for feedback and further ideas.

Swarm-NG: integration of an ensemble of N-body systems

July 29th, 2010

The Swarm-NG package helps scientists and engineers harness the power of GPUs. In the early releases, Swarm-NG will focus on the integration of an ensemble of N-body systems evolving under Newtonian gravity. Swarm-NG does not replicate existing libraries that calculate forces for large-N systems on GPUs, but rather focuses on integrating an ensemble of many systems where N is small. This is of particular interest for astronomers who study the chaotic evolution of planetary systems. In the long term, we hope Swarm-NG will allow for the efficient parallel integration of user-defined systems of ordinary differential equations.
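The idea of integrating many independent small-N systems can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Swarm-NG's code or API: the function names, the dictionary layout of a system, the code units with G = 1, and the choice of a kick-drift-kick leapfrog integrator are all hypothetical. The outer loop over systems is the part a GPU implementation would run in parallel, since the systems do not interact with each other.

```python
import math

G = 1.0  # gravitational constant in code units (assumption for this sketch)


def accelerations(pos, mass):
    """Direct-sum Newtonian accelerations for one small-N system."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


def leapfrog_step(system, dt):
    """Advance one system in place by dt using kick-drift-kick leapfrog."""
    pos, vel, mass = system["pos"], system["vel"], system["mass"]
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
            pos[i][k] += dt * vel[i][k]        # full drift
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick


def total_energy(system):
    """Kinetic plus potential energy, useful as a conservation check."""
    pos, vel, mass = system["pos"], system["vel"], system["mass"]
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = math.sqrt(sum((pos[i][k] - pos[j][k]) ** 2 for k in range(3)))
            potential -= G * mass[i] * mass[j] / r
    return kinetic + potential


def make_star_planet(a, m_star=1.0, m_planet=1e-3):
    """Hypothetical test system: a planet on a near-circular orbit of radius a."""
    v = math.sqrt(G * m_star / a)
    return {
        "pos": [[0.0, 0.0, 0.0], [a, 0.0, 0.0]],
        "vel": [[0.0, 0.0, 0.0], [0.0, v, 0.0]],
        "mass": [m_star, m_planet],
    }


def integrate_ensemble(ensemble, dt, steps):
    """Integrate every system independently; this loop is the parallel axis."""
    for system in ensemble:
        for _ in range(steps):
            leapfrog_step(system, dt)
```

A call such as `integrate_ensemble([make_star_planet(1.0), make_star_planet(1.5)], 0.01, 1000)` advances two planetary systems side by side. Comparing `total_energy` before and after gives a quick sanity check on the step size.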

Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems

July 29th, 2010


Ocelot is a dynamic compilation framework designed to map the explicitly data-parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from the Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain-specific applications.

This paper presents a high-level overview of the implementation of the Ocelot dynamic compiler, highlighting design decisions and trade-offs and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications, and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

This paper identifies several key areas of research and open problems for optimizing the performance of data-parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86. The complete implementation of Ocelot is available as open source under the new BSD license. Ongoing work involves translating PTX to AMD’s IL, allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.

(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: “Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems”. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT 2010), September 2010).
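To make the phrase "mapping the data-parallel execution model onto a CPU" concrete, here is a deliberately simplified sketch of the general idea: a CUDA-style grid of thread blocks is turned into ordinary nested loops. It is not Ocelot's implementation, and the names `launch` and `saxpy_kernel` are hypothetical. It also assumes a kernel with no barriers or shared memory; a real translator has to preserve synchronization semantics, which plain serialization like this does not.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run `kernel` once per (block, thread) pair, serially on the CPU.

    Only valid for kernels whose threads never synchronize with each other
    (an assumption of this sketch).
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Per-thread body: out[i] = a * x[i] + y[i] for this thread's index."""
    i = block_idx * block_dim + thread_idx  # global index, as in CUDA
    if i < len(x):  # guard threads that fall past the end of the data
        out[i] = a * x[i] + y[i]
```

For example, `launch(saxpy_kernel, 3, 4, 2.0, x, y, out)` covers ten elements with three blocks of four threads, and the two extra threads are masked out by the bounds check.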

NVIDIA Parallel Nsight Now Shipping

July 21st, 2010

NVIDIA today announced the release of NVIDIA Parallel Nsight software, the industry’s first development environment for GPU-accelerated applications that works with Microsoft Visual Studio. “By adding functionality specifically for GPU Computing developers, Parallel Nsight makes the power of the GPU more accessible than ever before,” said Sanford Russell, GM of GPU Computing at NVIDIA. NVIDIA Parallel Nsight features a CUDA C/C++ debugger and application performance analyzer, and a graphics debugger and inspector. NVIDIA Parallel Nsight supports Windows HPC Server 2008, Windows 7 and Windows Vista. Parallel Nsight is available for download.

OpenMM 2.0 Now Available to Accelerate Molecular Dynamics on NVIDIA and ATI GPUs

July 18th, 2010

Simbios, the NIH Center for Biomedical Computation at Stanford University, is excited to announce the release of OpenMM 2.0.

OpenMM was designed to enhance the performance of almost any molecular dynamics simulation package (MD package) by allowing the code to be executed on high-performance computer architectures, in particular Graphics Processing Units (GPUs). Most molecular dynamics packages can be modified to call OpenMM, resulting in significant acceleration on such architectures without changing the way users interact with the MD package.

CULA 2.0 released

July 11th, 2010

EM Photonics announced today the general availability of CULA 2.0, its GPU-accelerated linear algebra library. The new version provides support for NVIDIA GPUs based on the latest “Fermi” architecture.

CULA contains a LAPACK interface comprising over 150 mathematical routines from LAPACK, the industry standard for computational linear algebra. EM Photonics’ CULA library includes many popular routines, including system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions. CULA offers performance up to an order of magnitude faster than highly optimized CPU-based linear algebra solvers. A variety of interfaces is available for integrating CULA directly into existing code. Programmers can call GPU-accelerated CULA from their C/C++, FORTRAN, MATLAB, or Python codes with no GPU programming experience. CULA is available for every system equipped with GPUs based on the NVIDIA CUDA architecture, including 32- and 64-bit versions of Linux, Windows, and OS X.

Introductory OpenCL Tutorial

July 8th, 2010

This tutorial by Benedict R. Gaster from AMD provides a detailed introduction to OpenCL. Covered topics include:

  • Using platform and device layers to build robust OpenCL™ applications
  • Program compilation and kernel objects
  • Managing buffers
  • Kernel execution
  • Kernel programming – basics
  • Kernel programming – synchronization
  • Matrix multiply – a case study
  • Kernel programming – built-ins

gDEBugger V5.6 – Introducing iPhone and iPad on-device debugging and profiling

July 8th, 2010

Graphic Remedy is proud to announce the release of gDEBugger Version 5.6 for Windows, Linux, Mac OS X, iPhone and iPad. This version introduces iPhone and iPad on-device debugging and profiling, letting developers optimize their apps in real time on actual iPhone and iPad hardware while viewing performance counters for the device’s GPU, CPU, graphics driver and operating system.

gDEBugger is an OpenGL, OpenGL ES and OpenCL debugger and profiler that traces application activity on top of the OpenGL API, and lets programmers see what is happening within the graphics system implementation to find bugs and optimize OpenGL application performance. gDEBugger runs on Windows, Mac OS X, iPhone and Linux operating systems.

Learn CUDA in Sydney or Canberra Next Week

July 7th, 2010

For our Australian readers interested in GPU computing: two free workshops on GPU Computing with CUDA will be held next week. Both workshops will include a tutorial on CUDA C/C++ programming along with additional presentations by local speakers. Topics will include an overview of NVIDIA Tesla and the latest Fermi architecture GPUs, CUDA programming, debugging and profiling tools, and optimization strategies.

Follow the links above for full details. Space is limited, so be sure to RSVP to the addresses provided.