A Fast GEMM Implementation on a Cypress GPU

October 12th, 2010


We present benchmark results of optimized dense matrix multiplication kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels achieve 73% and 87% of the theoretical peak performance of the GPU, respectively. To our knowledge, these are currently the fastest SGEMM and DGEMM kernels on a single GPU chip. Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s, which is more than 200 times faster than DDP on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with the main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques. While the conventional wisdom of GPU programming recommends heavy use of shared memory, we show that the texture cache is very effective on the Cypress architecture.

(N. Nakasato: “A Fast GEMM Implementation on a Cypress GPU”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10) November 2010. A sample program is available at http://github.com/dadeba/dgemm_cypress)
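For readers unfamiliar with the operation being benchmarked, the following is a minimal reference sketch of GEMM and of how a Gflop/s figure is usually derived from a timing. It is not the paper's kernel: the function names are hypothetical, the code runs on the CPU in plain Python, and the 2·m·n·k operation count is the common convention, assumed here rather than taken from the paper.

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM: return alpha * (A x B) + beta * C.

    A is m-by-k, B is k-by-n, C is m-by-n, all as lists of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            out[i][j] = alpha * acc + beta * C[i][j]
    return out


def gemm_flops(m, n, k):
    """Conventional operation count: one multiply and one add per inner step."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Achieved rate in Gflop/s for one GEMM call that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    C = [[0.0, 0.0], [0.0, 0.0]]
    print(gemm(1.0, A, B, 0.0, C))
    print(gflops(1024, 1024, 1024, 1.0))
```

Dividing the achieved rate by the device's theoretical peak gives the percentage-of-peak figures quoted in the abstract.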

ATI Stream SDK v2.2 w/ OpenCL 1.1 Support released

August 22nd, 2010

Version 2.2 of the ATI Stream SDK has been released. Features include:

  • Support for OpenCL™ 1.1 specification.
  • Support for Ubuntu® 10.04 and Red Hat® Enterprise Linux® 5.5.
  • Support for X86 CPUs with SSE2.x or later (Adds to existing support for X86 CPUs with SSE3.x or later).
  • Support for Microsoft® Visual Studio® 2010 Professional Edition and Minimalist GNU for Windows (MinGW) [GCC 4.4].
  • Support for GNU Compiler Collection (GCC) 4.1 or later on Linux® systems (Adds to existing support for GCC 4.3 or later).
  • Support for single-channel OpenCL™ image format.
  • Support for OpenCL™ / DirectX® 10 interoperability.
  • Support for additional double-precision floating point routines in OpenCL™ C kernels.
  • Support for generating and loading binary OpenCL™ kernels.
  • Support for native OpenCL™ kernels.
  • Preview Feature: Support for accessing additional physical memory on the GPU from OpenCL™ applications.
  • Preview Feature: Support for printf() in OpenCL™ C kernels.
  • Extension: Support for additional event states when registering event callbacks in OpenCL™ 1.1.
  • Additional OpenCL™ samples.
  • Package Update: ATI Stream Profiler 1.4.
  • Various OpenCL™ compiler and runtime fixes and enhancements.
  • Expanded OpenCL™ performance optimization guidelines in the ATI Stream SDK OpenCL™ Programming Guide.

The SDK and all documentation can be downloaded from http://developer.amd.com/stream.

Introductory OpenCL Tutorial

July 8th, 2010

This tutorial by Benedict R. Gaster from AMD provides a detailed introduction to OpenCL. Covered topics include:

  • Using platform and device layers to build robust OpenCL™ applications
  • Program compilation and kernel objects
  • Managing buffers
  • Kernel execution
  • Kernel programming – basics
  • Kernel programming – synchronization
  • Matrix multiply – a case study
  • Kernel programming – built-ins

Introductory Tutorial to OpenCL™ for HPC at SAAHPC’10

May 30th, 2010

AMD is offering an introductory tutorial to OpenCL™ that will be held alongside the 2010 Symposium on Application Accelerators in High Performance Computing (SAAHPC’10). The tutorial is a “programmer’s introduction” which covers the ideas behind OpenCL™ and their translation to source code.

Compiling Python to a hybrid execution environment

April 12th, 2010


A new compilation framework enables the execution of numerical-intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C++ programming language, and jit4GPU, a just-in-time compiler to the AMD CAL interface using CAL pixel shaders. Jit4GPU includes an optimizer that performs several loop transformations and reduces the number of texture instructions. Experimental evaluation was done on a Radeon 4850 and demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases. Code transformations performed by Jit4GPU on GPU code were also shown to produce considerable speedup compared to unoptimized GPU code.

(Rahul Garg and José Nelson Amaral: “Compiling Python to a Hybrid Execution Environment”. Third Workshop on General-Purpose Computation on Graphics Processing Units, held in conjunction with ASPLOS XV, Pittsburgh, PA, March, 2010. [DOI])

Accelerating MATLAB Image Processing Toolbox Functions on GPUs

March 23rd, 2010


We present our effort in developing an open-source GPU (graphics processing unit) code library for the MATLAB Image Processing Toolbox (IPT). We ported a dozen representative functions from IPT and, based on their inherent characteristics, grouped these functions into four categories: data independent, data sharing, algorithm dependent and data dependent. For each category, we present a detailed case study, which reveals interesting insights on how to efficiently optimize the code for GPUs and highlights performance-critical hardware features, some of which have not been well explored in existing literature. Our results show drastic speedups for the functions in the data-independent or data-sharing category by leveraging hardware support judiciously, and moderate speedups for those in the algorithm-dependent category by careful algorithm selection and parallelization. For the functions in the last category, fine-grain synchronization and data-dependency requirements are the main obstacles to an efficient implementation on GPUs.

(J. Kong, et al., “Accelerating MATLAB Image Processing Toolbox Functions on GPUs”, Proceedings of the Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), Pittsburgh, PA. Apr. 2010. Source code is available here.)

Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units

March 14th, 2010


Basic uniform pseudo-random number generators are implemented on ATI Graphics Processing Units (GPUs). The performance results of the realized generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU and GPU are discussed. The obtained speed-up is a factor of hundreds in comparison with the CPU. The RANLUX generator is found to be the most appropriate for use on GPUs in Monte Carlo simulations. A brief review of the pseudo-random number generators used in modern software packages for Monte Carlo simulations in high-energy physics is presented.

(Vadim Demchik, “Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units”, Mar. 2010, arXiv:1003.1898 [hep-lat])
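As an illustration of the two simplest generators named above, here is a minimal CPU-side sketch in Python. It assumes the conventional parameters: multiplier 16807 and modulus 2^31 - 1 for the multiplicative congruential generator, and Marsaglia's shifts (11, 19, 8) with his default seeds for XOR128. The paper's own GPU implementations and parameter choices are not reproduced here, and the function names are hypothetical.

```python
M31 = 2**31 - 1  # modulus assumed for the multiplicative congruential generator


def ggl_next(x):
    """One step of the multiplicative LCG: x <- 16807 * x mod (2^31 - 1)."""
    return (16807 * x) % M31


def ggl_state_after(seed, steps):
    """Integer state reached after `steps` updates starting from `seed`."""
    x = seed
    for _ in range(steps):
        x = ggl_next(x)
    return x


def ggl_uniform(seed):
    """Yield floats strictly between 0 and 1 from the LCG state sequence."""
    x = seed
    while True:
        x = ggl_next(x)
        yield x / M31


def xor128(x=123456789, y=362436069, z=521288629, w=88675123):
    """Marsaglia-style xorshift with 128 bits of state; yields 32-bit integers."""
    mask = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic
    while True:
        t = (x ^ (x << 11)) & mask
        x, y, z = y, z, w
        w = (w ^ (w >> 19) ^ t ^ (t >> 8)) & mask
        yield w


if __name__ == "__main__":
    print(ggl_state_after(1, 10000))
    gen = xor128()
    print([next(gen) for _ in range(3)])
```

Both generators keep only a few words of state per stream, which is one reason such generators are natural first candidates for a per-thread GPU implementation.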

ATI Stream SDK 2.0 Production Release

January 26th, 2010

From the release notes:

ATI Stream SDK 2.0 is the first production SDK for both AMD GPUs and x86 CPUs. This release supports a wide range of ATI graphics processors, including the new ATI Radeon HD 5970, and provides support for OpenCL ICD (Installable Client Driver), atomic functions for 32-bit integers, a Microsoft Visual Studio 2008-integrated ATI Stream Profiler performance analysis tool, and other robust features. Preview support for upcoming features includes OpenCL and Microsoft DirectX 10 interoperability, and double-precision floating point basic arithmetic in OpenCL C kernels.

AMD STREAM SDK v2.0 beta Supports OpenCL on CPUs and GPUs

October 19th, 2009

AMD’s STREAM SDK v2.0 beta4 is the first release of the STREAM SDK with OpenCL support on CPUs and GPUs. The OpenCL implementation is certified OpenCL 1.0 conformant by the Khronos Group. Supported platforms are Windows XP, Vista and Windows 7, and a number of Linux distributions, all in 32- and 64-bit versions. The implementation supports AMD and Intel multicore CPUs, as well as the two latest GPU generations from AMD.

The STREAM SDK as well as documentation and further information is available on AMD’s developer website.

ATI Radeon™ HD 5800 Series Announced By AMD

October 1st, 2009

AMD announced its latest ATI Radeon™ series of graphics cards on September 23rd. The new GPUs boast up to 2.72 TFLOP/s of single-precision floating point throughput, along with DirectX® 11 graphics (including DirectCompute) and OpenCL 1.0 support.

From the press release:

AMD (NYSE: AMD) today launched the most powerful processor ever created, found in its next-generation graphics cards, the ATI Radeon™ HD 5800 series graphics cards, and the world’s first and only to fully support Microsoft DirectX® 11, the new gaming and compute standard shipping shortly with Microsoft Windows® 7 operating system. Boasting up to 2.72 TeraFLOPS of compute power, the ATI Radeon™ HD 5800 series effectively doubles the value consumers can expect of their graphics purchases, delivering twice the performance-per-dollar of previous generations of graphics products. AMD will initially release two cards: the ATI Radeon HD 5870 and the ATI Radeon HD 5850, each with 1GB GDDR5 memory. With the ATI Radeon™ HD 5800 series of graphics cards, PC users can expand their computing experience with ATI Eyefinity multi-display technology, accelerate their computing experience with ATI Stream technology, and dominate the competition with superior gaming performance and full support of Microsoft DirectX® 11, making it a “must-have” consumer purchase just in time for Microsoft Windows® 7 operating system.
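The 2.72 TeraFLOPS figure in the press release can be reproduced with simple arithmetic. The sketch below assumes the Radeon HD 5870 configuration of 1600 stream processors at 850 MHz, each issuing one multiply-add (two floating point operations) per cycle, and a double-precision rate of one fifth of the single-precision rate. None of these hardware parameters are stated in the text above, and the function name is hypothetical.

```python
def peak_gflops(stream_processors, clock_mhz, flops_per_cycle=2):
    """Theoretical peak in GFLOP/s: units x flops per cycle x clock rate."""
    return stream_processors * flops_per_cycle * clock_mhz / 1000


# Assumed Radeon HD 5870 parameters (not taken from the press release).
HD5870_STREAM_PROCESSORS = 1600
HD5870_CLOCK_MHZ = 850

sp_peak = peak_gflops(HD5870_STREAM_PROCESSORS, HD5870_CLOCK_MHZ)
dp_peak = sp_peak / 5  # assumed double-precision ratio of 1:5

if __name__ == "__main__":
    print("single precision peak (GFLOP/s):", sp_peak)
    print("double precision peak (GFLOP/s):", dp_peak)
    # Fractions of peak reported for the Cypress SGEMM and DGEMM kernels
    # in the GEMM entry at the top of this page.
    print("73% of SP peak (GFLOP/s):", 0.73 * sp_peak)
    print("87% of DP peak (GFLOP/s):", 0.87 * dp_peak)
```

Under these assumptions the single-precision peak comes out at 2720 GFLOP/s, which matches the press release's 2.72 TeraFLOPS.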
