Latest PGI Compilers support OpenACC and CUDA for x86

March 6th, 2012

HPCWire reports:

PORTLAND, Ore., March 5 — The Portland Group, a wholly-owned subsidiary of STMicroelectronics, today announced availability of the 2012 release of the PGI line of high-performance parallelizing compilers and development tools for Linux, OS X and Windows. PGI 2012 is the first general release to include support for the OpenACC directive-based programming model for NVIDIA CUDA-enabled Graphics Processing Units (GPUs). This release is also the first to include the fully feature-enabled PGI CUDA C/C++ compiler for multi-core x64 CPUs from Intel and AMD. In addition, PGI 2012 includes a number of performance and feature enhancements for multi-core x64 processor-based HPC systems.

 

Chai, a new managed platform for GPGPU

February 13th, 2012

Chai is a new managed platform for GPGPU. It is a free and open source clean room workalike of the PeakStream platform. While not production-ready, the just-released alpha version is able to compile and run non-trivial PeakStream demo code on AMD and NVIDIA GPUs (e.g. conjugate gradient).

Chai combines an application virtual machine, garbage collection, auto-tuning JIT compiler, and high level array programming language implemented as an embedded domain-specific language in C++. The JIT back-end uses expectation-maximization to auto-tune and generate vectorized OpenCL. The JIT includes auto-tuned model families for GEMM and GEMV. Although originally developed for AMD GPUs, these parameterized kernel families also generalize to NVIDIA GPUs.

CUDA 4.1 Released

January 26th, 2012

Today NVIDIA released CUDA 4.1, including a new CUDA Toolkit, SDK, Visual Profiler, Parallel Nsight IDE and NVIDIA device driver.

CUDA 4.1 makes it easier to accelerate scientific research with GPUs with key features including

  • a redesigned Visual Profiler with automated performance analysis and expert guidance;
  • a new LLVM-based compiler that generates up to 10% faster code; and
  • 1000+ new imaging and signal processing functions in the NPP library.

The CuSparse library included with CUDA 4.1 has a new tridiagonal solver and 2x faster sparse matrix-vector multiplication using the ELL hybrid format, and the CuRand library included with CUDA 4.1 has two new random number generators. Read the rest of this entry »

CLCC v0.3.0 now available

January 16th, 2012

CLCC, the light-weight and flexible utility for integrating OpenCL source builds into your project has just been updated to version 0.3.0. This version allows developers to save compiled binaries as object files for distribution with their programs and adds a series of options to select specific target platform/device combinations. Documentation and further information is available at http://clcc.sourceforge.net.

Intel SPMD Compiler Version 1.1 Released

December 7th, 2011

A major new release of the Intel SPMD Program Compiler (ispc) was posted on December 5, 2011. ispc is an extended version of the C programming language with support for “single program, multiple data” (SPMD) programming on the CPU; the SPMD model makes it easy to harness the full power of both the SIMD vector units and multiple cores on modern CPUs. The major features added in the 1.1 release include:

  • Full support for pointers, including pointer arithmetic, function pointers, and all other features of pointers in C.
  • A new parallel “foreach” statement, for more easily mapping computation to data.
  • Substantially revised documentation, including a new Performance Guide.
  • Many other small bug fixes and improvements.

ispc is open-source and is licensed under the BSD license. Source and binaries are available from http://ispc.github.com.

Microsoft Announces C++ AMP

June 26th, 2011

Microsoft has announced that the next version of Visual Studio will contain technology labeled C++ Accelerated Massive Parallelism (C++ AMP) to enable C++ developers to take advantage of the GPU for computation purposes. More information is available in the MSDN blog posts here and here.

Intel announces a high-performance SPMD compiler for the CPU

June 26th, 2011

Intel has announced ispc, The Intel SPMD Program Compiler, now available in source and binary form from http://ispc.github.com.

ispc is a new compiler for “single program, multiple data” (SPMD) programs; the same model that is used for (GP)GPU programming, but here targeted to CPUs. ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs; it frequently provides a a 3x or more speedup on CPUs with 4-wide SSE units, without any of the difficulty of writing intrinsics code. There were a few principles and goals behind the design of ispc:

  • To build a small C-like language that would deliver excellent performance to performance-oriented programmers who want to run SPMD programs on the CPU.
  • To provide a thin abstraction layer between the programmer and the hardware—in particular, to have an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
  • To make it possible to harness the computational power of the SIMD vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
  • To explore opportunities from close coupling between C/C++ application code and SPMD ispc code running on the same processor—to have lightweight function calls between the two languages, to share data directly via pointers without copying or reformatting, and so forth.

ispc is an open source compiler with a BSD license. It uses the LLVM Compiler Infrastructure for back-end code generation and optimization and is hosted on github. It supports Windows, Mac, and Linux, with both x86 and x86-64 targets. It currently supports the SSE2 and SSE4 instruction sets, though support for AVX should be available soon.

Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems

July 29th, 2010

Abstract:

Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications.

This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

This paper identifies several key areas of research and open problems for optimizing the performance of data parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86. The complete implementation of Ocelot is available open-source under the new BSD license at http://code.google.com/p/gpuocelot. Ongoing work involves translating PTX to AMD’s IL allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.

(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: “Ocelot: A dynamic compiler for bulk-synchroneous applications in heterogeneous systems”. 19 International Conference on Parallel Architectures and Compilation Techniques (PACT2010), September 2010).

New OpenCL back-end in CAPS HMPP 2.3 hybrid compiler

June 6th, 2010

CAPS has recently added an OpenCL code generator to the just released 2.3 version of its HMPP directive-based hybrid compiler. Also, the CUDA back-end generator has been enhanced with Fermi capabilities and this new release brings support for more native compilers with Intel ifort/icc, GNU gcc/gfortran and PGI pgcc/pgfort compilers, enabling developers to freely use their favorite compiler with HMPP 2.3.

Based on GPU programming and tuning directives, HMPP offers an incremental programming model that allows developers with different levels of expertise to fully exploit GPU hardware accelerators in their legacy code. Read the rest of this entry »

Compiling Python to a hybrid execution environment

April 12th, 2010

Abstract:

A new compilation framework enables the execution of numerical-intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C++ programming language, and jit4GPU, a just-in-time compiler to the AMD CAL interface using CAL pixel shaders. Jit4GPU includes an optimizer that performs several loop transformations and reduces the number of texture instructions. Experimental evaluation was done on a Radeon 4850 and demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases. Code transformations performed by Jit4GPU on GPU code were also shown to produce considerable speedup compared to unoptimized GPU code.

(Rahul Garg and José Nelson Amaral: “Compiling Python to a Hybrid Execution Environment”. Third Workshop on General-Purpose Computation on Graphics Processing Units, held in conjunction with ASPLOS XV, Pittsburgh, PA, March, 2010. [DOI])

Page 1 of 212