Acceleware OpenCL, CUDA and AMP training

August 9th, 2012

The fall schedule for Acceleware’s training courses is now available.

  • OpenCL: August 21-24, 2012, Houston, TX
  • CUDA: October 2-5, 2012, San Jose, CA
  • OpenCL: October 16-19, 2012, Calgary, AB
  • CUDA: November 6-9, 2012, Houston, TX
  • CUDA: December 4-7, 2012, New York, NY – Finance Focus
  • AMP: December 11-14, 2012, Chicago, IL

More information: http://www.acceleware.com/training

OpenACC Compilers Now Available from PGI

July 27th, 2012

PGI Release 12.6 is now out. New in this release:

  • PGI Accelerator compilers — first release of the Fortran and C compilers to include comprehensive support for the OpenACC 1.0 specification including the acc cache construct and the entire OpenACC API library. See the PGI Accelerator page for a complete list of supported features.
  • CUDA Toolkit — PGI Accelerator compilers and CUDA Fortran now include support for CUDA Toolkit version 4.2; version 4.1 is now the default.

Download a free trial from the PGI website at http://www.pgroup.com/support/download_pgi2012.php?view=current. Upcoming PGI webinar with Michael Wolfe. 9:00AM PDT, July 31st sponsored by NVIDIA: “Using OpenACC Directives with the PGI Accelerator Compilers”. Register at http://www.pgroup.com/webinar212.htm?clicksource=gpgpu712.

MC# 3.0 with GPU support

July 22nd, 2012

Version 3.0 of the MC# programming system has been released. MC# is an universal parallel programming language aimed to any parallel architecture  -  multicore processors, systems with GPU, or clusters. It is an extension of C# language and supports high-level parallel programming style.

SnuCL – OpenCL heterogeneous cluster computing

June 27th, 2012

SnuCL is an OpenCL framework and freely available, open-source software developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes. They are connected by an interconnection network, such as Gigabit and InfiniBand switches. The host node contains multiple CPU cores and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL provides an illusion of a single heterogeneous system for the programmer. A GPU or a set of CPU cores becomes an OpenCL compute device. SnuCL allows the application to utilize compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.

SnuCL consists of SnuCL runtime and compiler. The SnuCL compiler is based on the OpenCL C compiler in SNU-SAMSUNG OpenCL framework. Currently, the SnuCL compiler supports x86, ARM, and PowerPC CPUs, AMD GPUs, and NVIDIA GPUs.

New Three-part Webinar Series: OpenCL Programming on Intel® Processors

June 12th, 2012

Register today up now for a webinar series on how to use the Intel® SDK for OpenCL Applications to best utilize the CPU and Intel® HD Graphics of 3rd Gen Intel® Core™ processors for developing OpenCL applications:

 

VexCL: Vector expression template library for OpenCL

May 30th, 2012

VexCL is vector expression template library for OpenCL developed by the Supercomputer Center of Russian academy of sciences. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under MIT license.

Main features:

  • Selection and initialization of compute devices according to extensible set of device filters.
  • Transparent allocation of device vectors spanning multiple devices.
  • Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices.
  • Appropriate kernels for vector expressions are generated automatically first time an expression is used.

Doxygen-generated documentation is available at http://ddemidov.github.com/vexcl/index.html.

C++ AMP Courses

May 26th, 2012

C++ Accelerated Massive Parallelism (C++ AMP) is a new open specification heterogeneous programming model, which builds on the established C++ language. Developed for heterogeneous platforms C++ AMP is designed to accelerate the execution of C++ code by taking advantage of the data-parallel hardware that is commonly present as a GPU. These courses are aimed at programmers who are looking to develop comprehensive skills in writing and optimizing applications using C++ AMP. Read the rest of this entry »

Accelerate OpenFOAM with SpeedIT 2.1

May 24th, 2012

SpeedIT provides a set of accelerated solvers for sparse linear systems of equations. The library supports C/C++ and Fortran, and it can be used with OpenFOAM to accelerate CFD simulations. SpeedIT 2.1 contains two new preconditioners:

• Algebraic Multigrid with Smoothed Aggregation (AMG)
• Approximate Inverse (AINV)

OpenFOAM simulations on the GPU can be up to 3.5x faster compared to CG and DIC/DILU preconditioners on the CPU and up to 1.6x faster if you run GAMG.

See the SpeedIT website and blog for more details.

Panoptes: A Binary Translation Framework for CUDA

May 22nd, 2012

Traditional CPU-based computing environments offer a variety of binary instrumentation frameworks. Instrumentation and analysis tools for GPU environments to date have been more limited. Panoptes is a binary instrumentation framework for CUDA that targets the GPU. By exploiting the GPU to run modified kernels, computationally-intensive programs can be run at the native parallelism of the device during analysis. To demonstrate its instrumentation capabilities, we currently implement a memory addressability and validity checker that targets CUDA programs.

Panoptes traces targeted programs by library interposition at runtime. Read the rest of this entry »

NVIDIA Kepler GK110 Architecture White Paper

May 20th, 2012

NVIDIA Kepler GK110 Die Shot

This white paper describes the new Kepler  GK110 Architecture from NVIDIA.

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla® and the HPC market.

Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60‐65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

The paper describes features of the Kepler GK110 architecture, including

  • Dynamic Parallelism;
  • Hyper-Q;
  • Grid Management Unit;
  • NVIDIA GPUDirect™;
  • New SHFL instruction and atomic instruction enhancements;
  • New read-only data cache previously only accessible to texture;
  • Bindless Textures;
  • and much more.
Page 3 of 3612345...102030...Last »