Microsoft Announces C++ AMP
June 26th, 2011
Microsoft has announced that the next version of Visual Studio will contain technology labeled C++ Accelerated Massive Parallelism (C++ AMP) to enable C++ developers to take advantage of the GPU for computation purposes. More information is available in the MSDN blog posts.
Intel announces a high-performance SPMD compiler for the CPU
June 26th, 2011
Intel has announced ispc, the Intel SPMD Program Compiler, now available in source and binary form from http://ispc.github.com.
ispc is a new compiler for “single program, multiple data” (SPMD) programs, the same model that is used for (GP)GPU programming but targeted here at CPUs. ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs; it frequently provides a 3x or greater speedup on CPUs with 4-wide SSE units, without any of the difficulty of writing intrinsics code. There were a few principles and goals behind the design of ispc:
- To build a small C-like language that would deliver excellent performance to performance-oriented programmers who want to run SPMD programs on the CPU.
- To provide a thin abstraction layer between the programmer and the hardware—in particular, to have an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
- To make it possible to harness the computational power of the SIMD vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
- To explore opportunities from close coupling between C/C++ application code and SPMD ispc code running on the same processor—to have lightweight function calls between the two languages, to share data directly via pointers without copying or reformatting, and so forth.
ispc is an open source compiler with a BSD license. It uses the LLVM Compiler Infrastructure for back-end code generation and optimization and is hosted on GitHub. It supports Windows, Mac, and Linux, with both x86 and x86-64 targets. It currently supports the SSE2 and SSE4 instruction sets, and support for AVX should be available soon.
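To make the SPMD model above concrete, here is a minimal sketch in plain C++. It is not ispc syntax and uses no SIMD intrinsics: the program is written for a single element, and a driver runs it across a "gang" of lanes. The names `kGangWidth`, `saxpy_instance`, and `saxpy_spmd` are hypothetical, and the plain inner loop only stands in for the SIMD lanes that an SPMD compiler would target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical gang width, chosen to mirror a 4-wide SSE unit.
constexpr std::size_t kGangWidth = 4;

// The "single program": written for one element, as if it ran alone.
inline float saxpy_instance(float a, float x, float y) {
    return a * x + y;
}

// Runs the program over all elements, one gang of kGangWidth lanes at a time.
// The inner loop is serial here; it marks the work an SPMD compiler would
// map onto the lanes of one SIMD register.
std::vector<float> saxpy_spmd(float a,
                              const std::vector<float>& x,
                              const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> out(n);
    for (std::size_t base = 0; base < n; base += kGangWidth) {
        for (std::size_t lane = 0; lane < kGangWidth; ++lane) {
            const std::size_t i = base + lane;
            if (i < n) {  // lanes past the end of the data are masked off
                out[i] = saxpy_instance(a, x[i], y[i]);
            }
        }
    }
    return out;
}
```

The point of the sketch is the division of labor: the per-element function contains no vector code at all, and the mapping to lanes lives entirely in the driver.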
CUDA 4.0 Library Performance Overview
June 26th, 2011
The performance of many math functions has improved with the release of the CUDA 4.0 Toolkit. This presentation reports performance measurements for many of the key functions in:
- cuFFT – Fast Fourier Transforms Library
- cuBLAS – Complete BLAS Library
- cuSPARSE – Sparse Matrix Library
- cuRAND – Random Number Generation (RNG) Library
- NPP – Performance Primitives for Image & Video Processing
- Thrust – Templated Parallel Algorithms & Data Structures
- math.h – C99 floating-point Library
Parallel Solution of Sparse Triangular Linear Systems
June 26th, 2011
Abstract:
A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel. The numerical experiments are also presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods, using the parallel sparse triangular solve algorithm, can achieve on average more than 2x speedup on graphical processing units (GPUs) over their CPU implementation.
(Maxim Naumov: “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”, NVIDIA Technical Report, June 2011. [WWW])
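The two phases described in the abstract can be sketched on the CPU as follows. This is an illustrative serial version under stated assumptions, not the implementation from the report: the `CsrLower` struct and the functions `build_levels` and `solve_lower` are hypothetical names, the matrix is assumed to be lower triangular in CSR form with a nonzero diagonal entry stored in every row, and the loop over the rows of a level stands in for the work a GPU would do in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lower-triangular matrix in CSR form (hypothetical struct for this sketch).
// Every row is assumed to store its nonzero diagonal entry.
struct CsrLower {
    std::size_t n;
    std::vector<std::size_t> row_ptr;  // size n + 1
    std::vector<std::size_t> col;      // column index of each nonzero
    std::vector<double> val;           // value of each nonzero
};

// Analysis phase: uses only the sparsity pattern. A row's level is one more
// than the deepest level among the rows it depends on.
std::vector<std::vector<std::size_t>> build_levels(const CsrLower& L) {
    std::vector<std::size_t> level(L.n, 0);
    std::size_t num_levels = 0;
    for (std::size_t i = 0; i < L.n; ++i) {
        for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            const std::size_t j = L.col[k];
            if (j < i) level[i] = std::max(level[i], level[j] + 1);
        }
        num_levels = std::max(num_levels, level[i] + 1);
    }
    std::vector<std::vector<std::size_t>> levels(num_levels);
    for (std::size_t i = 0; i < L.n; ++i) levels[level[i]].push_back(i);
    return levels;
}

// Solve phase: levels are visited in order; rows inside one level do not
// depend on each other, so their loop is the part that could run in parallel.
std::vector<double> solve_lower(const CsrLower& L,
                                const std::vector<std::vector<std::size_t>>& levels,
                                const std::vector<double>& b) {
    assert(b.size() == L.n);
    std::vector<double> x(L.n, 0.0);
    for (const auto& rows : levels) {
        for (std::size_t i : rows) {
            double sum = b[i];
            double diag = 0.0;
            for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
                const std::size_t j = L.col[k];
                if (j < i) sum -= L.val[k] * x[j];
                else if (j == i) diag = L.val[k];
            }
            assert(diag != 0.0);
            x[i] = sum / diag;
        }
    }
    return x;
}
```

Because the levels depend only on the sparsity pattern, the analysis result can be reused for every solve with the same matrix, which is what makes the split attractive inside a preconditioned iterative method.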
GPU Computing and C++: An Evening with Microsoft and NVIDIA
June 26th, 2011
In Silicon Valley? Interested in C++? Join in an evening with Microsoft & NVIDIA to discuss new C++ technology for parallel computing. Register here: http://vnextmsvc.eventbrite.com/
- 5:45 PM Welcome & Registration
- 6:00 PM Heterogeneous Parallelism in General, C++ AMP in Particular, presented by Herb Sutter, Principal Architect for Windows C++, Microsoft
- 7:15 PM ALM tools for C++ in Visual Studio V.NEXT, presented by Rong Lu, Program Manager C++, Microsoft
- 8:00 PM The Power of Parallel, presented by the NVIDIA Team:
  - Parallel Nsight: Programming GPUs in Visual Studio, Stephen Jones, NVIDIA
  - CUDA 4.0: Parallel Programming Made Easy, Justin Luitjens, NVIDIA
  - Thrust: C++ Template Library for GPGPUs, Jared Hoberock, NVIDIA
Refreshments provided.
Scalable instruction set simulator for thousand-core architectures running on GPGPUs
June 26th, 2011
Abstract:
Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent potential processing parallelism available in modern GPGPUs.
(S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp.459-466, June/July 2010. [DOI] [WWW])
CUDA and OpenCL supported in Indigo Renderer 3.0 and Indigo RT
June 26th, 2011
From a recent announcement:
Glare Technologies is proud to announce the release of Indigo Renderer 3.0 and Indigo RT. We use a hybrid GPU acceleration approach, which typically results in a 2-3x speedup when paired with a sufficiently powerful CPU. Realtime scene changes are possible, also in conjunction with network rendering to further accelerate rendering performance. A page outlining the other features and improvements of Indigo 3.0 and Indigo RT can be found at http://www.indigorenderer.com/indigo3 and http://www.indigorenderer.com/indigo_rt.
GPIUTMD 0.9.6 released
June 26th, 2011
GPIUTMD stands for Graphic Processors at Isfahan University of Technology for Many-particle Dynamics. It performs general-purpose many-particle dynamic simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to thousands of cores on a fast cluster. Flexible and configurable, GPIUTMD is currently being used for all-atom and coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants; dissipative particle dynamics (DPD) simulations of polymers; and crystallization of metals using EAM potentials.
GPIUTMD 0.9.6 adds many new features. Highlights include:
- Morse bond potential
- Constant acceleration applied to a group of particles (useful for modeling gravity effects)
- Computation of the full virial stress tensor (useful in mechanical characterization of materials)
- Long-ranged electrostatics via PPPM
- Support for CUDA 3.2
- Theory manual
- Up to a twenty percent boost in simulation performance
- and more
A demo version of GPIUTMD 0.9.6 will be available soon for download under an open source license. Check out the quick start tutorial to get started, or check out the full documentation to see everything it can do.
Anjuta Project Wizards for AMD, NVIDIA and Intel OpenCL SDK
June 14th, 2011
To ease OpenCL development on Linux, Wendell Rodrigues has created Anjuta DevStudio wizards for starting OpenCL application projects with the SDKs from NVIDIA, AMD, or Intel. Refer to his blog for details and downloads.
OpenCL Parallel Primitives Library
June 3rd, 2011
clpp is an OpenCL library of data-parallel algorithm primitives such as parallel prefix sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information, visit http://code.google.com/p/clpp.
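As an illustration of how scan serves as a building block, the following sketch builds stream compaction on top of an exclusive prefix sum. It is a serial C++ reference, not the clpp API: `exclusive_prefix_sum` and `stream_compact` are hypothetical names, and the flags are assumed to be exactly 0 or 1. In a data-parallel library both the scan and the final scatter would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum ("scan"): out[i] is the sum of in[0] .. in[i-1].
std::vector<int> exclusive_prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Stream compaction: keep data[i] wherever flags[i] == 1.
// Scanning the flags yields, for every kept element, its output position.
std::vector<int> stream_compact(const std::vector<int>& data,
                                const std::vector<int>& flags) {
    assert(data.size() == flags.size());
    if (data.empty()) return {};
    const std::vector<int> offsets = exclusive_prefix_sum(flags);
    const std::size_t kept =
        static_cast<std::size_t>(offsets.back() + flags.back());
    std::vector<int> out(kept);
    // Each iteration writes to a distinct slot, so this scatter has no
    // dependencies between iterations.
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (flags[i] == 1) out[static_cast<std::size_t>(offsets[i])] = data[i];
    }
    return out;
}
```

The scan turns a question that looks sequential ("how many elements before me were kept?") into a single primitive, which is why it appears under so many other data-parallel algorithms.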