Lab4241 GP-GPU profiler

February 21st, 2013

A free, pre-alpha release of Lab4241's GPGPU profiler is now available at www.lab4241.com. It provides non-intrusive, source-line-level performance profiling for C or C++ code and CUDA kernels. The profiler lets the developer evaluate GPU resource usage (execution counts, memory accesses, branch divergence, etc.) per source line and inspect the results in a simple, intuitive GUI, similar to well-known CPU profilers such as Quantify or valgrind.

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows existing GPU-accelerated applications to run on remote GPUs within a cluster (by sharing GPUs, aggregating them, or both) with negligible overhead. The new version is available free of charge at www.rCUDA.net, along with examples, manuals and additional information.

CUDA 5 Production Release Now Available

October 15th, 2012

The CUDA 5 Production Release is now available as a free download at www.nvidia.com/getcuda.
This powerful new version of the pervasive CUDA parallel computing platform and programming model can be used to accelerate more applications using the following four (and many more) new features.

• CUDA Dynamic Parallelism brings GPU acceleration to new algorithms by enabling GPU threads to directly launch CUDA kernels and call GPU libraries.
• A new device code linker enables developers to link external GPU code and build libraries of GPU functions.
• NVIDIA Nsight Eclipse Edition enables you to develop, debug and optimize CUDA code all in one IDE for Linux and Mac OS.
• GPUDirect Support for RDMA provides direct communication between GPUs in different cluster nodes.

As a demonstration of the power of Dynamic Parallelism and device code linking, CUDA 5 includes a device-callable version of the CUBLAS linear algebra library, so threads already running on the GPU can invoke CUBLAS functions on the GPU.

AMD CodeXL: comprehensive developer tool suite for heterogeneous compute

October 9th, 2012

AMD CodeXL is a new unified developer tool suite that enables developers to harness the benefits of CPUs, GPUs and APUs. It includes powerful GPU debugging, comprehensive GPU and CPU profiling, and static OpenCL™ kernel analysis capabilities, enhancing accessibility for software developers to enter the era of heterogeneous computing. AMD CodeXL is available for free, both as a Visual Studio® extension and a standalone user interface application for Windows® and Linux®.

AMD CodeXL increases developer productivity by helping developers identify programming errors and performance issues in their applications quickly and easily. Developers can now debug, profile and analyze their applications with a full system-wide view of AMD APUs, GPUs and CPUs.

The AMD CodeXL user group (requires registration) allows users to interact with the CodeXL team, provide feedback, get support and participate in beta surveys.

SnuCL – OpenCL heterogeneous cluster computing

June 27th, 2012

SnuCL is a freely available, open-source OpenCL framework developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand switches. The host node contains multiple CPU cores, and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL gives the programmer the illusion of a single heterogeneous system. A GPU or a set of CPU cores becomes an OpenCL compute device. SnuCL allows the application to utilize compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.

SnuCL consists of the SnuCL runtime and compiler. The SnuCL compiler is based on the OpenCL C compiler in the SNU-SAMSUNG OpenCL framework. Currently, the SnuCL compiler supports x86, ARM, and PowerPC CPUs, AMD GPUs, and NVIDIA GPUs.

VexCL: Vector expression template library for OpenCL

May 30th, 2012

VexCL is a vector expression template library for OpenCL developed by the Supercomputer Center of the Russian Academy of Sciences. It was created to ease C++-based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under the MIT license.

Main features:

  • Selection and initialization of compute devices according to an extensible set of device filters.
  • Transparent allocation of device vectors spanning multiple devices.
  • Convenient notation for vector arithmetic, sparse matrix-vector multiplication, and reductions. All computations are performed in parallel on all selected devices.
  • Appropriate kernels for vector expressions are generated automatically the first time an expression is used.

Doxygen-generated documentation is available at http://ddemidov.github.com/vexcl/index.html.
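
The expression template technique behind the "convenient notation for vector arithmetic" can be illustrated with a minimal sketch in plain C++. This is not VexCL's API or implementation: the names Expr, BinExpr, Vec and fused_example are hypothetical, and the expression is evaluated in an ordinary CPU loop rather than in a generated OpenCL kernel. It only shows how an expression such as a + b * c can be captured as a type and evaluated in a single pass without temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: tags a type as a vector expression (hypothetical name).
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Element-wise operations used by the expression nodes.
struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

// Lazy node for "lhs OP rhs". It holds references only, so it must be
// consumed within the full expression that created it (do not store it).
template <class L, class R, class Op>
struct BinExpr : Expr<BinExpr<L, R, Op>> {
    const L& lhs;
    const R& rhs;
    BinExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }
    std::size_t size() const { return lhs.size(); }
};

// Concrete vector. Assigning an expression runs one fused loop.
class Vec : public Expr<Vec> {
    std::vector<double> data_;
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    template <class E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        // A library like VexCL would generate and launch a device kernel
        // here; this sketch simply evaluates element by element on the CPU.
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
        return *this;
    }
};

// Operators build expression nodes instead of computing results.
template <class L, class R>
BinExpr<L, R, Add> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<L, R, Add>(l.self(), r.self());
}

template <class L, class R>
BinExpr<L, R, Mul> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<L, R, Mul>(l.self(), r.self());
}

// Usage: x[i] = a[i] + b[i] * c[i], computed in a single loop.
inline Vec fused_example(std::size_t n) {
    Vec a(n, 1.0), b(n, 2.0), c(n, 3.0), x(n);
    x = a + b * c;
    return x;
}
```

Because the whole right-hand side is encoded in the type BinExpr<Vec, BinExpr<Vec, Vec, Mul>, Add>, a library has enough information at compile time to emit one kernel per distinct expression, which is consistent with the note above that kernels are generated the first time an expression is used.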

Panoptes: A Binary Translation Framework for CUDA

May 22nd, 2012

Traditional CPU-based computing environments offer a variety of binary instrumentation frameworks. Instrumentation and analysis tools for GPU environments to date have been more limited. Panoptes is a binary instrumentation framework for CUDA that targets the GPU. By exploiting the GPU to run modified kernels, computationally-intensive programs can be run at the native parallelism of the device during analysis. To demonstrate its instrumentation capabilities, we currently implement a memory addressability and validity checker that targets CUDA programs.

Panoptes traces targeted programs by library interposition at runtime.

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework, which will include many new features as well as improved performance. This new version, in development for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta testing program has been started. See the complete text at http://www.rcuda.net/index.php/news/19-new-revolutionary-version-of-rcuda-to-be-launched.html.

Latest PGI Compilers support OpenACC and CUDA for x86

March 6th, 2012

HPCWire reports:

PORTLAND, Ore., March 5 — The Portland Group, a wholly-owned subsidiary of STMicroelectronics, today announced availability of the 2012 release of the PGI line of high-performance parallelizing compilers and development tools for Linux, OS X and Windows. PGI 2012 is the first general release to include support for the OpenACC directive-based programming model for NVIDIA CUDA-enabled Graphics Processing Units (GPUs). This release is also the first to include the fully feature-enabled PGI CUDA C/C++ compiler for multi-core x64 CPUs from Intel and AMD. In addition, PGI 2012 includes a number of performance and feature enhancements for multi-core x64 processor-based HPC systems.

Chai, a new managed platform for GPGPU

February 13th, 2012

Chai is a new managed platform for GPGPU. It is a free, open-source, clean-room workalike of the PeakStream platform. While not production-ready, the just-released alpha version is able to compile and run non-trivial PeakStream demo code (e.g. conjugate gradient) on AMD and NVIDIA GPUs.

Chai combines an application virtual machine, garbage collection, an auto-tuning JIT compiler, and a high-level array programming language implemented as an embedded domain-specific language in C++. The JIT back-end uses expectation-maximization to auto-tune and generate vectorized OpenCL. The JIT includes auto-tuned model families for GEMM and GEMV. Although originally developed for AMD GPUs, these parameterized kernel families also generalize to NVIDIA GPUs.
