Panoptes: A Binary Translation Framework for CUDA

May 22nd, 2012

Traditional CPU-based computing environments offer a variety of binary instrumentation frameworks. Instrumentation and analysis tools for GPU environments to date have been more limited. Panoptes is a binary instrumentation framework for CUDA that targets the GPU. By exploiting the GPU to run modified kernels, computationally-intensive programs can be run at the native parallelism of the device during analysis. To demonstrate its instrumentation capabilities, we currently implement a memory addressability and validity checker that targets CUDA programs.

Panoptes traces targeted programs by library interposition at runtime. Interactions with the GPU are intercepted, annotated as necessary, and are then sent to the actual CUDA library for execution on the device. This approach gives an analysis tool built on Panoptes a complete view of the state of the GPU without additional developer effort. In contrast, developer-added instrumentation may be incomplete due to errors of omission or cause maintenance difficulties, particularly for large code bases.
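The intercept, annotate, and forward pattern described above can be illustrated with a small host-side sketch. This is not Panoptes code and does not use the CUDA API: `FakeDeviceLibrary`, `Interposer`, `alloc`, and `free` are hypothetical names chosen for illustration, and a real interposer would be a shared library loaded in place of the CUDA runtime rather than a Python wrapper.

```python
class FakeDeviceLibrary:
    """Hypothetical stand-in for the real device library (not the CUDA API)."""

    def __init__(self):
        self.memory = {}
        self.next_handle = 1

    def alloc(self, size):
        handle = self.next_handle
        self.next_handle += 1
        self.memory[handle] = bytearray(size)
        return handle

    def free(self, handle):
        del self.memory[handle]


class Interposer:
    """Sits between the program and the library, recording every call."""

    def __init__(self, real):
        self._real = real
        self.trace = []    # one (function name, args, result) tuple per call
        self.live = set()  # handles that are currently allocated

    def __getattr__(self, name):
        # Only reached for names not defined on Interposer itself,
        # i.e. for the library entry points we want to intercept.
        target = getattr(self._real, name)

        def wrapper(*args):
            result = target(*args)  # forward to the real library
            self.trace.append((name, args, result))
            if name == "alloc":
                self.live.add(result)
            elif name == "free":
                self.live.discard(args[0])
            return result

        return wrapper


# The program uses the interposer exactly as it would use the library.
lib = Interposer(FakeDeviceLibrary())
handle = lib.alloc(64)
lib.free(handle)
for name, args, result in lib.trace:
    print(name, args, result)
```

Because the calling code is unchanged, the tool sees every allocation and release without developer-added instrumentation, which is the property the paragraph above attributes to interposition.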

By directing annotated instructions to the GPU for execution rather than relying on the host for emulation, Panoptes is able to analyze programs at scale. The rift in parallel execution capabilities between modern GPUs and CPUs carries into testing and debugging as well. For computationally intensive tasks brought to the GPU explicitly for its parallelism, resorting to host-based emulation may necessitate reduced or simplified inputs for analysis. More details: http://github.com/ckennelly/panoptes

NVIDIA Kepler GK110 Architecture White Paper

May 20th, 2012

This white paper describes the new Kepler GK110 Architecture from NVIDIA.

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla® and the HPC market.

Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60‐65% on the prior Fermi architecture.
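To make the efficiency figures concrete: DGEMM efficiency is the fraction of peak double-precision throughput that a matrix multiply actually sustains. The sketch below is plain arithmetic on the percentages quoted above. The 1.0 TFlop peak is an assumed round number standing in for "over 1 TFlop", and the function name is hypothetical.

```python
def sustained_tflops(peak_tflops, efficiency):
    """Sustained DGEMM rate = peak rate * efficiency (efficiency in [0, 1])."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return peak_tflops * efficiency


# Assumption: a round 1.0 TFlop peak, used only to illustrate the formula.
ASSUMED_PEAK_TFLOPS = 1.0

kepler_sustained = sustained_tflops(ASSUMED_PEAK_TFLOPS, 0.80)

# At equal peak, the efficiency change alone is worth this much extra
# sustained throughput relative to the 60-65% range quoted for Fermi.
gain_vs_65 = 0.80 / 0.65
gain_vs_60 = 0.80 / 0.60

print(f"sustained at 80%: {kepler_sustained:.2f} TFlop")
print(f"gain from efficiency alone: {gain_vs_65:.2f}x to {gain_vs_60:.2f}x")
```

In other words, even before any increase in peak rate, moving from 60-65% to 80% efficiency yields roughly a quarter to a third more sustained DGEMM throughput.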

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

The paper describes features of the Kepler GK110 architecture, including

  • Dynamic Parallelism;
  • Hyper-Q;
  • Grid Management Unit;
  • NVIDIA GPUDirect™;
  • New SHFL instruction and atomic instruction enhancements;
  • New read-only data cache previously only accessible to texture;
  • Bindless Textures;
  • and much more.
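As a rough illustration of what the SHFL instruction enables, the sketch below simulates a shuffle-down sum reduction across one 32-lane warp on the host. It is not device code: `shfl_down` and `warp_reduce_sum` are hypothetical helper names that model the behaviour of a shuffle-down exchange, in which each lane reads the value held by the lane `delta` positions above it and keeps its own value when that lane would fall outside the warp.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Model of a shuffle-down: lane i reads lane i + delta, if it exists."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i]
            for i in range(n)]


def warp_reduce_sum(values):
    """Sum 32 per-lane values in log2(32) = 5 exchange steps."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds the value it received; only lane 0 is
        # guaranteed to hold the full sum at the end.
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


total = warp_reduce_sum(list(range(WARP_SIZE)))
print(total)  # 0 + 1 + ... + 31 = 496
```

The point of the pattern is that lanes exchange register values directly, so a reduction like this needs no shared-memory staging buffer.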

CUVILib v1.2 released

May 17th, 2012

TunaCode has released CUVILib v1.2, a library to accelerate imaging and computer vision applications. CUVILib accelerates imaging applications in the medical, industrial and defense domains, and supports both CUDA and OpenCL. Modules include color operations (demosaic, conversions, correction, etc.), linear/non-linear filtering, feature extraction & tracking, motion estimation, image transforms and image statistics.

More information, including a free trial version: http://www.cuvilib.com/

OpenCL SDK for new Intel Core Processors

April 27th, 2012

The Intel® SDK for OpenCL Applications now supports the OpenCL 1.1 full-profile on 3rd generation Intel® Core™ processors with Intel® HD Graphics 4000/2500. For the first time, OpenCL developers using Intel® architecture can utilize compute resources across both Intel® Processor and Intel HD Graphics. More information: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework which will include many new functionalities as well as boosted performance. This new version, in development for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta testing program has been started. See the complete text at http://www.rcuda.net/index.php/news/19-new-revolutionary-version-of-rcuda-to-be-launched.html.

Accelerate Your Science on the Titan Supercomputer

April 1st, 2012

Accelerate your science on the Titan Supercomputer later this year, by harnessing up to 20 petaflops of parallel processing using GPUs. Open to researchers from academia, government labs, and industry, the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program is the major means by which the scientific community gains access to some of the fastest supercomputers.

First, let INCITE know you are interested in GPU acceleration by completing a two-minute survey. Then determine if you want to submit a formal proposal by June 27, 2012.

Need help drafting your proposal? Attend a “how-to” webinar on Tuesday, April 24 to learn tips and tricks for drafting your proposal. For further questions about the call for proposals, please contact the INCITE manager at INCITE@DOEleadershipcomputing.org.

OpenCL Programming Webinar Series

March 30th, 2012

AMD offers an OpenCL Programming Webinar Series to help software developers become experts in the latest technologies, standards and best practices. The series of three OpenCL webinars will be presented by Rob Farber.

1. April 10th, 10AM PDT: Introducing Portable Parallelism

  • C and C++ APIs
  • OpenCL Memory Spaces
  • The OpenCL Execution Model

2. April 24th, 10AM PDT: Coordinating OpenCL Computations on One or More Heterogeneous Devices

  • How to Concisely Utilize Multiple Command Queues and Coordinate Tasks Across Multiple Heterogeneous Devices such as Two GPUs + CPU
  • Code Sample Discussion: Massively Parallel Random Number Test Framework

3. May 1st, 10AM PDT: Accelerate Rendering by an Order of Magnitude with OpenCL, Plus a View to the Multi-core and Web-enabled Future

  • How to use OpenCL to Provide High-Quality, Fast Rendering in Combination with Primitive Restart
  • Device Fission, Partitioning Hardware Capabilities for Optimal Resource Usage
  • Looking to the Future – WebCL

Registration is limited. More Information: http://developer.amd.com/zones/OpenCLZone/Events/pages/OpenCLWebinars.aspx

Latest PGI Compilers support OpenACC and CUDA for x86

March 6th, 2012

HPCWire reports:

PORTLAND, Ore., March 5 — The Portland Group, a wholly-owned subsidiary of STMicroelectronics, today announced availability of the 2012 release of the PGI line of high-performance parallelizing compilers and development tools for Linux, OS X and Windows. PGI 2012 is the first general release to include support for the OpenACC directive-based programming model for NVIDIA CUDA-enabled Graphics Processing Units (GPUs). This release is also the first to include the fully feature-enabled PGI CUDA C/C++ compiler for multi-core x64 CPUs from Intel and AMD. In addition, PGI 2012 includes a number of performance and feature enhancements for multi-core x64 processor-based HPC systems.

Acceleware OpenCL™ Training in NYC

February 28th, 2012

Developed in partnership with AMD, this four-day course is designed for GPU programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU.

Delivered by Acceleware’s Developers, who provide real world experience and examples, the training comprises classroom lectures and hands-on tutorials. Each student will be supplied with a laptop equipped with an AMD Fusion APU for the duration of the course. Small class sizes maximize learning and ensure a personal educational experience.

SpeedIT 2.0 released

February 24th, 2012

SpeedIT 2.0 and the SpeedIT plugin to OpenFOAM have been released. New features include:

  • One of the fastest sparse matrix-vector multiplication implementations worldwide.
  • Faster Conjugate Gradient and BiConjugate Gradient solvers.
  • State-of-the-art CMRS format for storing sparse matrices. The format requires less memory than CRS or HYB (from CUSPARSE and CUSP).
  • Faster acceleration in OpenFOAM (Computational Fluid Dynamics).
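For readers unfamiliar with the baseline that CMRS is compared against, the sketch below shows the standard CRS (compressed row storage) layout and a sparse matrix-vector product over it. It does not show CMRS, whose layout is not described here, and it is not SpeedIT code. The function names are hypothetical.

```python
def crs_from_dense(dense):
    """Build the three CRS arrays from a dense list-of-rows matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # nonzero entries, row by row
                col_idx.append(j)  # column of each stored entry
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def crs_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x using only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [
    [4, 0, 0],
    [0, 0, 2],
    [1, 3, 0],
]
values, col_idx, row_ptr = crs_from_dense(dense)
y = crs_spmv(values, col_idx, row_ptr, [1, 2, 3])
print(values, col_idx, row_ptr)  # [4, 2, 1, 3] [0, 2, 0, 1] [0, 1, 2, 4]
print(y)                         # [4, 6, 7]
```

CRS stores one value and one column index per nonzero plus one row pointer per row, which is the memory footprint that a more compact format would have to improve on.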

More information is available at http://speed-it.vratis.com.
