This webinar recording provides an overview of the profiling techniques and the tools available to help you optimize your code. It examines NVIDIA’s Visual Profiler and cuobjdump and highlight the various methods available for understanding the performance of CUDA program. The second part of the session focuses on debugging techniques and the tools available to help identify issues in kernels. The debugging tools provided in CUDA 5.5 including NSight and cuda-memcheck are discussed. The webinar recording can be accessed here.
The latest release 1.5.0 of the free open source linear algebra library ViennaCL is now available for download. The library provides a high-level C++ API similar to Boost.ublas and aims at providing the performance of accelerators at a high level of convenience without having to deal with hardware details. Some of the highlights from the ChangeLog are as follows: Vectors and matrices of integers are now supported, multiple OpenCL contexts can be used in a fully multi-threaded manner, products of sparse and dense matrices are now available, and certain BLAS functionality is also provided through a shared library for use with programming languages other than C++, e.g. C, Fortran, or Python.
Webinar: How to Improve Performance using the CUDA Memory Model and Features of the Kepler ArchitectureDecember 20th, 2013
This webinar explores the memory model of the GPU and the memory enhancements available in the Kepler architecture, and how these will affect performance optimization. The webinar begins with an essential overview of GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. We define shared, constant and global memory and discuss the best locations to store your application data for optimized performance. The shuffle instruction, new shared memory configurations and Read-Only Data Cache of the Kepler architecture are introduced and optimization techniques discussed. Click here to view the webinar recording.
G-BLASTN is a GPU-accelerated nucleotide alignment tool based on the widely used NCBI-BLAST. G-BLASTN can produce exactly the same results as NCBI-BLAST, and it also has very similar user commands. It also supports a pipeline mode, which can fully use the GPU and CPU resources when handling a batch of medium to large sized queries. Currently, G_BLASTN supports the blastn and megablast modes of NCBI-BLAST. The discontiguous megablast mode is not supported yet. More information: http://www.comp.hkbu.edu.hk/~chxw/software/G-BLASTN.html
VexCL is a modern C++ library created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides a convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector multiplication, etc. The source code is available under the permissive MIT license. As of v1.0.0, VexCL provides two backends: OpenCL and CUDA. Users may choose either of those at compile time with a preprocessor macro definition. More information is available at the GitHub project page and release notes page.
AMD CodeXL is a free set of tools for GPU debugging, GPU profiling, static analysis of OpenCL kernels, and CPU profiling, including support for remote servers. For more information and download links, see: http://developer.amd.com/community/blog/2013/11/08/codexl-1-3-released/
Bolt is an STL compatible C++ template library for creating data-parallel applications using C++ (no C++ AMP / OpenCL code required). For more information about the Bolt template library and download links, see: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/bolt-c-template-library/
AMD APP SDK has everything needed to get started with OpenCL and parallel programming. It includes OpenCL samples that are very easy to compile, as well as the Bolt and other libraries. For more information about AMD APP SDK and download links, see: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
Allinea DDT is part of Allinea Software’s unified tools platform, which provides a single powerful and intuitive environment for debugging and profiling of parallel and multithreaded applications. It is widely used by computational scientists and scientific programmers to fix software defects of parallel applications running on hybrid GPU clusters and supercomputers. DDT 4.1.1 supports CUDA 5.5, C++11 and the GNU 4.8 compilers. Also introduced with Allinea DDT 4.1.1 is CUDA toolkit debugging support for ARMv7 architectures. More information: http://www.allinea.com
The Libra 3.0 Heterogeneous Cloud Computing SDK has recently been released by GPU Systems. It supports PC, Tablet and Mobile Devices and includes a new virtualizing function for cloud compute services of local and remote CPUs and GPUs. C/C++, Java, C# and Matlab are supported. Read the full press release here.
One of the keys to achieving maximum performance in CUDA is taking advantage of the various memory spaces. Part II of Acceleware’s tutorial has now been published. The tutorial uses a simple encryption kernel to test and compare read-only cache, constant cache and global memory. Read the full tutorial…