This webinar provides an overview of the improved performance analysis tools available in CUDA 6.0 and of key optimization strategies for compute-, latency-, and memory-bound problems. It covers techniques for ensuring peak utilization of CUDA cores, including improving branching efficiency, using intrinsic functions, and unrolling loops. Optimal access patterns for global and shared memory are presented, including a comparison between the Fermi and Kepler architectures. To view the webinar go to: http://acceleware.com/blog/webinar-essential-cuda-optimization-techniques
Developed in partnership with NVIDIA, this hands-on four-day course will teach you how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. This course has a finance focus: commonly used algorithms such as random number generation and Monte Carlo simulations will be used and profiled in the examples. A background in finance is not necessary. For more information please visit: http://acceleware.com/training/988
The Cf4ocl project is a GPLv3/LGPLv3 initiative to provide an object-oriented interface to the OpenCL C API with integrated profiling, promoting the rapid development of OpenCL host programs and avoiding boilerplate code. Its main goal is to allow developers to focus on OpenCL device code. After two alpha releases, the first beta is out, and can be tested on Linux, Windows and OS X. The framework is independent of the OpenCL platform version and vendor, and includes utilities to simplify the analysis of the OpenCL environment and of kernel requirements. While the project is making progress, it does not yet support OpenGL/DirectX interoperability, sub-devices, pipes, or SVM.
Cf4ocl can be downloaded from http://fakenmc.github.io/cf4ocl/.
Version 2.0 of OpenCLIPP, an open-source OpenCL library of computer vision and image processing primitives, has been released. For more information about the library, to contribute code, or to download it, please refer to the OpenCLIPP website.
This tutorial will begin with a brief overview of OpenCL and data-parallelism before focusing on the GPU programming model. We will explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax and work-item hierarchy. For more information and to register visit: http://acceleware.com/event/introduction-opencl-using-amd-gpus
CUDPP release 2.2 is a feature release that adds a new parallel primitive and improves some existing primitives. We have added cudppSuffixArray, a parallel skew algorithm (SA) implementation that computes the suffix array of a string. This suffix array primitive is now used in burrowsWheelerTransform, delivering better performance than CUDPP 2.1’s use of cudppStringSort. The new BWT is further used in cudppCompress, which is now faster than the original parallel compression and supports compression of text containing all possible unsigned char values. Some bugs in cudppMoveToFrontTransform and cudppStringSort have also been fixed. OS X users might also be interested in how we supported the use of OS X’s clang compiler in OS X Mavericks (10.9).
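To illustrate how a suffix array leads directly to a Burrows-Wheeler transform, here is a minimal sequential sketch in Python. It is not CUDPP code and does not use the skew algorithm: the suffix array is built by naively sorting suffixes, and the function names (`suffix_array`, `bwt_from_suffix_array`) and the sentinel byte are assumptions made only for this example. Unlike cudppCompress, this sketch reserves one byte value as an end marker, so it cannot handle input containing that byte.

```python
def suffix_array(data: bytes) -> list[int]:
    """Return the start positions of all suffixes of `data` in sorted order.

    Naive approach (sort the suffixes directly); a parallel library would
    use a far more efficient construction.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def bwt_from_suffix_array(text: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Compute the Burrows-Wheeler transform of `text` via its suffix array.

    `sentinel` is a hypothetical end-of-string marker; it must be a single
    byte that does not occur in `text` and sorts before every byte in it.
    """
    if len(sentinel) != 1 or sentinel in text:
        raise ValueError("sentinel must be one byte that is absent from text")
    if text and min(text) <= sentinel[0]:
        raise ValueError("sentinel must sort before every byte in text")
    s = text + sentinel
    sa = suffix_array(s)
    # The BWT character for each sorted suffix is the byte just before it;
    # for the suffix starting at 0, s[-1] wraps around to the sentinel.
    return bytes(s[i - 1] for i in sa)
```

For example, `bwt_from_suffix_array(b"banana", sentinel=b"$")` yields `b"annb$aa"`: equal characters are grouped together, which is what makes the transformed text easier to compress.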
SpeedIT FLOW is a RANS single-phase fluid flow solver that runs entirely on the GPU. Benchmark results for external aerodynamic flow and other industry-relevant OpenFOAM cases on a single GPU card indicate approximately 3x faster time to solution than an Intel Xeon E5649 running 12 cores. This is about two times faster than competing solutions that offer only partial acceleration on the GPU. More details are available on this blog.
This hands-on four-day course teaches how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. More details and registration: http://acceleware.com/training/986
Hybrid Fortran is an open-source, directive-based extension for the Fortran language. It is a way for HPC programmers to keep writing Fortran code the way they are used to, only now with GPGPU support. It achieves performance portability by allowing different storage orders and loop structures for the CPU and GPU versions. All computational code stays the same as in the respective CPU version; for example, it can be kept in a low dimensionality even when the GPU version needs to be privatised in more dimensions in order to achieve a speedup. Hybrid Fortran takes care of the necessary transformations at compile time, so there is no runtime overhead. A Python-based preprocessor parses these annotations together with the Fortran user code structure, declarations, accessors and procedure calls, and then writes two separate versions of the code: one for the CPU with OpenMP parallelization and one for the GPU with CUDA Fortran. More details: http://typhooncomputing.com/?p=416
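The idea of using a different storage order per target can be sketched as follows. This Python snippet is not part of Hybrid Fortran and does not reflect how its preprocessor works; the function `flat_index`, the axis orders, and the array shape are all assumptions chosen for illustration. It shows how the same logical index `(i, j, k)` maps to a different memory offset depending on which axis is stored contiguously.

```python
def flat_index(idx: tuple[int, ...], shape: tuple[int, ...],
               order: tuple[int, ...]) -> int:
    """Map a logical multi-dimensional index to a flat memory offset.

    `order` lists the axes from fastest-varying (contiguous in memory)
    to slowest-varying.
    """
    offset, stride = 0, 1
    for axis in order:
        offset += idx[axis] * stride
        stride *= shape[axis]
    return offset


# Hypothetical 3-D field with logical axes (i, j, k).
shape = (4, 3, 2)

# Assumed CPU layout: k is contiguous, so an inner loop over k
# walks through memory sequentially.
cpu_order = (2, 0, 1)

# Assumed GPU layout: i is contiguous, so neighbouring threads
# indexed by i touch neighbouring addresses.
gpu_order = (0, 1, 2)

cpu_offset = flat_index((1, 2, 1), shape, cpu_order)
gpu_offset = flat_index((1, 2, 1), shape, gpu_order)
```

The computational code only ever refers to `(i, j, k)`; choosing the order per target is what changes the memory layout. In this sketch the mapping is computed at run time, whereas the source states that Hybrid Fortran performs its transformations at compile time.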
The course on Antenna Synthesis (with elements of GPU computing) is organized in the framework of the European School of Antennas. The course will take place at the Partenope Conference Center of the Università di Napoli Federico II, Napoli, Italy, on October 13-17, 2014. It covers three topics: the two main aspects of antenna synthesis, namely external and internal synthesis, and the numerical and implementation issues that arise when running synthesis algorithms on High Performance Computing (HPC) platforms. For details about the course please see this brochure and http://www.antennasvce.org/Community/Education/Courses?id_folder=533.