This blog takes a closer look at constant cache and read-only cache. It highlights the differences between the two memory types and what circumstances they perform best in. Read the whole story here.
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK, and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (“catch-all”) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
(M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: “A unified sparse matrix data format for modern processors with wide SIMD units.” Submitted, July 2013 [preprint])
Acceleware, partnering with NVIDIA, if offering a four-day CUDA training course designed for programmers in the oil and gas industry who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms in oil and gas such as filtering and FFTs will be used and profiled in the examples. The case study on day 4 will focus on efficient implementation of 3D convolution, which is highly applicable to reverse time migration. A background in oil and gas is not necessary. Register here.
Acceleware training program manager Kelly Goss is hosting a 4 day C++ AMP training course in Boston, MA September 10 -13. The course will cover an overview of GPU computing, C++ 11 Lambda syntax, data-parallel architectures and the C++ AMP programming model. Every day consists of lectures and hands-on exercises (laptops for student’s use provided for the duration of the course). Click here for full details and registration.
This webinar, scheduled for Wednesday, August 7 at 10 a.m. PDT, will cover the latest schedule for GPUDirect RDMA, scaling and optimization techniques for maximizing application performance using MVAPICH2, and the latest advancements of CUDA. Join speakers from Ohio State University, NVIDIA and Mellanox Technologies. Register by visiting www.gputechconf.com/gtcexpress.
The rCUDA team is glad to announce that its remote GPU virtualization technology now supports the ARM processor architecture. The new release of rCUDA for this low-power processor has been developed for the Ubuntu 11.04 and Ubuntu 12.04 ARM linux distributions. With this new rCUDA release, it is also possible to leverage hybrid platforms where the application uses ARM CPUs while requesting acceleration services provided by remote GPUs installed in x86 nodes. The opposite is also possible: an application running in an x86 computer can access remote GPUs attached to ARM systems. Please visit rCUDA website for more information or for requesting a free copy of the rCUDA middleware.
Algorithmic trading has become ever more popular in recent years – accounting for approximately half of all European and American stock trades placed in 2012. The trading strategies need to be back-tested regularly using historical market data for calibration and to check the expected return and risk. This is a computationally demanding process that can take hours to complete. However, back-testing the strategies frequently intra-day can significantly increase the profits for the trading institution.
Anatoly Baksheev, OpenCV GPU Module Team Leader at Itseez will demonstrate how to obtain and build OpenCV, its GPU module, and the sample programs. You will learn how to use the OpenCV GPU module and create your own custom GPU functions for OpenCV. Register for the July 30th webinar: http://goo.gl/5V3eA
The Thrust team is pleased to announce the release of Thrust v1.7, an open-source C++ library for developing high-performance parallel applications. Modeled after the C++ Standard Template Library, Thrust brings a familiar abstraction layer to the realm of parallel computing
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well stencil versions of partitioning algorithms. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance.
Thrust is open-source software distributed under the OSI-approved Apache License 2.0.
LibBi is a software package for state-space modelling and Bayesian inference on modern computer hardware, including multi-core central processing units (CPUs), many-core graphics processing units (GPUs) and distributed-memory clusters of such devices. The software parses a domain-specific language for model specification, then optimises, generates, compiles and runs code for the given model, inference method and hardware platform. In presenting the software, this work serves as an introduction to state-space models and the specialised methods developed for Bayesian inference with them. The focus is on sequential Monte Carlo (SMC) methods such as the particle filter for state estimation, and the particle Markov chain Monte Carlo (PMCMC) and SMC^2 methods for parameter estimation. All are well-suited to current computer hardware. Two examples are given and developed throughout, one a linear three-element windkessel model of the human arterial system, the other a nonlinear Lorenz ’96 model. These are specified in the prescribed modelling language, and LibBi demonstrated by performing inference with them. Empirical results are presented, including a performance comparison of the software with different hardware configurations.
(Lamwrence M. Murray: “Bayesian state-space modelling on high-performance hardware using LibBi”, Preprint, June 2013. [arXiv])
This 1-hour webinar (June 11, 10am-11am PST) introduces the powerful OpenCV library, shows how this library has been accelerated using CUDA on NVIDIA GPUs, and demonstrates how to use the OpenCV GPU library to create lightning-fast applications. Free registration: http://bit.ly/11eqoaJ