A unified sparse matrix data format for modern processors with wide SIMD units

September 4th, 2013

Abstract:

Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK, and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (“catch-all”) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.

(M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: “A unified sparse matrix data format for modern processors with wide SIMD units.” Submitted, July 2013 [preprint])
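
For illustration, here is a minimal sketch of how a SELL-C-sigma structure can be built from CRS arrays, following the format’s published description: rows are sorted by length within windows of sigma rows, grouped into chunks of C rows, padded to the length of the chunk’s longest row, and stored column-major within each chunk so that C rows can be processed with one SIMD instruction. The struct and function names below are assumptions for illustration, not taken from the paper’s code.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Sketch of a SELL-C-sigma structure built from CRS arrays.
    struct SellMatrix {
        int C;                       // chunk height (matched to the SIMD width)
        std::vector<int> chunkPtr;   // start of each chunk in val/colIdx
        std::vector<int> rowPerm;    // sorted row -> original row
        std::vector<double> val;     // nonzeros, column-major inside each chunk
        std::vector<int> colIdx;     // column indices, same layout as val
    };

    SellMatrix buildSellCSigma(int n, const std::vector<int>& rowPtr,
                               const std::vector<int>& col,
                               const std::vector<double>& v,
                               int C, int sigma) {
        SellMatrix A;
        A.C = C;
        A.rowPerm.resize(n);
        std::iota(A.rowPerm.begin(), A.rowPerm.end(), 0);
        // 1. Sort rows by descending length within windows of sigma rows, so
        //    rows sharing a chunk have similar lengths (less zero padding).
        for (int w = 0; w < n; w += sigma) {
            int end = std::min(w + sigma, n);
            std::sort(A.rowPerm.begin() + w, A.rowPerm.begin() + end,
                      [&](int a, int b) {
                          return rowPtr[a + 1] - rowPtr[a] > rowPtr[b + 1] - rowPtr[b];
                      });
        }
        // 2. Group C consecutive sorted rows into a chunk, padded to the
        //    length of its longest row.
        int nChunks = (n + C - 1) / C;
        A.chunkPtr.assign(nChunks + 1, 0);
        for (int c = 0; c < nChunks; ++c) {
            int len = 0;
            for (int i = c * C; i < std::min((c + 1) * C, n); ++i) {
                int r = A.rowPerm[i];
                len = std::max(len, rowPtr[r + 1] - rowPtr[r]);
            }
            A.chunkPtr[c + 1] = A.chunkPtr[c] + len * C;
        }
        // 3. Store nonzeros column-major within each chunk; pad with zeros.
        A.val.assign(A.chunkPtr[nChunks], 0.0);
        A.colIdx.assign(A.chunkPtr[nChunks], 0);
        for (int c = 0; c < nChunks; ++c)
            for (int i = 0; i < C && c * C + i < n; ++i) {
                int r = A.rowPerm[c * C + i];
                for (int j = 0; j < rowPtr[r + 1] - rowPtr[r]; ++j) {
                    A.val[A.chunkPtr[c] + j * C + i] = v[rowPtr[r] + j];
                    A.colIdx[A.chunkPtr[c] + j * C + i] = col[rowPtr[r] + j];
                }
            }
        return A;
    }

In the matching spMVM kernel, one SIMD instruction can then process one column of a chunk, i.e. C rows at once; the results come out in permuted order, so rowPerm must be applied to map them back. C is typically chosen to match the hardware SIMD width, while sigma trades sorting cost and access locality against padding overhead, which is the tuning question the paper studies.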

Acceleware offers upcoming CUDA and C++ AMP training events

September 2nd, 2013

Acceleware, partnering with NVIDIA, is offering a four-day CUDA training course designed for programmers in the oil and gas industry who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms in oil and gas, such as filtering and FFTs, will be used and profiled in the examples. The case study on day 4 will focus on an efficient implementation of 3D convolution, which is highly applicable to reverse time migration. A background in oil and gas is not necessary. Register here.

Acceleware training program manager Kelly Goss is hosting a four-day C++ AMP training course in Boston, MA, September 10-13. The course will cover an overview of GPU computing, C++11 lambda syntax, data-parallel architectures, and the C++ AMP programming model. Each day consists of lectures and hands-on exercises (laptops provided for students’ use for the duration of the course). Click here for full details and registration.

Towards Performance-Portable, Scalable and Convenient Linear Algebra

August 16th, 2013

Abstract:

The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses the heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.
We address both issues, programmability and portable performance, in this work: On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator for device-specific OpenCL kernels at runtime, which is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.

(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar ’13), 2013 [PDF].)
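
As a hedged illustration of the runtime kernel generation described in the abstract, the sketch below emits OpenCL C source for a simple AXPY operation with a device-specific unroll factor; an auto-tuner would compile and time several such variants and keep the fastest per device. The function name and the choice of tuning parameter are assumptions for illustration, not the paper’s actual generator.

    #include <sstream>
    #include <string>

    // Emit OpenCL C source for y += alpha * x, specialized at runtime with a
    // device-specific unroll factor (illustrative; not the paper's generator).
    std::string generateAxpySource(int unroll) {
        std::ostringstream src;
        src << "__kernel void axpy(float alpha, __global const float* x,\n"
               "                   __global float* y, uint n) {\n"
            << "  uint i = get_global_id(0) * " << unroll << "u;\n";
        for (int u = 0; u < unroll; ++u)
            src << "  if (i + " << u << "u < n) "
                << "y[i + " << u << "u] += alpha * x[i + " << u << "u];\n";
        src << "}\n";
        return src.str();
    }

An auto-tuning pass would build each candidate variant with clBuildProgram, benchmark it on the target device, and cache the best-performing configuration, which is how portable performance can be obtained from a single high-level operation specification.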

Webinar: Accelerating High Performance Computing with GPUDirect RDMA

August 4th, 2013

This webinar, scheduled for Wednesday, August 7 at 10 a.m. PDT, will cover the latest schedule for GPUDirect RDMA, scaling and optimization techniques for maximizing application performance using MVAPICH2, and the latest advancements in CUDA. Join speakers from Ohio State University, NVIDIA, and Mellanox Technologies. Register by visiting www.gputechconf.com/gtcexpress.

rCUDA now available for the ARM architecture

July 26th, 2013

The rCUDA team is glad to announce that its remote GPU virtualization technology now supports the ARM processor architecture. The new release of rCUDA for this low-power processor has been developed for the Ubuntu 11.04 and Ubuntu 12.04 ARM Linux distributions. With this new rCUDA release, it is also possible to leverage hybrid platforms where an application running on ARM CPUs requests acceleration services from remote GPUs installed in x86 nodes. The opposite is also possible: an application running on an x86 computer can access remote GPUs attached to ARM systems. Please visit the rCUDA website for more information or to request a free copy of the rCUDA middleware.

Back Testing of HFT Strategies with Xcelerit and GPUs

July 26th, 2013

Algorithmic trading has become ever more popular in recent years, accounting for approximately half of all European and American stock trades placed in 2012. Trading strategies need to be back-tested regularly using historical market data, both to calibrate them and to check the expected return and risk. This is a computationally demanding process that can take hours to complete. However, back-testing the strategies frequently intra-day can significantly increase profits for the trading institution.
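
To make the workload concrete, here is a toy sketch of a single back-test run, a moving-average crossover strategy evaluated over a historical price series. It is purely illustrative and not Xcelerit’s API; in practice many thousands of such parameter combinations are evaluated independently, which is why the problem maps well to GPUs.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Toy back-test: go long when the fast moving average is above the slow
    // one; returns the accumulated profit and loss over the price history.
    double backtest(const std::vector<double>& price, int fast, int slow) {
        double pnl = 0.0;
        bool inPosition = false;
        double entry = 0.0;
        for (std::size_t t = std::max(fast, slow); t < price.size(); ++t) {
            // Mean of the last w prices before time t.
            auto mean = [&](int w) {
                return std::accumulate(price.begin() + t - w,
                                       price.begin() + t, 0.0) / w;
            };
            bool longSignal = mean(fast) > mean(slow);
            if (longSignal && !inPosition) {
                inPosition = true;
                entry = price[t];
            } else if (!longSignal && inPosition) {
                pnl += price[t] - entry;
                inPosition = false;
            }
        }
        return pnl;
    }
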


OpenCV and CUDA webinar, July 30th

July 23rd, 2013

Anatoly Baksheev, OpenCV GPU Module Team Leader at Itseez will demonstrate how to obtain and build OpenCV, its GPU module, and the sample programs. You will learn how to use the OpenCV GPU module and create your own custom GPU functions for OpenCV. Register for the July 30th webinar: http://goo.gl/5V3eA
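
As a taste of what working with the GPU module looks like, here is a minimal sketch assuming the OpenCV 2.x-era gpu module; the image file names are placeholders. Data is uploaded to a GpuMat, processed on the device, and downloaded back to the host.

    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>

    int main() {
        // Load a grayscale image on the host ("input.png" is a placeholder).
        cv::Mat host = cv::imread("input.png", CV_LOAD_IMAGE_GRAYSCALE);
        cv::gpu::GpuMat d_src, d_dst;
        d_src.upload(host);                  // host -> device copy
        // Run a simple threshold on the GPU.
        cv::gpu::threshold(d_src, d_dst, 128.0, 255.0, cv::THRESH_BINARY);
        cv::Mat result;
        d_dst.download(result);              // device -> host copy
        cv::imwrite("output.png", result);
        return 0;
    }
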

Heterogeneous compute event during Siggraph 2013

July 19th, 2013

The HSA Foundation will be hosting a Birds of a Feather session on heterogeneous computing on July 24 from 1-2 p.m., at the Anaheim Convention Center, Room 202B. For more info: http://slidesha.re/16JSqK7

GPU Technology Conference 2014 Call for Submissions

July 14th, 2013

GPU Technology Conference (GTC) is NVIDIA’s annual developer event and consistently attracts the world’s best and brightest GPU developers. It creates opportunities for connection and learning through technical sessions and in-depth tutorials in science, professional graphics, game development, mobile computing, cloud computing, and automotive applications, as well as through first-hand interactions with peers, luminaries, and emerging and established companies.

If you are doing innovative work using GPUs, please submit a proposal at https://gtc2014.consenseus.com/

The deadline is Friday, September 27.

Acceleware Training

July 14th, 2013

Acceleware recently announced several upcoming courses:

  • CUDA for Finance: December 10-13, 2013, New York, NY [Details]
  • OpenCL: October 22-25, 2013, Houston, TX [Details]
  • CUDA: September 24-27, 2013 [Details]
  • C++ AMP: September 10-13, 2013 [Details]
