Towards Performance-Portable, Scalable and Convenient Linear Algebra

August 16th, 2013


The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.
We address both issues, programmability and portable performance, in this work. On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying runtime generator for device-specific OpenCL kernels, which is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.

(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar ’13), 2013 [PDF].)
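To make the auto-tuning idea from the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' kernel generator and does not use OpenCL: the `blocked_matmul` kernel, the `autotune` helper, and the list of candidate block sizes are all hypothetical stand-ins. The only point is the basic loop of running each candidate configuration, timing it, and keeping the fastest one for the current machine.

```python
import time


def blocked_matmul(a, b, block):
    """Multiply square matrices (lists of lists) using block x block tiles."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(kernel, args, candidates, repeats=3):
    """Time the kernel for every candidate parameter and return the fastest.

    Returns (best_parameter, timings), where timings maps each candidate
    to its best observed wall-clock time in seconds.
    """
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, param)
            best = min(best, time.perf_counter() - start)
        timings[param] = best
    return min(timings, key=timings.get), timings


# Hypothetical test problem: a small matrix times the identity.
n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i == j) for j in range(n)] for i in range(n)]

best_block, timings = autotune(blocked_matmul, (a, b), candidates=[4, 8, 16, 32])
print("fastest block size on this machine:", best_block)
```

The winning block size will differ from machine to machine, which is exactly why a tuned parameter set is usually stored per device rather than hard-coded.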

Webinar: Accelerating High Performance Computing with GPUDirect RDMA

August 4th, 2013

This webinar, scheduled for Wednesday, August 7 at 10 a.m. PDT, will cover the latest schedule for GPUDirect RDMA, scaling and optimization techniques for maximizing application performance using MVAPICH2, and the latest advancements of CUDA. Join speakers from Ohio State University, NVIDIA and Mellanox Technologies. Register by visiting

rCUDA now available for the ARM architecture

July 26th, 2013

The rCUDA team is glad to announce that its remote GPU virtualization technology now supports the ARM processor architecture. The new release of rCUDA for this low-power processor has been developed for the Ubuntu 11.04 and Ubuntu 12.04 ARM Linux distributions. With this new rCUDA release, it is also possible to leverage hybrid platforms where the application uses ARM CPUs while requesting acceleration services provided by remote GPUs installed in x86 nodes. The opposite is also possible: an application running on an x86 computer can access remote GPUs attached to ARM systems. Please visit the rCUDA website for more information or to request a free copy of the rCUDA middleware.

Back Testing of HFT Strategies with Xcelerit and GPUs

July 26th, 2013

Algorithmic trading has become ever more popular in recent years – accounting for approximately half of all European and American stock trades placed in 2012. The trading strategies need to be back-tested regularly using historical market data for calibration and to check the expected return and risk. This is a computationally demanding process that can take hours to complete. However, back-testing the strategies frequently intra-day can significantly increase the profits for the trading institution.
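For readers unfamiliar with back-testing, the following is a minimal sketch of the idea in plain Python. It has nothing to do with the Xcelerit SDK or GPUs: the moving-average crossover rule, the `moving_average` and `backtest` helpers, and the price series are all hypothetical and chosen only for illustration. The strategy is replayed over historical prices, and the resulting profit and loss per step gives a rough view of return and risk.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until enough samples exist."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def backtest(prices, fast=2, slow=4):
    """Hold one unit whenever the fast average is above the slow average.

    Returns the profit and loss of each time step.
    """
    fast_ma = moving_average(prices, fast)
    slow_ma = moving_average(prices, slow)
    pnl = []
    for t in range(1, len(prices)):
        # The decision for step t uses data up to t-1 only (no look-ahead).
        signal = slow_ma[t - 1] is not None and fast_ma[t - 1] > slow_ma[t - 1]
        pnl.append(prices[t] - prices[t - 1] if signal else 0.0)
    return pnl


# Hypothetical price history, not real market data.
prices = [100.0, 101.0, 102.0, 104.0, 103.0, 105.0, 107.0, 106.0]

pnl = backtest(prices)
total_return = sum(pnl)   # overall profit of the strategy
worst_step = min(pnl)     # a crude risk indicator
print(total_return, worst_step)
```

A real back-test repeats this over many parameter combinations and far longer histories, and that repetition is the part that becomes computationally demanding.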


OpenCV and CUDA webinar, July 30th

July 23rd, 2013

Anatoly Baksheev, OpenCV GPU Module Team Leader at Itseez, will demonstrate how to obtain and build OpenCV, its GPU module, and the sample programs. You will learn how to use the OpenCV GPU module and create your own custom GPU functions for OpenCV. Register for the July 30th webinar:

Heterogeneous compute event during Siggraph 2013

July 19th, 2013

The HSA Foundation will be hosting a Birds of a Feather session on heterogeneous computing on July 24 from 1-2 p.m., at the Anaheim Convention Center, Room 202B. For more info:

GPU Technology Conference 2014 Call for Submissions

July 14th, 2013

GPU Technology Conference (GTC) is NVIDIA’s annual developer event and consistently attracts the world’s best and brightest GPU developers, creating opportunities for connection and learning through technical sessions and in-depth tutorials in science, professional graphics, game development, mobile computing, cloud computing and automotive applications, as well as first-hand interactions with peers, luminaries, and emerging and established companies.

If you are doing innovative work using GPUs, please submit a proposal at

The deadline is Friday, September 27.

Acceleware Training

July 14th, 2013

Acceleware recently announced several courses:

  • CUDA for Finance: December 10 – 13, 2013, New York, NY [Details]
  • OpenCL: October 22 – 25, 2013, Houston, TX [Details]
  • CUDA: September 24-27, [Details]
  • C++ AMP: September 10-13, [Details]


Acceleration of iterative Navier-Stokes solvers on graphics processing units

July 14th, 2013


While new power-efficient computer architectures exhibit spectacular theoretical peak performance, they require specific conditions to operate efficiently, which makes porting complex algorithms a challenge. Here, we report results of the semi-implicit method for pressure linked equations (SIMPLE) and the pressure implicit with operator splitting (PISO) methods implemented on the graphics processing unit (GPU). We examine the advantages and disadvantages of the full porting over a partial acceleration of these algorithms run on unstructured meshes. We found that the full-port strategy requires adjusting the internal data structures to the new hardware and proposed a convenient format for storing internal data structures on GPUs. Our implementation is validated on standard steady and unsteady problems and its computational efficiency is checked by comparing its results and run times with those of some standard software (OpenFOAM) run on a central processing unit (CPU). The results show that a server-class GPU outperforms a server-class dual-socket multi-core CPU system running essentially the same algorithm by up to a factor of 4.

See also the supplementary materials and the follow-up at

(Tadeusz Tomczak, Katarzyna Zadarnowska, Zbigniew Koza, Maciej Matyka and Łukasz Mirosław: “Acceleration of iterative Navier-Stokes solvers on graphics processing units”, International Journal of Computational Fluid Dynamics, accepted, July 2013. [DOI])

AMD Releases APP SDK 2.8.1 with support for Bolt C++ Template Library, OpenCV, and GCN

July 14th, 2013

From a recent press release:

AMD’s APP SDK is an essential resource for developers who wish to leverage the processing power of heterogeneous computing. OpenCL™ is the primary mechanism for achieving this today, but AMD’s goal is to enable developers to accelerate applications with the programming paradigm of their choice. Toward that end, AMD has added support for heterogeneous libraries such as the newly released Bolt open source C++ template library and OpenCV computer vision library which now includes heterogeneous acceleration.

New to APP SDK 2.8.1:

Bolt: With the recent launch of Bolt 1.0, AMD has added several samples to the APP SDK to demonstrate Bolt 1.0 features. These showcase the usage of Bolt APIs such as scan, sort, reduce and transform. Other new samples highlight the ease of porting from STL and the performance benefits achieved over equivalent STL implementations. We’ve also included samples to demonstrate the different fallback options available in Bolt 1.0 when no GPU is available, which ensure that your code runs correctly on any platform.

OpenCV: AMD has been working closely with the OpenCV open source community to add heterogeneous acceleration capability to the world’s most popular computer vision library. These changes are already integrated into OpenCV and are readily available for developers who want to improve performance and efficiency of their computer vision applications. AMD has included samples to illustrate these improvements and highlight how simple it is to include them in your app.

GCN: AMD recently launched its new Graphics Core Next (GCN) architecture on several AMD products. GCN is based on a scalar architecture vs. the VLIW vector architecture of prior generations, so hand-tuned vectorization to optimize hardware utilization is no longer needed. We’ve modified several samples in AMD APP SDK 2.8.1 to show the ease of writing scalar code as compared to vectorization.

For more information, see
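For readers who have not met the four primitives named in the Bolt paragraph above (transform, reduce, scan and sort), here is a minimal sketch of what each one computes. It uses only the Python standard library and is not Bolt's C++ template API; the sample data is hypothetical. Bolt and the STL expose the same operations through C++ algorithms, so this only illustrates their semantics.

```python
from functools import reduce
from itertools import accumulate
import operator

# Hypothetical input data.
data = [5, 3, 8, 1]

# transform: apply a function to every element independently.
squared = list(map(lambda x: x * x, data))

# reduce: combine all elements into one value with a binary operator.
total = reduce(operator.add, data, 0)

# scan (inclusive prefix sum): running result of the same combination.
prefix_sums = list(accumulate(data, operator.add))

# sort: reorder the elements into ascending order.
ordered = sorted(data)

print(squared, total, prefix_sums, ordered)
```

All four are data-parallel building blocks, which is why libraries of this kind can offer them with both GPU and CPU back ends.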
