Optimizing CUDA Code By Kernel Fusion — Application on BLAS

May 11th, 2013

Abstract:

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, the memory locality can be often improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61x faster for the examples tested.

(J. Filipovič, M. Madzin, J. Fousek, L. Matyska: “Optimizing CUDA Code By Kernel Fusion – Application on BLAS”, submitted to Parallel Computing, May 2013.  [preprint])

ViennaCL 1.4.0 with CUDA, OpenCL and OpenMP support

December 3rd, 2012

The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:

  • Two computing backends in addition to OpenCL: CUDA and OpenMP
  • Improved performance for (Block-) ILU0/ILUT preconditioners
  • Optional level scheduling for ILU substitutions on GPUs
  • Mixed-precision CG solver
  • Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)

Any contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available at http://viennacl.sourceforge.net.

Accelerate OpenFOAM with SpeedIT 2.1

May 24th, 2012

SpeedIT provides a set of accelerated solvers for sparse linear systems of equations. The library supports C/C++ and Fortran, and it can be used with OpenFOAM to accelerate CFD simulations. SpeedIT 2.1 contains two new preconditioners:

• Algebraic Multigrid with Smoothed Aggregation (AMG)
• Approximate Inverse (AINV)

OpenFOAM simulations on the GPU can be up to 3.5x faster compared to CG and DIC/DILU preconditioners on the CPU and up to 1.6x faster if you run GAMG.

See the SpeedIT website and blog for more details.

Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU

April 1st, 2012

Abstract:

We present a new adaptive format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on a set of 1600 matrices and we show for what types of matrices our format is profitable.

(Heller M., Oberhuber T., “Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU“, preprint on Arxiv.org 2012, [PDF])

ViennaCL 1.2.0 released

January 2nd, 2012

Version 1.2.0 of the OpenCL-based C++ linear algebra library ViennaCL is now available for download! It features a high-level interface compatible with Boost.ublas, which allows for compact code and high productivity. Highlights of the new release are the following features (all experimental):

  • Several algebraic multigrid preconditioners
  • Sparse approximate inverse preconditioners
  • Fast Fourier transform
  • Structured dense matrices (circulant, Hankel, Toeplitz, Vandermonde)
  • Reordering algorithms (Cuthill-McKee, Gibbs-Poole-Stockmeyer)
  • Proxies for manipulating subvectors and submatrices

The features are expected to reach maturity in the 1.2.x branch. More information about the library including download links is available at http://viennacl.sourceforge.net.

On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures

December 14th, 2011

Abstract:

In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P).

Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.

(Pennycook, S.J., Hammond, S.D., Mudalige, G.R., Wright, S.A. and Jarvis, S.A.: “On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures”,  The Computer Journal (in press) [DOI] [PREPRINT])

Non negative least squares on GPU/multicore architectures

September 4th, 2011

Abstract:

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and {\em downdate} the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.

(Yuancheng Luo and Ramani Duraiswami, “Efficient Parallel Non-Negative Least Squares on Multicore Architectures”, SIAM Journal on Scientific Computing, accepted, Sep. 2011. [PDF] [Source code])

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

August 19th, 2011

Abstract:

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision brings up to a 7x speed up compared to the (latest) CUBLAS 4.0 NVIDIA library on the GTX 280 GPU. Our SYMV kernel tuned for Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

(R. Nath, S. Tomov, T. Dong, and J. Dongarra, “Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs”, accepted for SC’11.  [WWW] [PDF])

GPU Linear Solvers for OpenFOAM

May 4th, 2011

ofgpu is a free GPL library from Symscape that provides GPU linear solvers for OpenFOAM®. The experimental library targets NVIDIA CUDA devices on Windows, Linux, and (untested) Mac OS X. It uses the Cusp library’s Krylov solvers to produce equivalent GPU (CUDA-based) versions of the standard OpenFOAM linear solvers:

  • PCG – Preconditioned conjugate gradient solver for symmetric matrices (e.g., p)
  • PBiCG – Preconditioned biconjugate gradient solver for asymmetric matrices (e.g., Ux, k)

ofgpu also has support for the OpenFOAM preconditioners:

  • no
  • diagonal

For more details see “GPU Linear Solver Library for OpenFOAM”. OpenFOAM is a registered trademark of OpenCFD and is unaffiliated with Symscape.

A memory efficient and fast sparse matrix vector product on a GPU

May 4th, 2011

Abstract:

This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats it has both low memory footprint and good throughput. The new format, which we call Sliced ELLR-T has been designed specifically for accelerating the iterative solution of a large sparse and complex-valued system of linear equations arising in computational electromagnetics. Numerical tests have shown that the performance of the new implementation reaches 69 GFLOPS in complex single precision arithmetic. Compared to the optimized six core Central Processing Unit (CPU) (Intel Xeon 5680) this performance implies a speedup by a factor of six. In terms of speed the new format is as fast as the best format published so far and at the same time it does not introduce redundant zero elements which have to be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can be handled using low cost commodity GPUs with limited amount of on-board memory.

(A. Dziekonski, A. Lamecki, and M. Mrozowski: “A memory efficient and fast sparse matrix vector product on a GPU“, Progress In Electromagnetics Research, Vol. 116, 49-63, 2011. [PDF])

Page 1 of 512345