Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

April 10th, 2013


We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary across the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory, partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended to other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively, on an NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.

(Chetan Jhurani, “Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs”, submitted, April 2013. [preprint])
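
As a rough illustration of what a matrix-free Kronecker product action means in the 2-D case, the sketch below relies on the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ), with vec stacking columns. It is plain Python on the CPU, not the interface or GPU kernel from the paper; the function names (kron_apply, batched_kron_apply, and so on) are hypothetical and chosen only for this example.

```python
def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def vec(X):
    """Stack the columns of X into one flat list (column-major order)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def kron(A, B):
    """Explicit Kronecker product; built here only to check the matrix-free result."""
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def kron_apply(A, B, X):
    """Matrix-free action: returns Y = B X A^T, so that vec(Y) = (A kron B) vec(X)."""
    return matmul(matmul(B, X), transpose(A))

def batched_kron_apply(A, B, Xs):
    """Same component matrices A and B for the whole batch; only the operands vary."""
    return [kron_apply(A, B, X) for X in Xs]
```

The explicit product of two 16-by-16 matrices has 256 × 256 entries, whereas the matrix-free form only ever touches 16-by-16 arrays, which is one reason the action is usually applied without forming A ⊗ B.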

Fast GEMM for multiple small matrices on NVIDIA GPUs

April 9th, 2013


We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.

(Chetan Jhurani and Paul Mullowney, “A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices”, submitted to Journal of Parallel and Distributed Computing, April 2013. [preprint])
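
To put the quoted rates in perspective, the sketch below spells out what a batched GEMM computes and how much arithmetic the 100,000-pair example involves. It assumes the usual convention of 2n³ floating-point operations per n-by-n product (the abstract does not state how flops were counted), and the helper names are hypothetical.

```python
def gemm(A, B):
    """Reference product of two small matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def batched_gemm(As, Bs):
    """C[b] = A[b] * B[b] for every pair; the products are independent of each other."""
    return [gemm(A, B) for A, B in zip(As, Bs)]

def gemm_flops(n, batch):
    """Assumed convention: n^3 multiplies plus n^3 adds per n-by-n product."""
    return 2 * n**3 * batch

def seconds_at_rate(flops, gflops):
    """Time needed to execute `flops` operations at a sustained rate in GFlop/s."""
    return flops / (gflops * 1e9)

# 100,000 pairs of size 16 amount to 819,200,000 flops,
# which takes roughly 3.8 ms at 216 GFlop/s.
print(gemm_flops(16, 100_000), seconds_at_rate(gemm_flops(16, 100_000), 216))
```

Each individual product is tiny (8,192 flops for size 16), so throughput depends on processing many pairs at once rather than on the speed of any single multiplication.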

Accelerating GPU Kernels for Dense Linear Algebra

November 14th, 2011


Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to algorithms that are two times faster. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.

(R. Nath, S. Tomov and J. Dongarra: “Accelerating GPU Kernels for Dense Linear Algebra”, VECPAR 2010. [PDF])
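
The idea behind handling dimensions that are not a multiple of the block size can be sketched on a matrix-vector product. The version below is a CPU-side Python analogy, not the MAGMA kernel: it assumes that "redirecting" means out-of-range row indices are clamped to the last valid row so that every lane of a block performs identical work, with the surplus results discarded at write-back. The function name and the block size nb are hypothetical.

```python
def gemv_redirected(A, x, nb=4):
    """y = A x, computed in row blocks of nb lanes with out-of-range rows redirected."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for start in range(0, m, nb):
        partial = []
        for t in range(nb):
            # Redirect instead of branching: a lane past the last row
            # re-reads row m - 1, which is always a valid address.
            row = min(start + t, m - 1)
            partial.append(sum(A[row][j] * x[j] for j in range(n)))
        for t in range(nb):
            # Only lanes that map to a real row store their result.
            if start + t < m:
                y[start + t] = partial[t]
    return y
```

With m = 6 and nb = 4, the second block has two surplus lanes; both recompute row 5 and their results are dropped, so the compute loop itself contains no bounds checks.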

An Improved MAGMA GEMM For Fermi Graphics Processing Units

November 14th, 2011


We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi’s new architectural features, most notably its extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which are 58% and 63% of the theoretical peak, respectively. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with corresponding, currently available routines running on homogeneous multicore systems.

(R. Nath, S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])
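
The quoted percentages can be checked with a few lines of arithmetic. The peak figures used below (515 GFlop/s in double precision and 1030 GFlop/s in single precision for the Tesla C2050) are not stated in the abstract; they are assumed here from NVIDIA's published specifications for that card.

```python
# Assumed theoretical peaks for the Tesla C2050 (not given in the abstract).
C2050_PEAK_DP_GFLOPS = 515.0
C2050_PEAK_SP_GFLOPS = 1030.0

def percent_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate as a whole-number percentage of the theoretical peak."""
    return round(100.0 * achieved_gflops / peak_gflops)

print(percent_of_peak(300, C2050_PEAK_DP_GFLOPS))  # 58
print(percent_of_peak(645, C2050_PEAK_SP_GFLOPS))  # 63
```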

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

August 19th, 2011


GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision brings up to a 7x speedup compared to the (latest) CUBLAS 4.0 NVIDIA library on the GTX 280 GPU. Our SYMV kernel tuned for Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

(R. Nath, S. Tomov, T. Dong, and J. Dongarra, “Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs”, accepted for SC’11.  [WWW] [PDF])
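
To see where the irregular access pattern comes from, consider a SYMV that reads only the lower triangle of the matrix. The sketch below is a plain reference loop, not the algorithm proposed in the paper, and the function name is hypothetical. Each stored off-diagonal entry contributes to two different output elements, so the writes are scattered instead of being confined to one row.

```python
def symv_lower(A, x):
    """y = A x for symmetric A, reading only entries on or below the diagonal."""
    n = len(x)
    y = [0] * n
    for i in range(n):
        for j in range(i + 1):
            a = A[i][j]
            y[i] += a * x[j]          # contribution of A[i][j]
            if i != j:
                y[j] += a * x[i]      # contribution of the mirrored entry A[j][i]
    return y

# The upper triangle is never read, so it can hold arbitrary values.
A = [[2, 99, 99],
     [1,  3, 99],
     [0,  4,  5]]
print(symv_lower(A, [1, 2, 3]))  # [4, 19, 23]
```

On a GPU, the update to y[j] is the awkward part: many threads working on different rows may need to accumulate into the same output entry.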

MAGMA 1.0 – LAPACK for GPUs – has been released

December 14th, 2010

MAGMA 1.0 RC1 is now available, including the MAGMA sources. MAGMA 1.0 RC1 is intended for a single CUDA enabled NVIDIA GPU. It extends version 0.2 by adding support for Fermi GPUs (see the sample performances for LU, QR, and Cholesky).

Included are routines for the following algorithms:

  • LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double);
  • Linear solvers based on LU, QR, and Cholesky in both real and complex arithmetic (single and double);
  • Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in both real and complex arithmetic;
  • MAGMA BLAS in real arithmetic (single and double), including gemm, gemv, symv, and trsm.
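
For readers unfamiliar with mixed-precision iterative refinement, the following is a minimal sketch of the general idea, not MAGMA's implementation or API: solve in single precision, compute the residual in double precision, and correct. Single precision is emulated by rounding through the standard struct module; all function names are hypothetical. A practical solver would factor the matrix once and reuse the factors, whereas this sketch repeats the elimination for brevity.

```python
import struct

def f32(v):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def solve_f32(A, b):
    """Gaussian elimination with partial pivoting, rounding each result to single."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(f * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def refine(A, b, iters=5):
    """Low-precision solves corrected by residuals computed in double precision."""
    x = solve_f32(A, b)
    for _ in range(iters):
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = solve_f32(A, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For a well-conditioned system, a few correction steps are typically enough to recover close to double-precision accuracy while the expensive elimination runs in single precision.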

See the MAGMA homepage for a download link.

Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark

November 16th, 2010


The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures by both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, the publication of studies claiming performance improvements in excess of 30,000x is misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration; studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare.

In this paper we present an analysis of a port of the NAS LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA) – the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices.

Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices to several processors including those from Intel, AMD and IBM.

In addition, we utilise a recently developed performance model of LU. With this we predict the runtime performance of LU on large-scale distributed GPU clusters, which are predicted to become commonplace in future high-end HPC architectural solutions.

(S.J. Pennycook, S.D. Hammond, S.A. Jarvis and G.R. Mudalige: “Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held as part of Supercomputing 2010 (SC’10), New Orleans, LA, USA. [PDF])