Policy-based Tuning for Performance Portability and Library Co-optimization

July 22nd, 2012


Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.

(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])

Optimization of a Broadband Discone Antenna Design and Platform Installed Radiation Patterns Using a GPU-Accelerated Savant/WIPL-D Hybrid Approach

July 20th, 2012


Traditional design guidelines for broadband antennas do not always produce satisfactory performance for the desired frequency range of interest. In addition, the accurate prediction of the free-space antenna performance is not sufficient to determine if the antenna will meet a larger system requirement because the performance of the antenna can change significantly when it is installed on a platform. Antenna design software, such as WIPL-D, addresses the difficulties of designing antennas with broadband performance by providing optimization software that can automatically resize the various antenna dimensions until a desired performance criterion is met. At high-frequencies, the electrically large size of the platform makes it computationally difficult, or impossible, to directly consider the interactions between the antenna and the platform when designing the antenna in a full-wave solver. This paper describes an approach for the design and optimization of a discone antenna and then the subsequent installation on a large commercial aircraft. The antenna design will be optimized across a wide frequency range using WIPL-D Optimizer. The resulting discone antenna design is then imported into Savant-Hybrid, a hybrid asymptotic and full-wave solver, and the installed antenna performance is simulated using GPU acceleration at multiple potential antenna locations to determine the location that provides the least-degraded installed antenna performance.

(Tod Courtney, Matthew C. Miller, John E. Stone, and Robert A. Kipp: “Optimization of a Broadband Discone Antenna Design and Platform Installed Radiation Patterns Using a GPU-Accelerated Savant/WIPL-D Hybrid Approach”, Proceedings of the Applied Computational Electromagnetics Symposium
(ACES 2012), Columbus, Ohio, April 2012. [PDF])

Implementing a code generator for fast matrix multiplication in OpenCL on the GPU

July 11th, 2012


This paper presents results of an implementation of code generator for fast general matrix multiply (GEMM) kernels. When a set of parameters is given, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance implementation on GPUs from AMD. Access latencies to GPU global memory is the main drawback for high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve the performance up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.

(K. Matsumoto, N. Nakasato, S. G. Sedukhin: “Implementing a code generator for fast matrix multiplication in OpenCL on the GPU”, accepted for Special Session: Auto-Tuning for Multicore and GPU (ATMG), IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), Sep. 2012. [PDF])

A Study of Persistent Threads Style GPU Programming for GPGPU Workloads

July 4th, 2012


In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases it could be useful in addressing— CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

(Kshitij Gupta, Jeff A. Stuart and John D. Owens: “A Study of Persistent Threads Style GPU Programming for GPGPU Workloads”, Proceedings of Innovative Parallel Computing, May 2012. [WWW])

Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach

June 22nd, 2012


A wide range of applications in engineering and scientific computing are involved in the acceleration of the sparse matrix vector product (SpMV). Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for GPUs have already appeared on the scene. This work is focused on the ELLR-T algorithm to compute SpMV on GPU architecture, its performance is strongly dependent on the optimum selection of two parameters. Therefore, taking account that the memory operations dominate the performance of ELLR-T, an analytical model is proposed in order to obtain the auto-tuning of ELLR-T for particular combinations of sparse matrix and GPU architecture. The evaluation results with a representative set of test matrices show that the average performance achieved by auto-tuned ELLR-T by means of the proposed model is near to the optimum. A comparative analysis of ELLR-T against a variety of previous proposals shows that ELLR-T with the estimated configuration reaches the best performance on GPU architecture for the representative set of test matrices.

(Francisco Vázquez and José Jesús Fernández and Ester M. Garzón: “Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach”, Parallel Computing 38(8), 408-420, Aug. 2012. [DOI])

Frequent Items Mining Acceleration Exploiting Fast Parallel Sorting on the GPU

June 20th, 2012


In this paper, we show how to employ Graphics Processing Units (GPUs) to provide an efficient and high performance solution for finding frequent items in data streams. We discuss several design alternatives and present an implementation that exploits the great capability of graphics processors in parallel sorting. We provide an exhaustive evaluation of performances, quality results and several design trade-offs. On an off-the-shelf GPU, the fastest of our implementations can process over 200 million items per second, which is better than the best known solution based on Field Programmable Gate Arrays (FPGAs) and CPUs. Moreover, in previous approaches, performances are directly related to the skewness of the input data distribution, while in our approach, the high throughput is independent from this factor.

(Ugo Erra, Bernardino Frola: “Frequent Items Mining Acceleration Exploiting Fast Parallel Sorting on the GPU”, Procedia Computer Science 9, pp 86-95 (Proceedings of the International Conference on Computational Science), 2012. [DOI])

GPU Accelerated Multi-agent Path Planning Based on Grid Space Decomposition

June 20th, 2012


In this work, we describe a simple and powerful method to implement real-time multi-agent path-finding on Graphics Processor Units (GPUs). The technique aims to find potential paths for many thousands of agents, using the A* algorithm and an input grid map partitioned into blocks. We propose an implementation for the GPU that uses a search space decomposition approach to break down the forward search A* algorithm into parallel independently forward sub-searches. We show that this approach fits well with the programming model of GPUs, enabling planning for many thousands of agents in parallel in real-time applications such as computer games and robotics. The paper describes this implementation using the Compute Unified Device Architecture programming environment, and demonstrates its advantages in GPU performance compared to GPU implementation of Real-Time Adaptive A*.

(Giuseppe Caggianese , Ugo Erra: “GPU Accelerated Multi-agent Path Planning Based on Grid Space Decomposition”, Procedia Computer Science 9, pp 1847-1856 (Proceedings of the International Conference on Computational Science), 2012. [DOI])

Parallel Incomplete-LU and Cholesky Factorization

June 13th, 2012


A novel algorithm for computing the incomplete-LU and Cholesky factorization with 0 fill-in on a graphics processing unit (GPU) is proposed. It implements the incomplete factorization of the given matrix in two phases. First, the symbolic analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the numerical factorization phase obtains the resulting lower and upper sparse triangular factors by iterating sequentially across the constructed levels. The Gaussian elimination of the elements below the main diagonal in the rows corresponding to each single level is performed in parallel. The numerical experiments are also presented and it is shown that the numerical factorization phase can achieve on average more than 2.8x speedup over MKL, while the incomplete-LU and Cholesky preconditioned iterative methods can achieve an average of 2x speedup on GPU over their CPU implementation.

(Maxim Naumov. Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU, NVIDIA Technical Report NVR-2012-003. May 2012.)

Relational Algorithms for Multi-Bulk-Synchronous Processors

June 6th, 2012

This publication describes efficient low level algorithms for performing relational queries on parallel processors, such as NVIDIA Fermi or Kepler. Relations are stored in GPU memory as sorted arrays of tuples, and manipulated by relational operators that preserve the sorted property. Most significantly, this work introduces algorithms for JOIN and SET INTERSECTION/UNION/DIFFERENCE that can process data at over 50 GB/s.


Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations. Examples include general purpose graphics processing units (GPUs) such as NVIDIA’s Fermi, Intels Sandy Bridge, and AMD’s Fusion processors. This paper explores the mapping of primitive relational algebra operations onto GPUs. In particular, we focus on algorithms and data structure design identifying a fundamental conflict between the structure of algorithms with good computational complexity and that of algorithms with memory access patterns and instruction schedules that achieve peak machine utilization. To reconcile this conflict, our design space exploration converges on a hybrid multi-stage algorithm that devotes a small amount of the total runtime to prune input data sets using an irregular algorithm with good computational complexity. The partial results are then fed into a regular algorithm that achieves near peak machine utilization. The design process leading to the most efficient algorithm for each stage is described, detailing alternative implementations, their performance characteristics, and an explanation of why they were ultimately abandoned. The least efficient algorithm (JOIN) achieves 57% − 72% of peak machine performance depending on the density of the input. The most efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve 86% − 92% of peak machine performance across all input data sets. To the best of our knowledge, these represent the best known published results to date for any implementations. This work lays the foundation for the development of a relational database system that achieves good scalability on a Multi-Bulk-Synchronous-Parallel (M-BSP) processor architecture. Additionally, the algorithm design may offer insights into the design of parallel and distributed relational database systems. It leaves the problems of query planning, operator→query synthesis, corner case optimization, and system/OS interaction as future work that would be necessary for commercial deployment.

(Gregory Diamos, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili: “Relational Algorithms for Multi-Bulk-Synchronous Processors “, NVIDIA Tech Report, June 2012. [WWW]) 

CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform

May 11th, 2012


Motivation: New high-throughput sequencing technologies have promoted the production of short reads with dramatically low unit cost. The explosive growth of short read datasets poses a challenge to the mapping of short reads to reference genomes, such as the human genome, in terms of alignment quality and execution speed.

Results: We present CUSHAW, a parallelized short read aligner based on the compute unified device architecture (CUDA) parallel programming model. We exploit CUDA-compatible graphics hardware as accelerators to achieve fast speed. Our algorithm employs a quality-aware bounded search approach based on the Burrows- Wheeler transform (BWT) and the Ferragina Manzini (FM)-index to reduce the search space and achieve high alignment quality. Performance evaluation, using simulated as well as real short read datasets, reveals that our algorithm running on one or two graphics processing units (GPUs) achieves significant speedups in terms of execution time, while yielding comparable or even better alignment quality for paired-end alignments compared to three popular BWT-based aligners: Bowtie, BWA and SOAP2. CUSHAW also delivers competitive performance in terms of SNP calling for an E.coli test dataset.

Availability: http://cushaw.sourceforge.net.

(Y. Liu, B. Schmidt, D. Maskell: “CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform”, Bioinformatics, 2012. [DOI])

Page 6 of 38« First...45678...2030...Last »