Extending MPI to Accelerators

October 19th, 2011

A paper detailing several possible avenues to expand MPI to accelerators has just been presented at “Architectures and System for Big Data (ASBD) 2011″, a workshop at PACT 2011. The abstract and a link to the paper are both below. We (the authors) are looking for feedback as to which options seem attractive to GPU programmers and developers. We welcome any comments/thoughts/critiques you might have.

Current trends in computing and system architecture point towards a need for accelerators such as GPUs to have inherent communication capabilities. We review previous and current software libraries that provide pseudo-communication abilities through direct message passing. We show how these libraries are beneficial to the HPC community, but are not forward-thinking enough. We give motivation as to why MPI should be extended to support these accelerators, and provide a road map of achievable milestones to complete such an extension, some of which require advances in hardware and device drivers.

(Jeff A. Stuart, Pavan Balaji and John D. Owens, “Extending MPI to Accelerators”, PACT 2011 Workshop Series: Architectures and Systems for Big Data, October 2011. [WWW])

Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs

October 2nd, 2011

Abstract:

In this paper, we propose a fine-grained cycle sharing (FGCS) system capable of exploiting idle graphics processing units (GPUs) for accelerating sequence homology search in local area network environments. Our system exploits short idle periods on GPUs by running small parts of guest programs such that each part can be completed within hundreds of milliseconds. To detect such short idle periods from the pool of registered resources, our system continuously monitors keyboard and mouse activities via event handlers rather than waiting for a screensaver, as is typically deployed in existing systems. Our system also divides guest tasks into small parts according to a performance model that estimates execution times of the parts. This task division strategy minimizes any disruption to the owners of the GPU resources. Experimental results show that our FGCS system running on two non-dedicated GPUs achieves 111-116% of the throughput achieved by a single dedicated GPU. Furthermore, our system provides over two times the throughput of a screensaver-based system. We also show that the idle periods detected by our system constitute half of the system uptime. We believe that the GPUs hidden and often unused in office environments provide a powerful solution to sequence homology search.

(Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara, “Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs”, accepted for publication in IEEE Transactions on Parallel and Distributed Systems, Sep. 2011. [DOI])

Thrust: A Productivity-Oriented Library for CUDA

September 12th, 2011

Abstract:

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

(Nathan Bell and Jared Hoberock: “Thrust: A Productivity-Oriented Library for CUDA”, GPU Computing Gems, Jade Edition, edited by Wen-mei W. Hwu, October 2011)

Non negative least squares on GPU/multicore architectures

September 4th, 2011

Abstract:

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and {\em downdate} the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.

(Yuancheng Luo and Ramani Duraiswami, “Efficient Parallel Non-Negative Least Squares on Multicore Architectures”, SIAM Journal on Scientific Computing, accepted, Sep. 2011. [PDF] [Source code])

GPU Implementation of a Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method

September 2nd, 2011

Abstract:

A Helmholtz equation in two dimensions discretized by a second order finite difference scheme is considered. Krylov methods such as Bi-CGSTAB and IDR(s) have been chosen as solvers. Since the convergence of the Krylov solvers deteriorates with increasing wave number, a shifted Laplace multigrid preconditioner is used to improve the convergence. The implementation of the preconditioned solver on CPU (Central Processing Unit) is compared to an implementation on GPU (Graphics Processing Units or graphics card) using CUDA (Compute Unified Device Architecture). The results show that preconditioned Bi-CGSTAB on GPU as well as preconditioned IDR(s) on GPU is about 30 times faster than on CPU for the same stopping criterion.

(H. Knibbe, C.W. Oosterlee and C. Vuik, “GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method”, accepted for publication in the Journal of Computational and Applied Mathematics, 2011. [DOI])

Fast Hough Transform on GPUs: Exploration of Algorithm Trade-offs

August 29th, 2011

Abstract:

The Hough transform is a commonly used algorithm to detect lines and other features in images. It is robust to noise and occlusion, but has a large computational cost. This paper introduces two new implementations of the Hough transform for lines on a GPU. One focuses on minimizing processing time, while the other has an input-data independent processing time. Our results show that optimizing the GPU code for speed can achieve a speed-up over naive GPU code of about 10x. The implementation which focuses on processing speed is the faster one for most images, but the implementation which achieves a constant processing time is quicker for about 20% of the images.

(Gert-Jan van den Braak, Cedric Nugteren, Bart Mesman and Henk Corporaal: “Fast Hough Transform on GPUs: Exploration of Algorithm Trade-offs”. In: Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, Vol. 6915, pp.611-622, 2011. [DOI])

HPG11 papers available

August 27th, 2011

All papers and presentations from High Performance Graphics 2011 are now available online, including the keynote presentations and the Hot3D track.

HPG11 was held in Vancouver earlier this month.

Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units

August 20th, 2011

Abstract:

Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton’s equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermalisobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two different case studies are included to demonstrate the performance attained, patchy spheres and tethered nanorods. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously even possible on large clusters.

(Trung Dac Nguyen, Carolyn L. Phillips, Joshua A. Anderson, and Sharon C. Glotzer: “Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units”, Computer Physics Communications 182(11):2307–2313, November 2011. [DOI])

Pseudo-random number generation for Brownian Dynamics and Dissipative Particle Dynamics simulations on GPU devices

August 20th, 2011

Abstract:

Brownian Dynamics (BD), also known as Langevin Dynamics, and Dissipative Particle Dynamics (DPD) are implicit solvent methods commonly used in models of soft matter and biomolecular systems. The interaction of the numerous solvent particles with larger particles is coarse-grained as a Langevin thermostat is applied to individual particles or to particle pairs. The Langevin thermostat requires a pseudo-random number generator (PRNG) to generate the stochastic force applied to each particle or pair of neighboring particles during each time step in the integration of Newton’s equations of motion. In a Single-Instruction-Multiple-Thread (SIMT) GPU parallel computing environment, small batches of random numbers must be generated over thousands of threads and millions of kernel calls. In this communication we introduce a one-PRNG-per-kernel-call-per-thread scheme, in which a micro-stream of pseudorandom numbers is generated in each thread and kernel call. These high quality, statistically robust micro-streams require no global memory for state storage, are more computationally efficient than other PRNG schemes in memory-bound kernels, and uniquely enable the DPD simulation method without requiring communication between threads.

(Carolyn L. Phillips, Joshua A. Anderson and Sharon C. Glotzer: “Dynamics and Dissipative Particle Dynamics simulations on GPU devices”, Journal of Computational Physics 230(19):7191-7201, August 2011. [DOI])

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

August 19th, 2011

Abstract:

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision brings up to a 7x speed up compared to the (latest) CUBLAS 4.0 NVIDIA library on the GTX 280 GPU. Our SYMV kernel tuned for Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

(R. Nath, S. Tomov, T. Dong, and J. Dongarra, “Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs”, accepted for SC’11.  [WWW] [PDF])

Page 2 of 3112345...102030...Last »