January 11th, 2013
January 6th, 2013
We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.
(Sebastian Schaetz and Martin Uecker: “A Multi-GPU Programming Library for Real-Time Applications”, Algorithms and Architectures for Parallel Processing (2012): 114-128. [DOI] [ARXIV])
January 6th, 2013
This paper presents an efficient technique for fast generation of sparse systems of linear equations arising in computational electromagnetics in a finite element method using higher order elements. The proposed approach employs a graphics processing unit (GPU) for both numerical integration and matrix assembly. The performance results obtained on a test platform consisting of a Fermi GPU (1x Tesla C2075) and a CPU (2x twelve-core Opterons), indicate that the GPU implementation of the matrix generation allows one to achieve speedups by a factor of 81 and 19 over the optimized single-and multi-threaded CPU-only implementations, respectively.
(Adam Dziekonski et al., “Finite Element Matrix Generation on a GPU”, Progress In Electromagnetics Research 128:249-265, 2012. [PDF])
December 21st, 2012
Spatial stochastic simulation is a valuable technique for studying reactions in biological systems. With the availability of high-performance computing (HPC), the method is poised to allow integration of data from structural, single-molecule and biochemical studies into coherent computational models of cells. Here, we introduce the Lattice Microbes software package for simulating such cell models on HPC systems. The software performs either well-stirred or spatially resolved stochastic simulations with approximated cytoplasmic crowding in a fast and efficient manner. Our new algorithm efficiently samples the reaction-diffusion master equation using NVIDIA graphics processing units and is shown to be two orders of magnitude faster than exact sampling for large systems while maintaining an accuracy of ∼0.1%. Display of cell models and animation of reaction trajectories involving millions of molecules is facilitated using a plug-in to the popular VMD visualization platform. The Lattice Microbes software is open source and available for download at http://www.scs.illinois.edu/schulten/lm
(Elijah Roberts, John E. Stone and Zaida Luthey-Schulten: “Lattice Microbes: High-Performance Stochastic Simulation Method for the Reaction-Diffusion Master Equation”, Journal of Computational Chemistry, 34:245-255, 2013. [DOI])
December 18th, 2012
amgcl is a simple and generic algebraic multigrid (AMG) hierarchy builder. Supported coarsening methods are classical Ruge-Stuben coarsening, and either plain or smoothed aggregation. The constructed hierarchy is stored and used with help of one of the supported backends including VexCL, ViennaCL, and CUSPARSE/Thrust.
With help of amgcl, solution of a large sparse system of linear equations may be easily accelerated through OpenCL, CUDA, or OpenMP technologies. Source code of the library is publicly available under MIT license at https://github.com/ddemidov/amgcl.
December 17th, 2012
rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows to execute existing GPU-accelerated applications by leveraging remote GPUs within a cluster (both via sharing and/or aggregating GPUs) with a negligible overhead. The new version is available free of charge ar www.rCUDA.net, along with examples, manuals and additional information.
November 14th, 2012
Alea.cuBase allows to create GPU accelerated applications at all levels of sophistication, from simple GPU kernels up to complex GPU algorithms using textures, shared memory and other advanced GPU programming techniques, fully integrated into .NET. The GPU kernels are developed in functional language F# and are callable from any other .NET language. No additional wrappers or assembly translation processes are required. Alea.cuBase allows dynamic creation of GPU code at run time, thereby opening completely new dimensions for GPU accelerated applications. Trial versions are available at http://www.quantalea.net/products.
November 10th, 2012
SPECFEM3D is a widely used community code which simulates seismic wave propagation in earth-science applications. It can be run either on multi-core CPUs only or together with many-core GPU devices on large GPU clusters. The new implementation is optimally fine-tuned and achieves excellent performance results. Mesh coloring enables an efficient accumulation of border nodes in the assembly process over an unstructured mesh on the GPU and asynchronous GPU-CPU memory transfers and non-blocking MPI are used to overlap communication and computation, effectively hiding synchronizations. To demonstrate the performance of the inversion, we present two case studies run on the Cray XE6 and XK6 architectures up to 896 nodes: (1) focusing on most commonly used forward simulations, we simulate wave propagation generated by earthquakes in Turkey, and (2) testing the most complex simulation type of the package, we use ambient seismic noise to image 3D crust and mantle structure beneath western Europe.
(Max Rietmann, Peter Messmer, Tarje Nissen-Meyer, Daniel Peter, Piero Basini, Dimitri Komatitsch, Olaf Schenk, Jeroen Tromp, Lapo Boschi and Domenico Giardini, “Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures”, Proceedings of the 2012 ACM/IEEE conference on Supercomputing, Nov. 2012. [WWW])
November 8th, 2012
This paper addresses the design, implementation and validation of an effective scheduling scheme for both regular and irregular applications on heterogeneous platforms. The scheduler uses an empirical performance model to dynamically schedule the workload, organized into a given number of chunks, and follows the Heterogeneous Earliest Finish Time (HEFT) scheduling algorithm, which ranks the tasks based on both their computation and communication costs. The evaluation of the proposed approach is based on three case studies – the SAXPY, the FFT and the Barnes-Hut algorithms – two regular and one irregular application. The scheduler was evaluated on a heterogeneous platform with one quad-core CPU-chip accelerated by one or two GPU devices, embedded in the GAMA framework. The evaluation runs measured the effectiveness, the efficiency and the scalability of the proposed method. Results show that the proposed model was effective in addressing both regular and irregular applications, on heterogeneous platforms, while achieving ideal (>=100%) levels of efficiency in the irregular Barnes-Hut algorithm.
(Artur Mariano, Ricardo Alves, Joao Barbosa, Luis Paulo Santos and Alberto Proenca: “A (ir)regularity-aware task scheduler for heterogeneous platforms”, Proceedings of the 2nd International Conference on High Performance Computing, Kiev, October 2012, pp 45-56,. [PDF])
October 30th, 2012
We describe an extension of our graphics processing unit (GPU) electronic structure program TeraChem to include atom-centered Gaussian basis sets with d angular momentum functions. This was made possible by a “meta-programming” strategy that leverages computer algebra systems for the derivation of equations and their transformation to correct code. We generate a multitude of code fragments that are formally mathematically equivalent, but differ in their memory and floating-point operation footprints. We then select between different code fragments using empirical testing to find the highest performing code variant. This leads to an optimal balance of floating-point operations and memory bandwidth for a given target architecture without laborious manual tuning. We show that this approach is capable of similar performance compared to our hand-tuned GPU kernels for basis sets with s and p angular momenta. We also demonstrate that mixed precision schemes (using both single and double precision) remain stable and accurate for molecules with d functions. We provide benchmarks of the execution time of entire self-consistent field (SCF) calculations using our GPU code and compare to mature CPU based codes, showing the benefits of the GPU architecture for electronic structure theory with appropriately redesigned algorithms. We suggest that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.
(Alexey V Titov , Ivan S. Ufimtsev , Nathan Luehr and Todd J. Martínez: “Generating Efficient Quantum Chemistry Codes for Novel Architectures”, accepted for publication in the Journal of Chemical Theory and Computation, 2012. [DOI])
We present new format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on set of 1 600 matrices and we show for what types of matrices our format is profitable.
(Heller M., Oberhuber T.: “Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU”, Proceedings of Algoritmy 2012, 2012, Handlovičová A., Minarechová Z. and Ševčovič D. (ed.), pages 282-290, ISBN 978-80-227-3742-5) [ARXIV preprint]