January 6th, 2013
December 3rd, 2012
Spatial stochastic simulation is a valuable technique for studying reactions in biological systems. With the availability of high-performance computing (HPC), the method is poised to allow integration of data from structural, single-molecule and biochemical studies into coherent computational models of cells. Here, we introduce the Lattice Microbes software package for simulating such cell models on HPC systems. The software performs either well-stirred or spatially resolved stochastic simulations with approximated cytoplasmic crowding in a fast and efficient manner. Our new algorithm efficiently samples the reaction-diffusion master equation using NVIDIA graphics processing units and is shown to be two orders of magnitude faster than exact sampling for large systems while maintaining an accuracy of ∼0.1%. Display of cell models and animation of reaction trajectories involving millions of molecules is facilitated using a plug-in to the popular VMD visualization platform. The Lattice Microbes software is open source and available for download at http://www.scs.illinois.edu/schulten/lm
(Elijah Roberts, John E. Stone and Zaida Luthey-Schulten: “Lattice Microbes: High-Performance Stochastic Simulation Method for the Reaction-Diffusion Master Equation”, Journal of Computational Chemistry, 34:245-255, 2013. [DOI])
November 14th, 2012
Nonbinary Low-Density Parity-Check (LDPC) codes are a class of error-correcting codes constructed over the Galois field GF(q) for q > 2. As extensions of binary LDPC codes, nonbinary LDPC codes can provide better error-correcting performance when the code length is short or moderate, but at a cost of higher decoding complexity. This paper proposes a massively parallel implementation of a nonbinary LDPC decoding accelerator based on a graphics processing unit (GPU) to achieve both great flexibility and scalability. The implementation maps the Min-Max decoding algorithm to GPU’s massively parallel architecture. We highlight the methodology to partition the decoding task to a heterogeneous platform consisting of the CPU and GPU. The experimental results show that our GPUbased implementation can achieve high throughput while still providing great flexibility and scalability.
(Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, and Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012. [PDF])
November 10th, 2012
SPECFEM3D is a widely used community code which simulates seismic wave propagation in earth-science applications. It can be run either on multi-core CPUs only or together with many-core GPU devices on large GPU clusters. The new implementation is optimally fine-tuned and achieves excellent performance results. Mesh coloring enables an efficient accumulation of border nodes in the assembly process over an unstructured mesh on the GPU and asynchronous GPU-CPU memory transfers and non-blocking MPI are used to overlap communication and computation, effectively hiding synchronizations. To demonstrate the performance of the inversion, we present two case studies run on the Cray XE6 and XK6 architectures up to 896 nodes: (1) focusing on most commonly used forward simulations, we simulate wave propagation generated by earthquakes in Turkey, and (2) testing the most complex simulation type of the package, we use ambient seismic noise to image 3D crust and mantle structure beneath western Europe.
(Max Rietmann, Peter Messmer, Tarje Nissen-Meyer, Daniel Peter, Piero Basini, Dimitri Komatitsch, Olaf Schenk, Jeroen Tromp, Lapo Boschi and Domenico Giardini, “Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures”, Proceedings of the 2012 ACM/IEEE conference on Supercomputing, Nov. 2012. [WWW])
November 8th, 2012
This paper addresses the design, implementation and validation of an effective scheduling scheme for both regular and irregular applications on heterogeneous platforms. The scheduler uses an empirical performance model to dynamically schedule the workload, organized into a given number of chunks, and follows the Heterogeneous Earliest Finish Time (HEFT) scheduling algorithm, which ranks the tasks based on both their computation and communication costs. The evaluation of the proposed approach is based on three case studies – the SAXPY, the FFT and the Barnes-Hut algorithms – two regular and one irregular application. The scheduler was evaluated on a heterogeneous platform with one quad-core CPU-chip accelerated by one or two GPU devices, embedded in the GAMA framework. The evaluation runs measured the effectiveness, the efficiency and the scalability of the proposed method. Results show that the proposed model was effective in addressing both regular and irregular applications, on heterogeneous platforms, while achieving ideal (>=100%) levels of efficiency in the irregular Barnes-Hut algorithm.
(Artur Mariano, Ricardo Alves, Joao Barbosa, Luis Paulo Santos and Alberto Proenca: “A (ir)regularity-aware task scheduler for heterogeneous platforms”, Proceedings of the 2nd International Conference on High Performance Computing, Kiev, October 2012, pp 45-56,. [PDF])
October 30th, 2012
We describe an extension of our graphics processing unit (GPU) electronic structure program TeraChem to include atom-centered Gaussian basis sets with d angular momentum functions. This was made possible by a “meta-programming” strategy that leverages computer algebra systems for the derivation of equations and their transformation to correct code. We generate a multitude of code fragments that are formally mathematically equivalent, but differ in their memory and floating-point operation footprints. We then select between different code fragments using empirical testing to find the highest performing code variant. This leads to an optimal balance of floating-point operations and memory bandwidth for a given target architecture without laborious manual tuning. We show that this approach is capable of similar performance compared to our hand-tuned GPU kernels for basis sets with s and p angular momenta. We also demonstrate that mixed precision schemes (using both single and double precision) remain stable and accurate for molecules with d functions. We provide benchmarks of the execution time of entire self-consistent field (SCF) calculations using our GPU code and compare to mature CPU based codes, showing the benefits of the GPU architecture for electronic structure theory with appropriately redesigned algorithms. We suggest that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.
(Alexey V Titov , Ivan S. Ufimtsev , Nathan Luehr and Todd J. Martínez: “Generating Efficient Quantum Chemistry Codes for Novel Architectures”, accepted for publication in the Journal of Chemical Theory and Computation, 2012. [DOI])
October 24th, 2012
We present new format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on set of 1 600 matrices and we show for what types of matrices our format is profitable.
(Heller M., Oberhuber T.: “Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU”, Proceedings of Algoritmy 2012, 2012, Handlovičová A., Minarechová Z. and Ševčovič D. (ed.), pages 282-290, ISBN 978-80-227-3742-5) [ARXIV preprint]
October 22nd, 2012
This article proposes to address, in a tutorial style, the benefits of using Open Computing Language (OpenCL) as a quick way to allow programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we will show how multiplatform kernels can be developed straightforwardly using OpenCL to perform computationally intensive low-density parity-check (LDPC) decoding, targeting them to run on a large set of worldwide disseminated multicore architectures, such as x86 general- purpose multicore central processing units (CPUs) and graphics processing units (GPUs). Moreover, devices with different architectures can be orchestrated to cooperatively execute these signal processing applications programmed in OpenCL. Experimental evaluation of the parallel kernels programmed with the OpenCL framework shows that high-performance can be achieved for distinct parallel computing architectures with low programming effort.
The complete source code developed and instructions for compiling and executing the program are available at http://www.co.it.pt/ldpcopencl for signal processing programmers who wish to engage with more advanced features supported by OpenCL.
(G. Falcao, V. Silva, L. Sousa and J. Andrade: “Portable LDPC Decoding on Multicores Using OpenCL [Applications Corner]”, IEEE Signal Processing Magazine 29:4(81-109), July 2012. [DOI])
September 4th, 2012
Accelerating numerical algorithms for solving sparse linear systems on parallel architectures has attracted the attention of many researchers due to their applicability to many engineering and scientific problems. The solution of sparse systems often dominates the overall execution time of such problems and is mainly solved by iterative methods. Preconditioners are used to accelerate the convergence rate of these solvers and reduce the total execution time. Sparse Approximate Inverse (SAI) preconditioners are a popular class of preconditioners designed to improve the condition number of large sparse matrices and accelerate the convergence rate of iterative solvers for sparse linear systems. We propose a GPU accelerated SAI preconditioning technique called GSAI, which parallelizes the computation of this preconditioner on NVIDIA graphic cards. The preconditioner is then used to enhance the convergence rate of the BiConjugate Gradient Stabilized (BiCGStab) iterative solver on the GPU. The SAI preconditioner is generated on average 28 and 23 times faster on the NVIDIA GTX480 and TESLA M2070 graphic cards respectively compared to ParaSails (a popular implementation of SAI preconditioners on CPU) single processor/core results. The proposed GSAI technique computes the SAI preconditioner in approximately the same time as ParaSails generates the same preconditioner on 16 AMD Opteron 252 processors.
(Maryam Mehri Dehnavi, David Fernandez, Jean-Luc Gaudiot and Dennis Giannacopoulos: “Parallel Sparse Approximate Inverse Preconditioning on Graphic Processing Units”, IEEE Transactions on Parallel and Distributed Systems (to appear). [DOI])
August 1st, 2012
In this work, we evaluate OpenCL as aprogramming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers’ optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels’ parameter space using search harness.
(Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra, “From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming”, Parallel Computing 38(8):391–407, Aug. 2012. [DOI] [early techreport])
We present an efficient algorithm for computation of surface representations enabling interactive visualization of large dynamic particle data sets. Our method is based on a GPU-accelerated data-parallel algorithm for computing a volumetric density map from Gaussian weighted particles. The algorithm extracts an isovalue surface from the computed density map, using fast GPU-accelerated Marching Cubes. This approach enables interactive frame rates for molecular dynamics simulations consisting of millions of atoms. The user can interactively adjust the display of structural detail on a continuous scale, ranging from atomic detail for in-depth analysis, to reduced detail visual representations suitable for viewing the overall architecture of molecular complexes. The extracted surface is useful for interactive visualization, and provides a basis for structure analysis methods.
(Michael Krone, John E. Stone, Thomas Ertl, and Klaus Schulten, “Fast visualization of Gaussian density surfaces for molecular dynamics and particle system trajectories”, In EuroVis – Short Papers 2012, pp. 67-71, 2012. [WWW])