Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2x in the setup phase and around 6x in the cycling phase when compared to a representative CPU implementation.
(Nathan Bell, Steven Dalton and Luke Olson: “Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods”, NVIDIA Technical Report NVR-2011-002, June 2011 [PDF and Sources])
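The style of expressing solver components through flat data-parallel primitives, rather than per-row loops, can be illustrated in miniature. The sketch below is a hypothetical CPU-side analogue in Python (not the paper's implementation): a CSR sparse matrix-vector product written as a gather, an elementwise map, and a reduce-by-key, which is the kind of decomposition that maps well onto GPU primitive libraries.

```python
# Sketch: CSR sparse matrix-vector product y = A*x expressed with
# data-parallel primitives (gather + map + reduce-by-key) instead of
# an explicit per-row loop. Illustrative only; names are hypothetical.
from itertools import groupby

def spmv_csr_primitives(row_ptr, col_idx, values, x):
    n = len(row_ptr) - 1
    # "Gather": expand row indices so every nonzero carries its row.
    rows = [r for r in range(n) for _ in range(row_ptr[r + 1] - row_ptr[r])]
    # "Map": elementwise products values[k] * x[col_idx[k]].
    prods = [v * x[c] for v, c in zip(values, col_idx)]
    # "Reduce-by-key": sum products sharing the same row index.
    y = [0.0] * n
    for r, grp in groupby(zip(rows, prods), key=lambda t: t[0]):
        y[r] = sum(p for _, p in grp)
    return y

# 2x2 example: [[2, 1], [0, 3]] @ [1, 1] = [3, 3]
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
values = [2.0, 1.0, 3.0]
print(spmv_csr_primitives(row_ptr, col_idx, values, [1.0, 1.0]))  # [3.0, 3.0]
```

Because each primitive has an efficient massively parallel implementation, an algorithm built entirely out of them inherits that parallelism without any GPU-specific logic of its own.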
Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics while maintaining a moderate number of unknowns. However, flexibility in the meshes comes with severe drawbacks with respect to parallel execution, especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid, which offers high flexibility and adapts well to the demands of modern computing platforms.
In this work we investigate multi-colored Gauss-Seidel type smoothers, power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes, our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms, including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. The efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
(V. Heuveline, D. Lukarski, N. Trost and J.-P. Weiss: “Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs”, EMCL Preprint Series No. 9, 2011)
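The parallelism in a multi-colored Gauss-Seidel smoother comes from partitioning the unknowns so that no two coupled unknowns share a color: all rows of one color are then mutually independent and can be updated simultaneously. A minimal Python sketch (a sequential stand-in for the per-color parallel update; the greedy coloring here is a generic heuristic, not necessarily the scheme used in the paper):

```python
# Sketch: greedy multi-coloring of the matrix adjacency graph, then
# one colored Gauss-Seidel sweep over a dense matrix A. Within one
# color, rows are independent and could be updated in parallel.
def greedy_coloring(adj):
    """adj[i] = set of neighbors of unknown i (off-diagonal pattern)."""
    color = {}
    for i in sorted(adj):
        used = {color[j] for j in adj[i] if j in color}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return color

def colored_gauss_seidel_sweep(A, b, x, color):
    n = len(b)
    for c in sorted(set(color.values())):
        # All rows of color c are mutually independent -> parallel.
        for i in (k for k in range(n) if color[k] == c):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x
```

The sweep visits colors sequentially but each inner loop is free of data dependencies, which is exactly the structure a GPU kernel needs.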
We present a highly parallel implementation of the cross-correlation of time-series data using graphics processing units (GPUs), which is scalable to hundreds of independent inputs and suitable for the processing of signals from “Large-N” arrays of many radio antennas. The computational part of the algorithm, the X-engine, is implemented efficiently on Nvidia’s Fermi architecture, sustaining up to 79% of the peak single precision floating-point throughput. We compare performance obtained for hardware- and software-managed caches, observing significantly better performance for the latter. The high performance reported relies on a multi-level data tiling strategy in memory and on a pipelined algorithm with simultaneous computation and transfer of data from host to device memory. The speed of code development, flexibility, and low cost of the GPU implementations compared to ASIC and FPGA implementations have the potential to greatly shorten the cycle of correlator development and deployment, for cases where some power consumption penalty can be tolerated.
(M. A. Clark, P. C. La Plante, L. J. Greenhill: “Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units”, July 2011. [Preprint on ARXIV] [Sources on GITHUB])
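At its core, an X-engine accumulates, for every pair of antennas (i, j), the product of one signal with the complex conjugate of the other over an integration window. A hypothetical scalar Python reference of that kernel (illustrative only; the paper's GPU version tiles this computation to reuse loaded samples across many baselines):

```python
# Sketch: reference X-engine kernel. For N antenna signals, accumulate
# X[i][j] = sum_t s_i(t) * conj(s_j(t)) over the integration window.
# Only the upper triangle is stored, since X[j][i] = conj(X[i][j]).
def xengine(signals):
    n = len(signals)
    X = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            X[i][j] = sum(a * b.conjugate()
                          for a, b in zip(signals[i], signals[j]))
    return X
```

The work is O(N^2 T) for N antennas and T time samples, so the arithmetic intensity grows with N; this is why careful tiling (registers, shared memory, device memory) dominates the performance picture on the GPU.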
Functional magnetic resonance imaging (fMRI) makes it possible to non-invasively measure brain activity with high spatial resolution. There are, however, a number of issues that have to be addressed. One is the large amount of spatio-temporal data that needs to be processed. In addition to the statistical analysis itself, several preprocessing steps, such as slice timing correction and motion compensation, are normally applied. The high computational power of modern graphics cards has already successfully been used for MRI and fMRI. Going beyond the first published demonstration of GPU-based analysis of fMRI data, all the preprocessing steps and two statistical approaches, the general linear model (GLM) and canonical correlation analysis (CCA), have been implemented on a GPU. For an fMRI dataset of typical size (80 volumes with 64 x 64 x 22 voxels), all the preprocessing takes about 0.5 s on the GPU, compared to 5 s with an optimized CPU implementation and 120 s with the commonly used statistical parametric mapping (SPM) software. A random permutation test with 10 000 permutations, with smoothing in each permutation, takes about 50 s if three GPUs are used, compared to 0.5 – 2.5 h with an optimized CPU implementation. The presented work will save time for researchers and clinicians in their daily work and enable the use of more advanced analyses, such as non-parametric statistics, both for conventional fMRI and for real-time fMRI.
(Anders Eklund, Mats Andersson, Hans Knutsson: “fMRI Analysis on the GPU – Possibilities and Challenges”, Computer Methods and Programs in Biomedicine, 2011 [DOI])
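For context, the GLM step fits each voxel's time series against a design matrix and examines the regressor weights; every voxel solves the same small, independent problem, which is what makes the analysis embarrassingly parallel. A hypothetical single-voxel sketch with one task regressor plus an intercept (a simplification of a real fMRI design matrix):

```python
# Sketch: ordinary least-squares GLM fit for one voxel time series.
# Model: y = beta0 + beta1 * task + noise. With one regressor plus an
# intercept the normal equations reduce to simple linear regression.
# Every voxel is independent, hence the good fit to GPU parallelism.
def glm_fit(task, y):
    n = len(y)
    sx, sy = sum(task), sum(y)
    sxx = sum(t * t for t in task)
    sxy = sum(t * v for t, v in zip(task, y))
    beta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    beta0 = (sy - beta1 * sx) / n
    return beta0, beta1
```

On a GPU, one thread (or thread block) per voxel evaluates exactly this kind of small closed-form solve over tens of thousands of voxels at once.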
Parametric statistical methods based on Z-, t-, and F-values are traditionally employed in functional magnetic resonance imaging (fMRI) for identifying areas in the brain that are active with a certain degree of statistical significance. These parametric methods, however, have two major drawbacks. First, it is assumed that the observed data are Gaussian distributed and independent, assumptions that generally do not hold for fMRI data. Second, the statistical test distribution can be derived theoretically only for very simple linear detection statistics. With non-parametric statistical methods, both limitations can be overcome. The major drawback of non-parametric methods is their computational burden, with processing times ranging from hours to days, which has so far made them impractical for routine use in single-subject fMRI analysis. In this work, it is shown how the computational power of cost-efficient Graphics Processing Units (GPUs) can be used to speed up random permutation tests. A test with 10 000 permutations takes less than a minute, making statistical analysis with advanced detection methods in fMRI practically feasible. To exemplify the permutation-based approach, brain activity maps generated by the General Linear Model (GLM) and Canonical Correlation Analysis (CCA) are compared at the same significance level. During the development of the routines and the writing of the paper, 3-4 years of processing time were saved by using the GPU.
(Anders Eklund, Mats Andersson, Hans Knutsson: “Fast Random Permutation Tests Enable Objective Evaluation of Methods for Single Subject fMRI Analysis”, International Journal of Biomedical Imaging, Article ID 627947, 2011 [Youtube Video] [PDF])
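A random permutation test derives the null distribution empirically: the test statistic is recomputed under many random relabelings of the data, and the observed statistic is ranked against that distribution. A minimal sketch for a two-condition difference of means (a generic example, far simpler than the GLM/CCA statistics used in the paper):

```python
# Sketch: one-sided random permutation test for a difference of means.
# The GPU version parallelizes over permutations (and voxels); here
# the permutation loop is sequential for clarity.
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        stat = sum(pa) / len(pa) - sum(pb) / len(pb)
        if stat >= observed:
            count += 1
    return count / n_perm  # empirical one-sided p-value
```

Each permutation is independent of the others, so 10 000 permutations map naturally onto thousands of concurrent GPU threads, which is the source of the speedup the paper reports.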
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose, i.e. noisy, computed tomography (CT) data. The medical imaging domain has seen tremendous development during the last decades. It is now possible to collect time-resolved volumes, i.e. 4D data, with a number of modalities (e.g. ultrasound (US), CT, magnetic resonance imaging (MRI)). While 3D image denoising has previously been applied to several volumes independently, little work has been done on true 4D image denoising, where the algorithm considers several volumes at the same time (rather than a single volume at a time). By using all the dimensions, it is for example possible to remove some of the time-varying reconstruction artefacts that exist in CT volumes. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially with dimension. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset with a resolution of 512 x 512 x 445 x 20. The GPU completes the denoising in about 25 minutes with spatial filtering and in about 8 minutes with FFT-based filtering, whereas the CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT-based filtering. Fast spatial filtering makes it possible to apply the denoising algorithm to larger datasets than FFT-based filtering allows. The short processing time significantly increases the clinical value of true 4D image denoising.
(Anders Eklund, Mats Andersson, Hans Knutsson: “True 4D Image Denoising on the GPU”, International Journal of Biomedical Imaging, Article ID 952819, 2011 [Youtube Video] [PDF])
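The growth of filtering cost with dimension can be made concrete with a back-of-the-envelope count. Assuming a hypothetical filter of 11 taps per dimension (an illustrative size, not taken from the paper), direct non-separable 4D convolution of the stated 512 x 512 x 445 x 20 dataset costs four orders of magnitude more multiply-accumulates than four separable 1D passes:

```python
# Sketch: rough operation counts for filtering a 4D CT volume.
# The 11-tap-per-dimension filter size is an assumption for
# illustration only.
voxels = 512 * 512 * 445 * 20             # 2,333,081,600 samples
taps_per_dim = 11
full_4d_taps = taps_per_dim ** 4          # non-separable filter: 14641 taps
separable_taps = 4 * taps_per_dim         # four 1D passes: 44 taps

direct_macs = voxels * full_4d_taps       # ~3.4e13 multiply-adds
separable_macs = voxels * separable_taps  # ~1.0e11 multiply-adds
print(f"direct: {direct_macs:.2e}, separable: {separable_macs:.2e}")
```

Local adaptive filtering typically needs many non-separable, oriented filters per neighborhood, which is why the direct spatial variant is so expensive and why FFT-based filtering (cost roughly proportional to the volume size times its logarithm per transform) is the faster option at this filter size.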
In this work, we present an interactive visual clustering approach for the exploration and analysis of vast volumes of data. Our proposed approach is a bio-inspired collective behavioral model to be used in a 3D graphics environment. Our paper illustrates an extension of the behavioral model for clustering and a parallel implementation, using Compute Unified Device Architecture to exploit the computational power of Graphics Processor Units (GPUs). The advantage of our approach is that, as data enters the environment, the user is directly involved in the data mining process. Our experiments illustrate the effectiveness and efficiency provided by our approach when applied to a number of real and synthetic data sets.
(U. Erra, B. Frola, and V. Scarano: “A GPU-based Interactive Bio-inspired Visual Clustering”, Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining. Paris, France. April 11-15, 2011 [PDF] [Video])
Application demands and grand challenges in numerical simulation call for both highly capable computing platforms and efficient numerical solution schemes. Power constraints and the further miniaturization of modern and future hardware pave the way for multi- and manycore processors with increasingly fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In this way, parallelism can be exploited on the block level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.
We propose a new method for anticipating the fill-in pattern of ILU(p) schemes, which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix A subject to a predetermined pattern given by the matrix power |A|^(p+1) and its associated multi-coloring permutation π. We prove that the resulting sparsity pattern is a superset of the pattern of our modified ILU(p) factorization applied to πAπ^(-1). As a result, this modified ILU(p) applied to the multi-colored system matrix has no fill-ins in its diagonal blocks, which leads to an inherently parallel execution of the triangular ILU(p) sweeps.
In addition, we describe the integration of the preconditioners into the HiFlow3 open-source finite element package, which provides a portable software solution across diverse hardware platforms. On this basis, we conduct a performance analysis across a variety of test problems on multi-core CPUs and GPUs that demonstrates the efficiency, scalability and flexibility of our approach. Our preconditioners accelerate the solver by factors of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are up to 4x faster than an OpenMP-parallel version on eight cores.
(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: “Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs — The Power(q)-pattern Method”, EMCL Preprint Series, number 08, July 2011 [PDF])
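The structural property that makes these preconditioners parallel can be checked directly: after reordering the unknowns color by color, no two unknowns inside one color block are coupled, so each diagonal block contains only its own diagonal entries. A hypothetical sketch with a generic greedy coloring (the paper's actual coloring and permutation construction may differ):

```python
# Sketch: order unknowns by a greedy multi-coloring and verify that
# the resulting diagonal blocks are themselves diagonal, i.e. no edge
# of the matrix graph connects two unknowns of the same color.
def multicolor_permutation(adj):
    """adj[i] = set of neighbors of unknown i (off-diagonal pattern)."""
    color = {}
    for i in sorted(adj):
        used = {color[j] for j in adj[i] if j in color}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    # New ordering: all color-0 unknowns first, then color 1, etc.
    perm = sorted(adj, key=lambda i: (color[i], i))
    return perm, color

def diagonal_blocks_are_diagonal(adj, color):
    # Inside one color block, no two unknowns may be coupled.
    return all(color[i] != color[j] for i in adj for j in adj[i])
```

With this block structure, every triangular sweep of the factorization can process an entire block at once, which is the scalable block-level parallelism the abstract describes.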
A novel algorithm for solving a sparse triangular linear system in parallel on a graphics processing unit (GPU) is proposed. The solution proceeds in two phases. First, the analysis phase builds a dependency graph from the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels; the solution elements within a single level are computed at once, in parallel. Numerical experiments are also presented, showing that incomplete-LU and Cholesky preconditioned iterative methods using the parallel sparse triangular solve algorithm can achieve, on average, more than 2x speedup on GPUs over their CPU implementation.
(Maxim Naumov: “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”, NVIDIA Technical Report, June 2011. [WWW])
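The two phases map naturally onto a small reference implementation: the analysis phase assigns each row a level one greater than the deepest level it depends on, and the solve phase processes the levels in order, with all rows in a level solvable simultaneously. A sequential Python sketch for a lower triangular system (dense row storage for clarity; only the nonzero pattern matters for the level structure):

```python
# Sketch: level-scheduled solve of a lower triangular system L x = b.
# Analysis: rows with no unresolved dependencies form level 0, rows
# depending only on level 0 form level 1, and so on.
def build_levels(L):
    n = len(L)
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in range(i) if L[i][j] != 0]
        level[i] = 1 + max(deps) if deps else 0
    buckets = {}
    for i, lv in enumerate(level):
        buckets.setdefault(lv, []).append(i)
    return [buckets[lv] for lv in sorted(buckets)]

def level_solve(L, b):
    n = len(L)
    x = [0.0] * n
    for rows in build_levels(L):      # sequential across levels
        for i in rows:                # independent -> parallel on GPU
            s = sum(L[i][j] * x[j] for j in range(i) if L[i][j] != 0)
            x[i] = (b[i] - s) / L[i][i]
    return x
```

The available parallelism is bounded by the width of the levels, so matrices with short dependency chains (as is typical for incomplete factorizations) benefit most.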
This paper describes the approach and the speedup obtained in performing Smith-Waterman database searches on heterogeneous platforms comprising multi-core CPU and multi-GPU systems. Most advanced, optimized Smith-Waterman implementations, such as SWPS3 (based on x86 SSE2 instructions) and the CUDASW++ v2.0 CUDA implementation for GPUs, have demonstrated remarkable speedups over NCBI BLAST. This work proposes a hybrid Smith-Waterman algorithm that integrates these state-of-the-art CPU and GPU solutions, with the GPU acting as a co-processor that shares the workload with the CPU; simultaneous CPU-GPU execution yields a remarkable performance of over 70 GCUPS. In this approach, CPU and GPU are weighted equally for Smith-Waterman computation, rather than porting only the computationally intensive portions onto the GPU or relying on a naive multi-core CPU approach.
(J. Singh and I. Aruni: “Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems”, Proceedings of Bioinformatics and Biomedical Engineering (iCBBE), May 2011. [DOI])
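For reference, the Smith-Waterman recurrence fills a scoring matrix in which each cell takes the best of a match/mismatch extension, a gap in either sequence, or zero; the maximum cell value is the local alignment score. A minimal sketch with a linear gap penalty (the scoring parameters are illustrative, not those benchmarked in the paper):

```python
# Sketch: Smith-Waterman local alignment score with a linear gap
# penalty, using two rolling rows. SSE and GPU implementations
# parallelize along the anti-diagonals, whose cells are independent.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,      # gap in sequence b
                        cur[j - 1] + gap)   # gap in sequence a
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

Throughput is commonly reported in GCUPS (billions of cell updates per second), i.e. how many of these inner-loop cells are evaluated per second across the whole database search.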