Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2x in the setup phase and around 6x in the cycling phase when compared to a representative CPU implementation.
(Nathan Bell, Steven Dalton and Luke Olson: “Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods”, NVIDIA Technical Report NVR-2011-002, June 2011 [PDF and Sources])
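To make the phrase "expressed in terms of scalable parallel primitives" concrete, here is a minimal sketch of how a sparse matrix-vector product might be written as a chain of gather, transform, and reduce-by-key steps. It is plain sequential Python for illustration only, not the authors' GPU code. The function names `reduce_by_key` and `spmv_coo` and the coordinate (COO) storage layout are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by_key(keys, values):
    """Sum runs of consecutive values that share a key (keys assumed sorted)."""
    out_keys, out_vals = [], []
    for key, group in groupby(zip(keys, values), key=itemgetter(0)):
        out_keys.append(key)
        out_vals.append(sum(v for _, v in group))
    return out_keys, out_vals


def spmv_coo(n_rows, rows, cols, vals, x):
    """Compute y = A x for A stored in COO format with entries sorted by row."""
    # gather: fetch x[col] for every nonzero (independent per entry)
    gathered = [x[c] for c in cols]
    # transform: multiply each nonzero by its gathered value (independent per entry)
    products = [v * g for v, g in zip(vals, gathered)]
    # reduce_by_key: sum the products that belong to the same row
    row_ids, row_sums = reduce_by_key(rows, products)
    # scatter: write row sums into the result; rows without entries stay zero
    y = [0.0] * n_rows
    for r, s in zip(row_ids, row_sums):
        y[r] = s
    return y


# Hypothetical test matrix: tridiag(-1, 2, -1) with 4 unknowns (1D Poisson).
rows = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
cols = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

print(spmv_coo(4, rows, cols, vals, [1.0, 1.0, 1.0, 1.0]))
```

Each step touches every nonzero independently or combines them with a standard reduction, which is the kind of fine-grained work a throughput-oriented processor can spread over many threads.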
Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics while maintaining a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution, especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid, which has high flexibility and adapts well to the demands of modern computing platforms.
In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
V. Heuveline, D. Lukarski, N. Trost and J.-P. Weiss. Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs. EMCL Preprint Series No. 9. 2011.
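The idea behind a multi-colored Gauss-Seidel smoother can be sketched on a small model problem. The example below is an illustration under simplifying assumptions, not the paper's implementation: it uses the 1D matrix tridiag(-1, 2, -1), where two colors (even and odd indices) suffice, whereas an unstructured mesh would need a graph coloring of the matrix. The function names are hypothetical.

```python
def colored_gauss_seidel_sweep(u, b):
    """One two-color Gauss-Seidel sweep for tridiag(-1, 2, -1) u = b, in place."""
    n = len(u)
    for color in (0, 1):  # 0: even-indexed unknowns, 1: odd-indexed unknowns
        # An unknown of one color only reads neighbours of the other color,
        # so the iterations of this inner loop are independent of each other
        # and could all be executed in parallel.
        for i in range(color, n, 2):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            u[i] = (b[i] + left + right) / 2.0
    return u


def residual_norm(u, b):
    """Maximum norm of the residual b - A u for A = tridiag(-1, 2, -1)."""
    n = len(u)
    worst = 0.0
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(b[i] - (2.0 * u[i] - left - right)))
    return worst


n = 15
b = [1.0] * n
u = [0.0] * n
initial = residual_norm(u, b)
for _ in range(500):
    colored_gauss_seidel_sweep(u, b)
final = residual_norm(u, b)
print(initial, final)
```

Used as a stand-alone solver, as here, the sweep converges slowly. Inside a multigrid cycle only a few sweeps per level are applied to damp the high-frequency error components.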
TidePowerd has released Version 2 of their GPU computing solution for the .NET framework, GPU.NET. Their platform allows developers to quickly and easily write GPU-accelerated applications completely in .NET-based languages. Some key benefits include:
- Stay in C# and treat kernel methods like any regular method
- “Boilerplate” GPU programming tasks such as memory transfer and GPU scheduling are abstracted from the developer
- Cross-platform and cross-hardware with a single binary
- Systems seamlessly adapt to new hardware without rewriting code
- Speed on par with native code
New version 2 features:
- Visual Studio Error list and IntelliSense integration
- On-device random number generation
- Double precision support
A free 30-day evaluation license is available, as well as in-depth examples and tutorials.
Jacket 1.8 and LibJacket 1.1 have been released by AccelerEyes, enabling GPU support for MATLAB and easier CUDA development with C/C++/Fortran and Python. New features include:
- Expanded support for the Signal Processing, Image Processing, and Statistics Libraries included with both Jacket and LibJacket
- Faster linear algebra for special systems (e.g. symmetric, positive definite, triangular)
- Enhanced visualizations
- New and updated examples: FDTD, Mandelbrot fractals, maximum-likelihood neural segmentation, MDS for genomics
- Built with CUDA 4.0 for peak performance
Visit http://www.accelereyes.com/ for details, downloads, whitepapers and tutorials.
TunaCode is pleased to announce the release of CUVI (CUDA Vision and Imaging Library) version 0.5, which comes with a new API and new features. This release makes it even simpler to add acceleration to existing imaging applications, without any prior technical knowledge of GPUs. CUVI v0.5 is built from the ground up with performance and ease of use in mind.
CUVI version 0.5 can be downloaded from http://cuvilib.com. It is available for Windows (Win32, x64), with support for Linux and Mac planned.
We present a highly parallel implementation of the cross-correlation of time-series data using graphics processing units (GPUs), which is scalable to hundreds of independent inputs and suitable for the processing of signals from “Large-N” arrays of many radio antennas. The computational part of the algorithm, the X-engine, is implemented efficiently on Nvidia’s Fermi architecture, sustaining up to 79% of the peak single precision floating-point throughput. We compare performance obtained for hardware- and software-managed caches, observing significantly better performance for the latter. The high performance reported involves use of a multi-level data tiling strategy in memory and use of a pipelined algorithm with simultaneous computation and transfer of data from host to device memory. The speed of code development, flexibility, and low cost of the GPU implementations compared to ASIC and FPGA implementations have the potential to greatly shorten the cycle of correlator development and deployment, for cases where some power consumption penalty can be tolerated.
(M. A. Clark, P. C. La Plante, L. J. Greenhill: “Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units”, July 2011. [Preprint on ARXIV] [Sources on GITHUB])
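The cross-multiply-and-accumulate step at the heart of an X-engine can be sketched as follows. This is a sequential illustration only: it shows the arithmetic for every pair of inputs, not the tiling or pipelining strategy described in the abstract. The function name `x_engine`, the data layout, and the sample values are assumptions made for this example.

```python
def x_engine(samples):
    """Accumulate the correlation of every input pair over a block of samples.

    samples[t][i] is the complex sample of input i at time step t.
    Returns a dict mapping (i, j) with i <= j to sum_t x_i(t) * conj(x_j(t)).
    """
    n_inputs = len(samples[0])
    visibilities = {
        (i, j): 0j for i in range(n_inputs) for j in range(i, n_inputs)
    }
    for frame in samples:
        # Each (i, j) pair is an independent multiply-accumulate, which is
        # what makes the problem attractive for massively parallel hardware.
        for i in range(n_inputs):
            for j in range(i, n_inputs):
                visibilities[(i, j)] += frame[i] * frame[j].conjugate()
    return visibilities


# Two time steps of three hypothetical inputs.
samples = [
    [1 + 1j, 1 - 1j, 2 + 0j],
    [0 + 1j, 1 + 0j, 0 - 2j],
]
vis = x_engine(samples)
for pair in sorted(vis):
    print(pair, vis[pair])
```

For N inputs there are N(N+1)/2 such pairs including autocorrelations, so the work grows quadratically with the number of inputs.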
The Virtual School of Computational Science and Engineering (VSCSE) will offer a hands-on course for graduate students August 15-19:
Proven Algorithmic Techniques for Manycore Processors
This course will be delivered to a number of sites nationwide—including the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign—using high-definition video conferencing technologies. Students at all sites will be able to work with a cohort of fellow computational scientists, have access to local teaching assistants, and interact virtually with course instructors.
Registration for the weeklong course is $100. Please visit www.vscse.org for more information or hub.vscse.org to register.
A new alpha version of rCUDA 3.0 (Remote CUDA), the open-source package that allows CUDA calls to be performed on remote GPUs, has been released. Major improvements included in this new version are:
- Partially updated API to 4.0
- Added compatibility support with CUDA 4.0 environment
- Updated CUBLAS API to 4.0 for the most common CUBLAS routines
- Fixed some bugs
- General performance improvements
For further information, please visit the rCUDA webpage.
Functional magnetic resonance imaging (fMRI) makes it possible to non-invasively measure brain activity with high spatial resolution. There are however a number of issues that have to be addressed. One is the large amount of spatio-temporal data that needs to be processed. In addition to the statistical analysis itself, several preprocessing steps, such as slice timing correction and motion compensation, are normally applied. The high computational power of modern graphics cards has already successfully been used for MRI and fMRI. Going beyond the first published demonstration of GPU-based analysis of fMRI data, all the preprocessing steps and two statistical approaches, the general linear model (GLM) and canonical correlation analysis (CCA), have been implemented on a GPU. For an fMRI dataset of typical size (80 volumes with 64 x 64 x 22 voxels), all the preprocessing takes about 0.5 s on the GPU, compared to 5 s with an optimized CPU implementation and 120 s with the commonly used statistical parametric mapping (SPM) software. A random permutation test with 10 000 permutations, with smoothing in each permutation, takes about 50 s if three GPUs are used, compared to 0.5–2.5 h with an optimized CPU implementation. The presented work will save time for researchers and clinicians in their daily work and enables the use of more advanced analysis, such as non-parametric statistics, both for conventional fMRI and for real-time fMRI.
(Anders Eklund, Mats Andersson, Hans Knutsson: “fMRI Analysis on the GPU – Possibilities and Challenges”, Computer Methods and Programs in Biomedicine, 2011 [DOI])
Parametric statistical methods, such as Z-, t-, and F-values are traditionally employed in functional magnetic resonance imaging (fMRI) for identifying areas in the brain that are active with a certain degree of statistical significance. These parametric methods, however, have two major drawbacks. First, it is assumed that the observed data are Gaussian distributed and independent; assumptions that generally are not valid for fMRI data. Second, the statistical test distribution can be derived theoretically only for very simple linear detection statistics. With non-parametric statistical methods, the two limitations described above can be overcome. The major drawback of non-parametric methods is the computational burden with processing times ranging from hours to days, which so far have made them impractical for routine use in single subject fMRI analysis. In this work, it is shown how the computational power of cost-efficient Graphics Processing Units (GPUs) can be used to speed up random permutation tests. A test with 10 000 permutations takes less than a minute, making statistical analysis of advanced detection methods in fMRI practically feasible. To exemplify the permutation based approach, brain activity maps generated by the General Linear Model (GLM) and Canonical Correlation Analysis (CCA) are compared at the same significance level. During the development of the routines and writing of the paper, 3-4 years of processing time has been saved by using the GPU.
(Anders Eklund, Mats Andersson, Hans Knutsson: “Fast Random Permutation Tests Enable Objective Evaluation of Methods for Single Subject fMRI Analysis”, International Journal of Biomedical Imaging, Article ID 627947, 2011 [Youtube Video] [PDF])
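As a rough illustration of why random permutation tests are computationally heavy, the sketch below implements a generic two-sample permutation test for a difference in means. It is not the authors' fMRI pipeline, which permutes time series and recomputes a detection statistic for the whole brain volume in every permutation. The function name, the test statistic, the sample data, and the (count + 1) / (n + 1) p-value convention are assumptions chosen for this example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a p-value for the absolute difference in means by permutation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def statistic(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Each permutation relabels the data and recomputes the statistic.
        # The permutations are independent, so they can be processed in parallel.
        rng.shuffle(pooled)
        if statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements from two conditions.
active = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
rest = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
print(permutation_p_value(active, rest, n_permutations=2000))
```

No distributional assumption enters the test: the null distribution is built from the data itself, at the cost of recomputing the statistic once per permutation.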