fMRI Analysis on the GPU – Possibilities and Challenges

July 17th, 2011

Abstract:

Functional magnetic resonance imaging (fMRI) makes it possible to non-invasively measure brain activity with high spatial resolution. There are, however, a number of issues that have to be addressed. One is the large amount of spatio-temporal data that needs to be processed. In addition to the statistical analysis itself, several preprocessing steps, such as slice timing correction and motion compensation, are normally applied. The high computational power of modern graphics cards has already been used successfully for MRI and fMRI. Going beyond the first published demonstration of GPU-based analysis of fMRI data, all the preprocessing steps and two statistical approaches, the general linear model (GLM) and canonical correlation analysis (CCA), have been implemented on a GPU. For an fMRI dataset of typical size (80 volumes with 64 x 64 x 22 voxels), all the preprocessing takes about 0.5 s on the GPU, compared to 5 s with an optimized CPU implementation and 120 s with the commonly used statistical parametric mapping (SPM) software. A random permutation test with 10 000 permutations, with smoothing in each permutation, takes about 50 s if three GPUs are used, compared to 0.5 – 2.5 h with an optimized CPU implementation. The presented work will save time for researchers and clinicians in their daily work and enables the use of more advanced analysis, such as non-parametric statistics, both for conventional fMRI and for real-time fMRI.

(Anders Eklund, Mats Andersson, Hans Knutsson: “fMRI Analysis on the GPU – Possibilities and Challenges”, Computer Methods and Programs in Biomedicine, 2011 [DOI])

Fast Random Permutation Tests Enable Objective Evaluation of Methods for Single Subject fMRI Analysis

July 17th, 2011

Abstract:

Parametric statistical methods, such as Z-, t-, and F-values, are traditionally employed in functional magnetic resonance imaging (fMRI) for identifying areas in the brain that are active with a certain degree of statistical significance. These parametric methods, however, have two major drawbacks. First, it is assumed that the observed data are Gaussian distributed and independent; assumptions that generally are not valid for fMRI data. Second, the statistical test distribution can be derived theoretically only for very simple linear detection statistics. With non-parametric statistical methods, the two limitations described above can be overcome. The major drawback of non-parametric methods is the computational burden, with processing times ranging from hours to days, which so far has made them impractical for routine use in single subject fMRI analysis. In this work, it is shown how the computational power of cost-efficient Graphics Processing Units (GPUs) can be used to speed up random permutation tests. A test with 10 000 permutations takes less than a minute, making statistical analysis of advanced detection methods in fMRI practically feasible. To exemplify the permutation-based approach, brain activity maps generated by the General Linear Model (GLM) and Canonical Correlation Analysis (CCA) are compared at the same significance level. During the development of the routines and writing of the paper, 3-4 years of processing time have been saved by using the GPU.

(Anders Eklund, Mats Andersson, Hans Knutsson: “Fast Random Permutation Tests Enable Objective Evaluation of Methods for Single Subject fMRI Analysis”, International Journal of Biomedical Imaging, Article ID 627947, 2011 [Youtube Video] [PDF])
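The idea of a random permutation test described in the abstract above can be sketched in a few lines. This is only a minimal single-voxel illustration, not the authors' GPU implementation: the function names, the choice of absolute correlation as the test statistic, and the plain shuffling of time samples are assumptions made for the example. The published method applies further steps (such as smoothing in each permutation) that are not reproduced here.

```python
import random


def correlation(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(series, regressor, n_permutations=1000, seed=0):
    """Estimate a p-value for |correlation| by shuffling the time series.

    The null distribution is built empirically from the permuted data,
    so no Gaussian assumption about the test statistic is needed.
    """
    rng = random.Random(seed)
    observed = abs(correlation(series, regressor))
    shuffled = list(series)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(correlation(shuffled, regressor)) >= observed:
            exceed += 1
    # The +1 terms count the unpermuted data as one member of the null set.
    return (exceed + 1) / (n_permutations + 1)
```

Each permutation is independent of the others, which is what makes this kind of test a natural fit for parallel hardware.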

True 4D Image Denoising on the GPU

July 17th, 2011

Abstract:

The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose, i.e. noisy, computed tomography (CT) data. The medical imaging domain has seen a tremendous development during the last decades. It is now possible to collect time resolved volumes, i.e. 4D data, with a number of modalities (e.g. ultrasound (US), CT, magnetic resonance imaging (MRI)). While 3D image denoising previously has been applied to several volumes independently, there has not been much work done on true 4D image denoising, where the algorithm considers several volumes at the same time (and not a single volume at a time). By using all the dimensions, it is for example possible to remove some of the time varying reconstruction artefacts that exist in CT volumes. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset of the resolution 512 x 512 x 445 x 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT based filtering. Fast spatial filtering makes it possible to apply the denoising algorithm to larger datasets (compared to if FFT based filtering is used). The short processing time increases the clinical value of true 4D image denoising significantly.

(Anders Eklund, Mats Andersson, Hans Knutsson: “True 4D Image Denoising on the GPU”, International Journal of Biomedical Imaging, Article ID 952819, 2011 [Youtube Video] [PDF])

A GPU-based Interactive Bio-inspired Visual Clustering

July 12th, 2011

Abstract:

In this work, we present an interactive visual clustering approach for the exploration and analysis of vast volumes of data. Our proposed approach is a bio-inspired collective behavioral model to be used in a 3D graphics environment. Our paper illustrates an extension of the behavioral model for clustering and a parallel implementation, using Compute Unified Device Architecture to exploit the computational power of Graphics Processor Units (GPUs). The advantage of our approach is that, as data enters the environment, the user is directly involved in the data mining process. Our experiments illustrate the effectiveness and efficiency provided by our approach when applied to a number of real and synthetic data sets.

(U. Erra, B. Frola, and V. Scarano: “A GPU-based Interactive Bio-inspired Visual Clustering”, Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining. Paris, France. April 11-15, 2011 [PDF] [Video])

Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs — The Power(q)-pattern Method

July 8th, 2011

Abstract:

Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In such a way, parallelism can be exploited on the block-level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.

We propose a new method for anticipating the fill-in pattern of ILU(p) schemes which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix A subject to a predetermined pattern given by the matrix power |A|^(p+1) and its associated multi-coloring permutation pi. We prove that the obtained sparsity pattern is a superset of our modified ILU(p) factorization applied to pi A pi^(-1). As a result, this modified ILU(p) applied to multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU(p) sweeps.

In addition, we describe the integration of the preconditioners into the HiFlow3 open-source finite element package that provides a portable software solution across diverse hardware platforms. On this basis, we conduct performance analysis across a variety of test problems on multi-core CPUs and GPUs that proves efficiency, scalability and flexibility of our approach. Our preconditioners achieve a solver acceleration by a factor of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are by a factor of up to 4 faster than an OpenMP parallel version on eight cores.

(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: “Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs — The Power(q)-pattern Method”, EMCL Preprint Series, number 08, July 2011 [PDF])
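The multi-coloring idea mentioned in this abstract, re-ordering so that the diagonal blocks become diagonal, can be illustrated with a small greedy coloring of the matrix sparsity pattern. This is a minimal sketch under stated assumptions, not the scheme used in the paper or in HiFlow3: the function names and the list-of-sets pattern format are invented for the example, and a simple first-fit greedy strategy is used.

```python
def greedy_multicoloring(pattern):
    """Assign a color to every row so that coupled rows get different colors.

    pattern[i] is the set of column indices holding nonzeros in row i.
    The pattern is symmetrized first, so rows i and j count as coupled
    if either a_ij or a_ji is nonzero.
    """
    n = len(pattern)
    neighbors = [set() for _ in range(n)]
    for i, cols in enumerate(pattern):
        for j in cols:
            if i != j:
                neighbors[i].add(j)
                neighbors[j].add(i)

    colors = [-1] * n
    for i in range(n):
        used = {colors[j] for j in neighbors[i] if colors[j] >= 0}
        color = 0
        while color in used:
            color += 1  # first-fit: take the smallest free color
        colors[i] = color
    return colors


def coloring_permutation(colors):
    """Row ordering that groups rows of equal color into contiguous blocks."""
    return sorted(range(len(colors)), key=lambda i: (colors[i], i))
```

Rows sharing a color have no off-diagonal coupling with each other, so after applying the permutation each diagonal block contains only diagonal entries and can be processed in parallel. For a tridiagonal pattern the sketch produces the familiar two-color (red-black) ordering.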

Parallel Solution of Sparse Triangular Linear Systems

June 26th, 2011

Abstract:

A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel. The numerical experiments are also presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods, using the parallel sparse triangular solve algorithm, can achieve on average more than 2x speedup on graphical processing units (GPUs) over their CPU implementation.

(Maxim Naumov: “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”, NVIDIA Technical Report, June 2011. [WWW])
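The two phases described in this abstract can be sketched as follows for a lower triangular system. This is a serial illustration of the level-scheduling idea only, not the report's GPU code: the function names and the row-wise dictionary storage are assumptions made for the example, and the inner loop over a level is where a parallel implementation would process rows concurrently.

```python
def build_levels(rows):
    """Analysis phase: group rows of a lower triangular matrix into levels.

    rows[i] maps column index -> value for row i (diagonal included).
    A row's level is one more than the highest level among the rows it
    depends on, so rows within one level are mutually independent.
    """
    n = len(rows)
    level_of = [0] * n
    for i in range(n):
        deps = [level_of[j] for j in rows[i] if j < i]
        level_of[i] = 1 + max(deps) if deps else 0

    levels = [[] for _ in range(max(level_of) + 1)] if n else []
    for i in range(n):
        levels[level_of[i]].append(i)
    return levels


def solve_lower(rows, b, levels):
    """Solve phase: walk the levels in order and solve each row in a level."""
    x = [0.0] * len(b)
    for level in levels:           # sequential across levels
        for i in level:            # independent rows: parallel on a GPU
            partial = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - partial) / rows[i][i]
    return x
```

The analysis result depends only on the sparsity pattern, so it can be reused for every solve with the same matrix, which is the common case inside a preconditioned iterative method.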

Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems

June 26th, 2011

Abstract:

This paper describes the approach and the speedup obtained in performing Smith-Waterman database searches on heterogeneous platforms comprising multi-core CPU and multi-GPU systems. Most of the advanced and optimized Smith-Waterman algorithm versions, viz., SWPS3 based on x86 SSE2 instructions and the CUDASW++ v2.0 CUDA implementation on GPU, have demonstrated remarkable speedup over NCBI BLAST versions. This work proposes a hybrid Smith-Waterman algorithm that integrates the state-of-the-art CPU and GPU solutions for accelerating the Smith-Waterman algorithm, in which the GPU acts as a co-processor and shares the workload with the CPU, enabling us to realize remarkable performance of over 70 GCUPS resulting from simultaneous CPU-GPU execution. In this work, both CPU and GPU are graded equally in performance for Smith-Waterman, in contrast to previous approaches of porting the computationally intensive portions onto the GPUs or a naive multi-core CPU approach.

(J. Singh and I. Aruni: “Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems”, Proceedings of Bioinformatics and Biomedical Engineering (iCBBE), May 2011. [DOI])
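For readers unfamiliar with the underlying computation and the GCUPS metric quoted above, here is a minimal sketch of the Smith-Waterman local alignment score. It is a plain reference version with assumed parameters (a fixed match/mismatch score and a linear gap penalty); the function names are invented for the example, and production tools such as those named in the abstract typically use substitution matrices, affine gap penalties, and vectorized or GPU kernels.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere (local alignment).
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second: matrix cells filled divided by run time."""
    return query_length * database_residues / seconds / 1e9
```

Every query/database pair fills its own scoring matrix, so a database search splits naturally into independent work items that can be divided between CPU cores and GPUs.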

Scalable instruction set simulator for thousand-core architectures running on GPGPUs.

June 26th, 2011

Abstract:

Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent potential processing parallelism available in modern GPGPUs.

(S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp.459-466, June/July 2010. [DOI] [WWW])

GPIUTMD 0.9.6 released

June 26th, 2011

GPIUTMD stands for Graphic Processors at Isfahan University of Technology for Many-particle Dynamics. It performs general-purpose many-particle dynamic simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to thousands of cores on a fast cluster. Flexible and configurable, GPIUTMD is currently being used for all atom and coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants; dissipative particle dynamics simulations (DPD) of polymers; and crystallization of metals using EAM potentials. GPIUTMD 0.9.6 adds many new features. Highlights include:

  • Morse bond potential
  • Constant acceleration applied to a group of particles (useful for modeling gravity effects)
  • Computation of the full virial stress tensor (useful in mechanical characterization of materials)
  • Long-ranged electrostatics via PPPM
  • Support for CUDA 3.2
  • Theory manual
  • Up to a twenty percent performance boost in simulations
  • and more

A demo version of GPIUTMD 0.9.6 will be available soon for download under an open source license. Check out the quick start tutorial to get started, or check out the full documentation to see everything it can do.

 

CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications

June 26th, 2011

Abstract:

We propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application program without any modification or recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to another process called an API proxy, and the API proxy invokes the API function; two processes, an application process and an API proxy, are launched for an OpenCL application. In this case, as the application process is not an OpenCL process but a standard process, it can be safely checkpointed. While CheCL intercepts all API calls, it records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but instead holds CheCL handles that keep such information. Those handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs including MPI applications, and quantitatively evaluates the runtime overheads. It is also discussed that CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as a CPU and a GPU.

(Hiroyuki Takizawa, Kentaro Koyama, Katuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi: “CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications”, Proceedings of International Parallel and Distributed Processing Symposium (IPDPS11), 2011. [PDF])
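The handle indirection described in this abstract can be illustrated with a toy model. Everything here is hypothetical: `FakeDevice` stands in for a real OpenCL runtime, `ApiProxy` and its methods are invented names, and both live in one process for brevity, whereas the abstract describes a separate proxy process. The sketch only shows why stable application-side handles plus a record of creation parameters make restart possible.

```python
class FakeDevice:
    """Hypothetical stand-in for a compute runtime whose handles are not portable."""

    def __init__(self, first_handle):
        self._next_handle = first_handle
        self._buffers = {}

    def create_buffer(self, size):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def write_buffer(self, handle, data):
        if len(data) > len(self._buffers[handle]):
            raise ValueError("data larger than buffer")
        self._buffers[handle][:len(data)] = data

    def read_buffer(self, handle):
        return bytes(self._buffers[handle])


class ApiProxy:
    """Forwards calls to the device and hands out stable handles to the application."""

    def __init__(self, device):
        self._device = device
        self._real = {}    # stable handle -> current device handle
        self._sizes = {}   # stable handle -> recorded creation parameter
        self._next_stable = 1

    def create_buffer(self, size):
        stable = self._next_stable
        self._next_stable += 1
        self._real[stable] = self._device.create_buffer(size)
        self._sizes[stable] = size
        return stable

    def write_buffer(self, stable, data):
        self._device.write_buffer(self._real[stable], data)

    def read_buffer(self, stable):
        return self._device.read_buffer(self._real[stable])

    def checkpoint(self):
        """Snapshot creation parameters and contents, keyed by stable handle."""
        return {s: (size, self.read_buffer(s)) for s, size in self._sizes.items()}

    @classmethod
    def restart(cls, snapshot, device):
        """Recreate every recorded object on a (possibly different) device."""
        proxy = cls(device)
        for stable, (size, contents) in sorted(snapshot.items()):
            proxy._real[stable] = device.create_buffer(size)
            proxy._sizes[stable] = size
            device.write_buffer(proxy._real[stable], contents)
        proxy._next_stable = max(snapshot, default=0) + 1
        return proxy
```

Because the application only ever sees the stable handles, it can keep using them after a restart even though the underlying device handles have changed.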

 
