July 29th, 2011
TidePowerd has released Version 2 of their GPU computing solution for the .NET framework, GPU.NET. Their platform allows developers to quickly and easily write GPU-accelerated applications completely in .NET-based languages. Some key benefits include:
- Stay in C# and treat kernel methods like any regular method
- “Boilerplate” GPU programming tasks such as memory transfer and GPU scheduling are abstracted from the developer
- Cross-platform and cross-hardware with a single binary
- Systems seamlessly adapt to new hardware without rewriting code
- Speed on par with native code
New version 2 features:
- Visual Studio Error list and IntelliSense integration
- On-device random number generation
- Double precision support
A free 30-day evaluation license is available, as well as in-depth examples and tutorials.
Posted in Business, Developer Resources | Tags: .NET, C#, NVIDIA CUDA, Tools | 1 Comment
July 24th, 2011
AccelerEyes has released Jacket 1.8 and LibJacket 1.1, enabling GPU support for MATLAB and easier CUDA development with C/C++/Fortran and Python. New features include:
- Expanded support for the Signal Processing, Image Processing, and Statistics Libraries included with both Jacket and LibJacket
- Faster linear algebra for special systems (e.g. symmetric, positive definite, triangular)
- Enhanced visualizations
- New and updated examples: FDTD, Mandelbrot fractals, maximum-likelihood neural segmentation, MDS for genomics
- Built with CUDA 4.0 for peak performance
Visit http://www.accelereyes.com/ for details, downloads, whitepapers and tutorials.
Posted in Business, Developer Resources | Tags: Fortran, Libraries, MATLAB, NVIDIA CUDA, Programming Environments, Python
July 24th, 2011
TunaCode is pleased to announce the release of CUVI (CUDA Vision and Imaging Library) version 0.5, which comes with a new API and new features. This release makes it even simpler to add acceleration to existing imaging applications, without any prior technical knowledge of GPUs. CUVI v0.5 is built from the bottom up with performance and ease of use in mind.
CUVI version 0.5 can be downloaded from http://cuvilib.com for Windows (Win32, x64); support for Linux and Mac is planned.
Posted in Business, Developer Resources | Tags: Image Processing, Libraries, NVIDIA CUDA
July 22nd, 2011
Abstract:
We present a highly parallel implementation of the cross-correlation of time-series data using graphics processing units (GPUs), which is scalable to hundreds of independent inputs and suitable for the processing of signals from “Large-N” arrays of many radio antennas. The computational part of the algorithm, the X-engine, is implemented efficiently on Nvidia’s Fermi architecture, sustaining up to 79% of the peak single precision floating-point throughput. We compare performance obtained for hardware- and software-managed caches, observing significantly better performance for the latter. The high performance reported involves use of a multi-level data tiling strategy in memory and use of a pipelined algorithm with simultaneous computation and transfer of data from host to device memory. The speed of code development, flexibility, and low cost of the GPU implementations compared to ASIC and FPGA implementations have the potential to greatly shorten the cycle of correlator development and deployment, for cases where some power consumption penalty can be tolerated.
(M. A. Clark, P. C. La Plante, L. J. Greenhill: “Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units”, July 2011. [Preprint on ARXIV] [Sources on GITHUB])
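As a rough illustration of the correlate-and-accumulate step that an X-engine performs, here is a minimal CPU sketch in plain Python. It is not the authors' code: the function name `x_engine` and the input layout are assumptions made for this example, and the tiling, caching and pipelining described in the abstract are not shown.

```python
def x_engine(samples):
    """Correlate every pair of inputs and accumulate over time.

    samples[t][i] is the complex sample of input i at time step t
    (a hypothetical layout chosen for this sketch).
    Returns {(i, j): sum over t of samples[t][i] * conj(samples[t][j])}
    for all pairs with i <= j.
    """
    n_inputs = len(samples[0])
    visibilities = {(i, j): 0j
                    for i in range(n_inputs)
                    for j in range(i, n_inputs)}
    for frame in samples:
        for i in range(n_inputs):
            for j in range(i, n_inputs):
                visibilities[(i, j)] += frame[i] * frame[j].conjugate()
    return visibilities


# Two inputs, two time steps.
samples = [[1 + 1j, 2 + 0j],
           [0 + 1j, 1 - 1j]]
vis = x_engine(samples)
```

The number of pairs grows as N(N+1)/2 with the number of inputs N, which is why the workload suits a highly parallel device for “Large-N” arrays.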
Posted in Research | Tags: Astronomy, Cross-correlation, NVIDIA CUDA, Papers | 2 Comments
July 17th, 2011
A new alpha release of rCUDA 3.0 (Remote CUDA), the open-source package that allows performing CUDA calls to remote GPUs, is now available. Major improvements in this version are:
- Partially updated API to 4.0
- Added compatibility support with CUDA 4.0 environment
- Updated CUBLAS API to 4.0 for the most common CUBLAS routines
- Fixed some bugs
- General performance improvements
For further information, please visit the rCUDA webpage.
Posted in Developer Resources | Tags: Clusters, High-Performance Computing, Libraries, NVIDIA CUDA, Virtualisation
July 17th, 2011
Abstract:
Functional magnetic resonance imaging (fMRI) makes it possible to non-invasively measure brain activity with high spatial resolution. There are, however, a number of issues that have to be addressed. One is the large amount of spatio-temporal data that needs to be processed. In addition to the statistical analysis itself, several preprocessing steps, such as slice timing correction and motion compensation, are normally applied. The high computational power of modern graphics cards has already successfully been used for MRI and fMRI. Going beyond the first published demonstration of GPU-based analysis of fMRI data, all the preprocessing steps and two statistical approaches, the general linear model (GLM) and canonical correlation analysis (CCA), have been implemented on a GPU. For an fMRI dataset of typical size (80 volumes with 64 x 64 x 22 voxels), all the preprocessing takes about 0.5 s on the GPU, compared to 5 s with an optimized CPU implementation and 120 s with the commonly used statistical parametric mapping (SPM) software. A random permutation test with 10 000 permutations, with smoothing in each permutation, takes about 50 s if three GPUs are used, compared to 0.5 – 2.5 h with an optimized CPU implementation. The presented work will save time for researchers and clinicians in their daily work and enables the use of more advanced analysis, such as non-parametric statistics, both for conventional fMRI and for real-time fMRI.
(Anders Eklund, Mats Andersson, Hans Knutsson: “fMRI Analysis on the GPU – Possibilities and Challenges”, Computer Methods and Programs in Biomedicine, 2011 [DOI])
Posted in Research | Tags: Image Processing, Medical Imaging, NVIDIA CUDA, Papers
July 17th, 2011
Abstract:
Parametric statistical methods, such as Z-, t-, and F-values, are traditionally employed in functional magnetic resonance imaging (fMRI) for identifying areas in the brain that are active with a certain degree of statistical significance. These parametric methods, however, have two major drawbacks. First, it is assumed that the observed data are Gaussian distributed and independent; assumptions that generally are not valid for fMRI data. Second, the statistical test distribution can be derived theoretically only for very simple linear detection statistics. With non-parametric statistical methods, the two limitations described above can be overcome. The major drawback of non-parametric methods is the computational burden, with processing times ranging from hours to days, which so far has made them impractical for routine use in single subject fMRI analysis. In this work, it is shown how the computational power of cost-efficient Graphics Processing Units (GPUs) can be used to speed up random permutation tests. A test with 10 000 permutations takes less than a minute, making statistical analysis of advanced detection methods in fMRI practically feasible. To exemplify the permutation based approach, brain activity maps generated by the General Linear Model (GLM) and Canonical Correlation Analysis (CCA) are compared at the same significance level. During the development of the routines and writing of the paper, 3-4 years of processing time has been saved by using the GPU.
(Anders Eklund, Mats Andersson, Hans Knutsson: “Fast Random Permutation Tests Enable Objective Evaluation of Methods for Single Subject fMRI Analysis”, International Journal of Biomedical Imaging, Article ID 627947, 2011 [Youtube Video] [PDF])
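To show the general idea of a random permutation test, here is a minimal sketch in plain Python. It swaps in a much simpler statistic than the paper uses: a two-group difference in means on made-up numbers, rather than GLM or CCA activity maps on fMRI time series. The function name `permutation_test`, its parameters and the data are all assumptions for this example.

```python
import random


def permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Estimate a p-value for the absolute difference in group means.

    The null distribution is built empirically by reshuffling the
    pooled values, so no Gaussian assumption is needed.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point round-off.
        if mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Count the observed labelling itself as one permutation.
    return (hits + 1) / (n_perm + 1)


# Hypothetical measurements for two conditions.
active = [5.1, 4.9, 5.3, 5.0, 5.2]
rest = [1.0, 1.2, 0.9, 1.1, 0.8]
p_value = permutation_test(active, rest)
```

Each permutation is independent of the others, which is what makes the method a natural fit for GPUs; the cost comes from repeating the full analysis thousands of times.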
Posted in Research | Tags: Image Processing, Medical Imaging, NVIDIA CUDA, Papers
July 17th, 2011
Abstract:
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose, i.e. noisy, computed tomography (CT) data. The medical imaging domain has seen a tremendous development during the last decades. It is now possible to collect time resolved volumes, i.e. 4D data, with a number of modalities (e.g. ultrasound (US), CT, magnetic resonance imaging (MRI)). While 3D image denoising previously has been applied to several volumes independently, there has not been much work done on true 4D image denoising, where the algorithm considers several volumes at the same time (and not a single volume at a time). By using all the dimensions, it is for example possible to remove some of the time varying reconstruction artefacts that exist in CT volumes. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset of the resolution 512 x 512 x 445 x 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT based filtering. Fast spatial filtering makes it possible to apply the denoising algorithm to larger datasets (compared to if FFT based filtering is used). The short processing time increases the clinical value of true 4D image denoising significantly.
(Anders Eklund, Mats Andersson, Hans Knutsson: “True 4D Image Denoising on the GPU”, International Journal of Biomedical Imaging, Article ID 952819, 2011 [Youtube Video] [PDF])
Posted in Research | Tags: Image Processing, Medical Imaging, NVIDIA CUDA, Papers | 1 Comment
July 12th, 2011
Abstract:
In this work, we present an interactive visual clustering approach for the exploration and analysis of vast volumes of data. Our proposed approach is a bio-inspired collective behavioral model to be used in a 3D graphics environment. Our paper illustrates an extension of the behavioral model for clustering and a parallel implementation, using Compute Unified Device Architecture to exploit the computational power of Graphics Processor Units (GPUs). The advantage of our approach is that, as data enters the environment, the user is directly involved in the data mining process. Our experiments illustrate the effectiveness and efficiency provided by our approach when applied to a number of real and synthetic data sets.
(U. Erra, B. Frola, and V. Scarano: “A GPU-based Interactive Bio-inspired Visual Clustering”, Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining. Paris, France. April 11-15, 2011 [PDF] [Video])
Posted in Research | Tags: Data Mining, NVIDIA CUDA, Papers
July 8th, 2011
Abstract:
Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In such a way, parallelism can be exploited on the block-level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.
We propose a new method for anticipating the fill-in pattern of ILU(p) schemes which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix A subject to a predetermined pattern given by the matrix power |A|^{p+1} and its associated multi-coloring permutation π. We prove that the obtained sparsity pattern is a superset of our modified ILU(p) factorization applied to π A π^{-1}. As a result, this modified ILU(p) applied to multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU(p) sweeps.
In addition, we describe the integration of the preconditioners into the HiFlow3 open-source finite element package that provides a portable software solution across diverse hardware platforms. On this basis, we conduct performance analysis across a variety of test problems on multi-core CPUs and GPUs that proves efficiency, scalability and flexibility of our approach. Our preconditioners achieve a solver acceleration by a factor of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are up to 4 times faster than an OpenMP parallel version on eight cores.
(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: “Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs — The Power(q)-pattern Method”, EMCL Preprint Series, number 08, July 2011 [PDF])
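The multi-coloring idea from the abstract can be illustrated with a small sketch in plain Python. It is not the authors' implementation and does not perform any ILU factorization: it only computes the sparsity pattern of a matrix power, colors the unknowns greedily, and checks that unknowns sharing a color are uncoupled, so the diagonal blocks of the re-ordered matrix are diagonal. The function names, the greedy coloring strategy and the small tridiagonal test matrix are assumptions for this example.

```python
def power_pattern(A, k):
    """Nonzero positions of |A|^k, tracked as booleans so nothing cancels."""
    n = len(A)
    base = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    result = base
    for _ in range(k - 1):
        result = [[any(result[i][m] and base[m][j] for m in range(n))
                   for j in range(n)] for i in range(n)]
    return {(i, j) for i in range(n) for j in range(n) if result[i][j]}


def greedy_multicoloring(pattern, n):
    """Give each unknown the smallest color not used by a coupled unknown."""
    colors = [None] * n
    for i in range(n):
        used = {colors[j] for j in range(n)
                if j != i and colors[j] is not None
                and ((i, j) in pattern or (j, i) in pattern)}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors


def diagonal_blocks_are_diagonal(pattern, colors):
    """True if no off-diagonal entry couples two unknowns of the same color."""
    return all(i == j or colors[i] != colors[j] for (i, j) in pattern)


# Hypothetical test matrix: 1D Laplacian stencil (2 on the diagonal, -1 beside it).
n = 6
A = [[2 if i == j else -1 if abs(i - j) == 1 else 0 for j in range(n)]
     for i in range(n)]

p = 1                                   # fill-in level, as in ILU(p)
pattern = power_pattern(A, p + 1)       # pattern of |A|^(p+1)
colors = greedy_multicoloring(pattern, n)
# Permutation that groups unknowns by color (stable within each color).
perm = sorted(range(n), key=lambda i: colors[i])
```

Coloring the pattern of A itself needs two colors for this matrix, while the denser pattern of |A|^2 needs three: a denser pattern costs more colors, and therefore smaller parallel blocks.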
Posted in Research | Tags: Multicore, NVIDIA CUDA, Papers, Sparse Linear Systems