PARALUTION is a library of sparse iterative methods that run on various parallel devices, including multi-core CPUs, GPUs (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:
- Windows support – full Windows support for all backends (CUDA, OpenCL, OpenMP)
- Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
- Direct (dense) solvers (for very small problems)
- (Restricted) Additive Schwarz preconditioners
- MATLAB/Octave plug-in
To avoid OpenMP overhead for small sized problems, the library will compute in serial if the size of the matrix/vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified for simplified cross platform compilation.
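The serial fallback described above can be sketched with OpenMP's `if` clause. This is only an illustration of the idea, not PARALUTION's actual code: the function `axpy` and the constant `kOmpThreshold` (including its value) are hypothetical.

```cpp
#include <vector>

// Hypothetical threshold; the announcement does not state the library's value.
constexpr long kOmpThreshold = 10000;

// Computes y = alpha * x + y. The `if` clause makes the loop run on a single
// thread when the vector is shorter than the threshold, so small problems do
// not pay for distributing work across threads.
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for if (n >= kOmpThreshold)
    for (long i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.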
A new version of the GPU-profiler for the CUDA software stack is available at www.lab4241.com. The GPU-profiler delivers performance information from inside kernels, attributed to the C++ source code, in a simple, intuitive way similar to well-known CPU profilers such as Quantify or Valgrind. The new version, GPUPROF version 0.3 (beta), includes improved stability, refined memory tracing, temporal memory analysis, and CUDA API-driver call tracing.
Spectral unmixing is an important task in remotely sensed hyperspectral data exploitation. The linear mixture model has been widely used to unmix hyperspectral images by identifying a set of pure spectral signatures, called endmembers, and estimating their respective abundances in each pixel of the scene. Several algorithms have been proposed in the recent literature to automatically identify endmembers, even if the original hyperspectral scene does not contain any pure signatures. A popular strategy for endmember identification in highly mixed hyperspectral scenes has been the minimum volume simplex analysis (MVSA), known to be a computationally very expensive algorithm. This algorithm calculates the minimum volume enclosing simplex, as opposed to other algorithms that perform maximum simplex volume analysis (MSVA). The high computational complexity of MVSA, together with its very high memory requirements, has limited its adoption in the hyperspectral imaging community. In this paper we develop several optimizations to the MVSA algorithm. The main computational task of MVSA is the solution of a quadratic optimization problem with equality and inequality constraints, with the inequality constraints being in the order of the number of pixels multiplied by the number of endmembers. As a result, storing and computing the inequality constraint matrix is highly inefficient. The first optimization presented in this paper uses algebra operations in order to reduce the memory requirements of the algorithm. In the second optimization, we use graphics processing units (GPUs) to effectively solve (in parallel) the quadratic optimization problem involved in the computation of MVSA. In the third optimization, we extend the single GPU implementation to a multi-GPU one, developing a hybrid strategy that distributes the computation while taking advantage of GPU accelerators at each node. 
The presented optimizations are tested in different analysis scenarios (using both synthetic and real hyperspectral data) and shown to provide state-of-the-art results from the viewpoint of unmixing accuracy and computational performance. The speedup achieved using the full GPU cluster compared to the CPU implementation is tenfold on a real hyperspectral image.
(A. Agathos, J. Li, D. Petcu and A. Plaza: “Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing”. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, accepted for publication, 2014. [PDF])
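For reference, the linear mixture model mentioned in the abstract is commonly written as below. The notation is an assumption of this note and may differ from the paper's: y is the observed spectrum of one pixel, the columns of M are the endmember signatures, a holds the abundances, and n is noise.

```latex
\mathbf{y} = \mathbf{M}\,\mathbf{a} + \mathbf{n},
\qquad \mathbf{a} \ge \mathbf{0},
\qquad \mathbf{1}^{T}\mathbf{a} = 1
```

Under these two constraints every noise-free pixel lies inside the simplex whose vertices are the endmembers, which is why minimum-volume methods search for the smallest simplex that encloses the data.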
A new version of the rCUDA middleware has been released (version 4.1). In addition to fixing some bugs related to asynchronous memory transfers, the new release provides support for:
- CUDA 5.5 Runtime API
- Mellanox Connect-IB network adapters
- Dynamic Parallelism
- cuFFT and cuBLAS libraries
The rCUDA middleware lets you seamlessly use, within your cluster, GPUs installed in computing nodes other than the one executing the CUDA application, without modifying or recompiling your program. Please visit www.rcuda.net for more details about the rCUDA technology.
A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore there are distortions that have to be corrected. In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A CPU-based component prepares data for the GPU which is multi-threaded to ensure maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system do work in parallel. The imaging tool is designed such that disk I/O and work on CPU and GPUs is done concurrently. Test cases show that the imaging tool performs nearly 100× faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.
(Daniel Muscat: “High-Performance Image Synthesis for Radio Interferometry”. Preprint, 2014. [arXiv])
Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can lead to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, fully exploiting Nvidia’s Fermi GPU architecture. We show how a considerable computational speedup is achieved on GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae.
(Nobile M.S., Cazzaniga P., Besozzi D., Pescini D., Mauri G.: “cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems”. PLoS ONE 9(3): e91963. [DOI])
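To make the basic tau-leaping update concrete, here is a minimal sequential sketch for a toy reversible reaction A <-> B. It is not taken from cuTauLeaping: the names `State`, `tau_leap_step` and `simulate` are hypothetical, the step size tau is fixed rather than adapted, and clamping the number of firings is a crude stand-in for proper step-size control.

```cpp
#include <algorithm>
#include <random>

// Molecule counts of a toy two-species system with reactions A -> B and B -> A.
struct State {
    long a;
    long b;
};

// One leap of fixed length tau. c1 and c2 are the stochastic rate constants of
// A -> B and B -> A. Each reaction fires a Poisson-distributed number of times
// with mean propensity * tau. Firings are clamped to the molecules available,
// a simplification that keeps populations non-negative.
template <class Rng>
State tau_leap_step(State s, double c1, double c2, double tau, Rng& rng) {
    const double a1 = c1 * static_cast<double>(s.a);  // propensity of A -> B
    const double a2 = c2 * static_cast<double>(s.b);  // propensity of B -> A
    long k1 = 0, k2 = 0;
    if (a1 > 0.0) k1 = std::poisson_distribution<long>(a1 * tau)(rng);
    if (a2 > 0.0) k2 = std::poisson_distribution<long>(a2 * tau)(rng);
    k1 = std::min(k1, s.a);
    k2 = std::min(k2, s.b);
    return State{s.a - k1 + k2, s.b + k1 - k2};
}

// Runs `steps` leaps from an initial state using a seeded generator.
State simulate(State s, double c1, double c2, double tau, int steps, unsigned seed) {
    std::mt19937 rng(seed);
    for (int i = 0; i < steps; ++i) {
        s = tau_leap_step(s, c1, c2, tau, rng);
    }
    return s;
}
```

Because each call to `simulate` depends only on its own seed and parameters, many such runs are independent of one another, which is the property the abstract relies on for massive parallelization.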
Hybrid structure fitting methods combine data from cryo-electron microscopy and X-ray crystallography with molecular dynamics simulations for the determination of all-atom structures of large biomolecular complexes. Evaluating the quality-of-fit obtained from hybrid fitting is computationally demanding, particularly in the context of a multiplicity of structural conformations that must be evaluated. Existing tools for quality-of-fit analysis and visualization have previously targeted small structures and are too slow to be used interactively for large biomolecular complexes of particular interest today such as viruses or for long molecular dynamics trajectories as they arise in protein folding. We present new data-parallel and GPU-accelerated algorithms for rapid interactive computation of quality-of-fit metrics linking all-atom structures and molecular dynamics trajectories to experimentally-determined density maps obtained from cryo-electron microscopy or X-ray crystallography. We evaluate the performance and accuracy of the new quality-of-fit analysis algorithms vis-a-vis existing tools, examine algorithm performance on GPU-accelerated desktop workstations and supercomputers, and describe new visualization techniques for results of hybrid structure fitting methods.
(John E. Stone, Ryan McGreevy, Barry Isralewitz, and Klaus Schulten: “GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting”. Faraday Discussions 169, 2014. [DOI])
This blog post explains GPU Boost, a new user controllable feature available on Tesla GPUs. Case studies and benchmarks for reverse time migration and an electromagnetic solver are discussed.
In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit (GPU). The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture (CUDA) program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512x512x296 voxels. We find that exploitation of on-chip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitation of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of seconds, will be useful to realize rapid nonrigid registration.
(Kei Ikeda, Fumihiko Ino, and Kenichi Hagihara: “Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA”. Accepted for publication in the IEEE Journal of Biomedical and Health Informatics. [DOI])
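The abstract does not spell out the NMI formula. A common definition in image registration, assumed here, is NMI = (H(A) + H(B)) / H(A, B), computed from a joint intensity histogram. The serial sketch below illustrates that definition only; the function names are hypothetical and it says nothing about the paper's on-chip memory scheme.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy (in nats) of a histogram given as raw counts.
double entropy(const std::vector<double>& counts, double total) {
    double h = 0.0;
    for (double c : counts) {
        if (c > 0.0) {
            const double p = c / total;
            h -= p * std::log(p);
        }
    }
    return h;
}

// NMI of two equally sized images whose intensities are already quantised to
// the range [0, bins). Assumes the joint entropy is non-zero, i.e. the images
// are not both constant.
double nmi(const std::vector<int>& a, const std::vector<int>& b, int bins) {
    const std::size_t nb = static_cast<std::size_t>(bins);
    std::vector<double> joint(nb * nb, 0.0);
    std::vector<double> ha(nb, 0.0), hb(nb, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::size_t ia = static_cast<std::size_t>(a[i]);
        const std::size_t ib = static_cast<std::size_t>(b[i]);
        joint[ia * nb + ib] += 1.0;  // joint histogram
        ha[ia] += 1.0;               // marginal histogram of image a
        hb[ib] += 1.0;               // marginal histogram of image b
    }
    const double n = static_cast<double>(a.size());
    return (entropy(ha, n) + entropy(hb, n)) / entropy(joint, n);
}
```

With this definition, two identical images give an NMI of 2 and two statistically independent images give 1, so registration seeks the deformation that maximises the value.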
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
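The classic sequential form of reuse distance that the paper starts from can be sketched as follows. This is a simplified illustration with hypothetical names, assuming a fully associative LRU cache; none of the GPU-specific extensions listed in the abstract (thread hierarchy, latencies, associativity, miss-status holding-registers, divergence) are modelled.

```cpp
#include <cstddef>
#include <limits>
#include <list>
#include <vector>

// Marks a first-time access, which has no previous use to measure from.
constexpr std::size_t kInfinite = std::numeric_limits<std::size_t>::max();

// Reuse (stack) distance of each access: the number of distinct cache lines
// touched since the previous access to the same line.
std::vector<std::size_t> reuse_distances(const std::vector<unsigned long>& lines) {
    std::list<unsigned long> stack;  // most recently used line at the front
    std::vector<std::size_t> out;
    out.reserve(lines.size());
    for (unsigned long line : lines) {
        std::size_t depth = 0;
        auto it = stack.begin();
        while (it != stack.end() && *it != line) {
            ++it;
            ++depth;
        }
        if (it == stack.end()) {
            out.push_back(kInfinite);
        } else {
            out.push_back(depth);
            stack.erase(it);  // the line moves back to the top of the stack
        }
        stack.push_front(line);
    }
    return out;
}

// Misses predicted for a fully associative LRU cache holding `capacity` lines:
// an access misses when its reuse distance is at least the capacity.
std::size_t predicted_misses(const std::vector<unsigned long>& lines,
                             std::size_t capacity) {
    std::size_t misses = 0;
    for (std::size_t d : reuse_distances(lines)) {
        if (d >= capacity) ++misses;
    }
    return misses;
}
```

One pass over the address trace yields the distances, from which the miss count for any cache capacity follows, which is what makes this kind of model cheaper than simulating each cache configuration separately.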