The vast majority of current state-of-the-art compressed GPU volume renderers are based on block-transform coding, which is susceptible to blocking artifacts, particularly at low bit-rates. In this paper the authors address the problem for the first time, by introducing a specialized deferred filtering architecture working on block-compressed data and including a novel deblocking algorithm. The architecture efficiently performs high quality shading of massive datasets by closely coordinating visibility- and resolution-aware adaptive data loading with GPU-accelerated per-frame data decompression, deblocking, and rendering. A thorough evaluation including quantitative and qualitative measures demonstrates the performance of the approach on large static and dynamic datasets, including a massive 512^4 turbulence simulation (256 GB), which is aggressively compressed to less than 2 GB so that it can be uploaded in full to the graphics board and explored in real time during animation.
(Fabio Marton, José Antonio Iglesias Guitián, Jose Díaz and Enrico Gobbetti: “Real-time deblocked GPU rendering of compressed volumes”. Proc. 19th International Workshop on Vision, Modeling and Visualization (VMV), pp. 167-174, Oct. 2014. [WWW])
The introduction of general-purpose Graphics Processing Units (GPUs) is boosting scientific applications in Bioinformatics, Systems Biology, and Computational Biology. In these fields, the use of high-performance computing solutions is motivated by the need to perform large numbers of in silico analyses to study the behavior of biological systems in different conditions, which requires computing power that usually exceeds the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC), defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can achieve up to a 181× speedup compared to the corresponding sequential simulations.
(Cazzaniga P., Nobile M.S., Besozzi D., Bellini M., Mauri G.: “Massive exploration of perturbed conditions of the blood coagulation cascade through GPU parallelization”. BioMed Research International, vol. 2014. [DOI])
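As a rough illustration of what a one-dimensional parameter sweep over such a model looks like, here is a minimal Python sketch. It is not coagSODA and not the BCC model: the single-species toy equation, the parameter values, and the names `hill`, `simulate` and `sweep` are all assumptions made for this example, and a fixed-step fourth-order Runge-Kutta integrator stands in for LSODA. Each call to `simulate` is independent of the others, which is the property that makes this kind of sweep a natural fit for parallel execution.

```python
def hill(s, k_half, n):
    """Hill function: saturating, sigmoidal response to the stimulus s."""
    return s**n / (k_half**n + s**n)


def simulate(stimulus, v_max=1.0, k_half=0.5, n=2, decay=0.2,
             t_end=60.0, dt=0.01):
    """Integrate dx/dt = v_max * hill(stimulus) - decay * x from x(0) = 0.

    Production follows a Hill function, degradation follows first-order
    mass-action kinetics. Fixed-step RK4 is used here instead of LSODA.
    """
    production = v_max * hill(stimulus, k_half, n)

    def rhs(x):
        return production - decay * x

    x = 0.0
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * dt * k1)
        k3 = rhs(x + 0.5 * dt * k2)
        k4 = rhs(x + dt * k3)
        x += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


# One-dimensional sweep: one independent simulation per stimulus value.
sweep = {s: simulate(s) for s in [0.1, 0.5, 1.0, 2.0]}
for stimulus, final_x in sweep.items():
    print(f"stimulus={stimulus:4.1f}  x(t_end)={final_x:.4f}")
```

A bi-dimensional sweep would simply iterate over pairs of parameter values instead of a single list.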
The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage large data streams, especially in specific application domains such as critical infrastructure systems, sensor networks, log file analysis, search engines and, more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data streams, i.e., process data as it becomes available and provide an accurate response based solely on the part of the stream that has already been provided. Data retrieval techniques often require a traditional data storage and processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which evaluates how important a word is in a collection of documents and requires the whole dataset to be known a priori.
To address this problem, we propose an approximate version of the TF–IDF measure suitable for continuous data streams (such as the exchange of messages, tweets and sensor-based log files). The algorithm for the calculation of this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to meet the great computational power required to process massive data streams, we also present a parallel implementation of the approximate TF–IDF calculation using Graphics Processing Units (GPUs).
This implementation of the algorithm was tested on generated and real data streams and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level comparable to that of the exact TF–IDF measure.
(Ugo Erra, Sabrina Senatore, Fernando Minnella and Giuseppe Caggianese: “Approximate TF-IDF based on topic extraction from massive message stream using the GPU”, Information Sciences 292, pp.141-163, Feb. 2015. [DOI])
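To make concrete why exact TF–IDF needs the whole collection up front, here is a minimal Python sketch of one common variant of the measure (term frequency multiplied by log(N / document frequency)). This is the classic batch computation, not the approximate streaming algorithm proposed in the paper; the function name `tf_idf`, the whitespace tokenization and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents each term occurs.
    # This pass needs every document, which is what a stream cannot offer.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["gpu volume rendering", "gpu stream processing", "stream of tweets"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, "->", max(doc_scores, key=doc_scores.get))
```

Both `n_docs` and `doc_freq` change whenever a new document arrives, so every previously computed score becomes stale; a streaming variant has to approximate these quantities within bounded memory.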
A new book titled “Numerical Computations with GPUs” has been published:
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to reusable, adaptable and scalable code fragments. This book also serves as a GPU implementation manual for many numerical algorithms, sharing tips on GPUs that can increase application efficiency. The valuable insights into parallelization strategies for GPUs are supplemented by ready-to-use code fragments. Numerical Computations with GPUs targets professionals and researchers working in high performance computing and GPU programming. Advanced-level students focused on computer science and mathematics will also find this book useful as a secondary textbook or reference.
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, “On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])
We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has a hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory references according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst-case rendering performance compared with the implementations provided by the CUDA and OpenCL software development kits.
(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: “Improving Cache Locality for GPU-based Volume Rendering”. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])
Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm³ brain template in 4–6 s, and run a second-level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from GitHub: https://github.com/wanderine/BROCCOLI.
(A. Eklund, P. Dufort, M. Villani and S. LaConte: “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”. Front. Neuroinform. 8:24, 2014. [DOI])
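For readers unfamiliar with permutation tests, the following minimal Python sketch shows the basic idea on two small groups of numbers: the null distribution is estimated by repeatedly shuffling the group labels and recomputing the statistic. It is a generic CPU illustration, not BROCCOLI's implementation; the function name `permutation_test`, the choice of statistic (absolute difference of means) and the sample values are assumptions made for this example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to relabelling them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Count the observed labelling itself so the p-value is never zero.
    return (extreme + 1) / (n_permutations + 1)


patients = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
controls = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
p_value = permutation_test(patients, controls, n_permutations=2000)
print(f"p = {p_value:.4f}")
```

Every permutation is independent of the others, and in fMRI the statistic is recomputed for a whole brain volume each time, which is why the method is both expensive and well suited to parallel hardware.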
Frequent itemset mining (FIM) is a core area of many data mining applications, such as association rule computation, clustering and correlation analysis, and has been comprehensively studied over the last decades. Furthermore, databases are steadily growing larger, thus requiring more computing power to mine them in reasonable time. At the same time, improvements in high performance computing platforms are transforming them into massively parallel environments equipped with multi-core processors, such as GPUs. Hence, fully exploiting these systems to perform itemset mining poses a challenging and critical problem that has been addressed by various researchers. We present a survey of multi-core and GPU-accelerated parallelizations of FIM algorithms.
(Dharmesh Bhalodiya and Chhaya Patel: “Comparative Study of Frequent Itemset Mining Techniques on Graphics Processor”. International Journal of Engineering Research and Applications 4(4):159-163, April 2014. [PDF])
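As a reminder of what frequent itemset mining computes, here is a minimal sequential Python sketch in the style of the classic Apriori level-wise search. The abstract does not say which algorithms the survey covers, so treat the choice of Apriori, the function name `frequent_itemsets` and the toy transactions as assumptions for illustration. The support-counting step, which scans every transaction for every candidate, is the data-parallel part that accelerated implementations typically target.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support count} for itemsets of up to max_size items."""
    transactions = [frozenset(t) for t in transactions]
    result = {}

    # Level 1 candidates: every distinct item on its own.
    candidates = {frozenset([item]) for t in transactions for item in t}
    size = 1
    while candidates and size <= max_size:
        # Support counting: number of transactions containing each candidate.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)

        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        size += 1
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(subset) in frequent
                for subset in combinations(union, size - 1)
            ):
                candidates.add(union)
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(found.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```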
Spectral unmixing is an important task in remotely sensed hyperspectral data exploitation. The linear mixture model has been widely used to unmix hyperspectral images by identifying a set of pure spectral signatures, called endmembers, and estimating their respective abundances in each pixel of the scene. Several algorithms have been proposed in the recent literature to automatically identify endmembers, even if the original hyperspectral scene does not contain any pure signatures. A popular strategy for endmember identification in highly mixed hyperspectral scenes has been the minimum volume simplex analysis (MVSA), known to be a computationally very expensive algorithm. This algorithm calculates the minimum volume enclosing simplex, as opposed to other algorithms that perform maximum simplex volume analysis (MSVA). The high computational complexity of MVSA, together with its very high memory requirements, has limited its adoption in the hyperspectral imaging community. In this paper we develop several optimizations to the MVSA algorithm. The main computational task of MVSA is the solution of a quadratic optimization problem with equality and inequality constraints, with the inequality constraints being in the order of the number of pixels multiplied by the number of endmembers. As a result, storing and computing the inequality constraint matrix is highly inefficient. The first optimization presented in this paper uses algebra operations in order to reduce the memory requirements of the algorithm. In the second optimization, we use graphics processing units (GPUs) to effectively solve (in parallel) the quadratic optimization problem involved in the computation of MVSA. In the third optimization, we extend the single GPU implementation to a multi-GPU one, developing a hybrid strategy that distributes the computation while taking advantage of GPU accelerators at each node. 
The presented optimizations are tested in different analysis scenarios (using both synthetic and real hyperspectral data) and shown to provide state-of-the-art results from the viewpoint of unmixing accuracy and computational performance. The speedup achieved using the full GPU cluster compared to the CPU implementation is tenfold on a real hyperspectral image.
(A. Agathos, J. Li, D. Petcu and A. Plaza: “Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing”. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, accepted for publication, 2014. [PDF])
LDPC decoding is known to be compute-intensive. This kind of digital communication application has recently been implemented on GPU devices for LDPC code performance estimation and/or real-time measurements. Overall, previous studies of LDPC decoding on GPUs were based on the flooding-based decoding algorithm, which provides massive computation parallelism. More efficient layered schedules have been proposed in the literature, because a decoder iteration can be split into sub-layer iterations. These schedules seem to fit GPU devices poorly due to restricted computation parallelism and complex memory access patterns. However, layered schedules make decoding converge twice as fast. In this letter, we show that (a) a layered schedule can be efficiently implemented on a GPU device and (b) this approach, implemented on a low-cost GPU device, provides higher throughput with identical correction performance (BER) compared to previously published results.
(B. Le Gal, C. Jégo and J. Crenne: “An high-throughput efficiency approach for GPU-based LDPC decoding”. IEEE Embedded System Letters, March 2014. [DOI])