The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage the large amounts of data stream, especially in specific domains of application such as critical infrastructure systems, sensor networks, log file analysis, search engines and more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data stream, i.e., process data as it becomes available and provide an accurate response, based solely on the data stream that has already been provided. Data retrieval techniques often require traditional data storage and processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which can evaluate how important a word is in a collection of documents and requires to a priori know the whole dataset.
To address this problem, we propose an approximate version of the TF–IDF measure suitable to work on continuous data stream (such as the exchange of messages, tweets and sensor-based log files). The algorithm for the calculation of this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to face the great computational power required to process massive data stream, we present also a parallel implementation of the approximate TF–IDF calculation using Graphical Processing Units (GPUs).
This implementation of the algorithm was tested on generated and real data stream and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level that is comparable to the solution of the precise TF–IDF measure.
(Ugo Erra, Sabrina Senatore, Fernando Minnella and Giuseppe Caggianese: “Approximate TF-IDF based on topic extraction from massive message stream using the GPU”, Information Sciences 292, pp.141-163, Feb. 2015. [DOI])
A new book titled “Numerical Computations with GPUs” has been published:
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to reusable, adaptable and scalable code fragments. This book also serves as a GPU implementation manual for many numerical algorithms, sharing tips on GPUs that can increase application efficiency. The valuable insights into parallelization strategies for GPUs are supplemented by ready-to-use code fragments. Numerical Computations with GPUs targets professionals and researchers working in high performance computing and GPU programming. Advanced-level students focused on computer science and mathematics will also find this book useful as secondary text book or reference.
From the table of contents: Read the rest of this entry »
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, emote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, “On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])
We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.
(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: “Improving Cache Locality for GPU-based Volume Rendering”. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])
Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm3 brain template in 4–6 s, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from github: https://github.com/wanderine/BROCCOLI.
(A. Eklund, P. Dufort, M. Villani and S. LaConte: “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”. Front. Neuroinform. 8:24, 2014. [DOI])
Frequent itemset mining (FIM) is a core area for many data mining applications as association rules computation, clustering and correlations, which has been comprehensively studied over the last decades. Furthermore, databases are becoming gradually larger, thus requiring a higher computing power to mine them in reasonable time. At the same time, the improvements in high performance computing platforms are transforming them into massively parallel environments equipped with multi-core processors, such as GPUs. Hence, fully operating these systems to perform itemset mining poses as a challenging and critical problems that addressed by various researcher. We present survey of multi-core and GPU accelerated parallelization of the FIM algorithms.
(Dharmesh Bhalodiya and Chhaya patel: “Comparative Study of Frequent Itemset Mining Techniques on Graphics Processor”. International Journal of Engineering Research and Applications 4(4):159-163, April 2014. [PDF])
Spectral unmixing is an important task in remotely sensed hyperspectral data exploitation. The linear mixture model has been widely used to unmix hyperspectral images by identifying a set of pure spectral signatures, called endmembers, and estimating their respective abundances in each pixel of the scene. Several algorithms have been proposed in the recent literature to automatically identify endmembers, even if the original hyperspectral scene does not contain any pure signatures. A popular strategy for endmember identification in highly mixed hyperspectral scenes has been the minimum volume simplex analysis (MVSA), known to be a computationally very expensive algorithm. This algorithm calculates the minimum volume enclosing simplex, as opposed to other algorithms that perform maximum simplex volume analysis (MSVA). The high computational complexity of MVSA, together with its very high memory requirements, has limited its adoption in the hyperspectral imaging community. In this paper we develop several optimizations to the MVSA algorithm. The main computational task of MVSA is the solution of a quadratic optimization problem with equality and inequality constraints, with the inequality constraints being in the order of the number of pixels multiplied by the number of endmembers. As a result, storing and computing the inequality constraint matrix is highly inefficient. The first optimization presented in this paper uses algebra operations in order to reduce the memory requirements of the algorithm. In the second optimization, we use graphics processing units (GPUs) to effectively solve (in parallel) the quadratic optimization problem involved in the computation of MVSA. In the third optimization, we extend the single GPU implementation to a multi-GPU one, developing a hybrid strategy that distributes the computation while taking advantage of GPU accelerators at each node. The presented optimizations are tested in different analysis scenarios (using both synthetic and real hyperspectral data) and shown to provide state-of-the-art results from the viewpoint of unmixing accuracy and computational performance. The speedup achieved using the full GPU cluster compared to the CPU implementation in tenfold in a real hyperspectral image.
(A. Agathos, J. Li, D. Petcu and A. Plaza: “Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing”. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, accepted for publication , 2014. [PDF] )
LDPC decoding process is known as compute intensive. This kind of digital communication applications was recently implemented onto GPU devices for LDPC code performance estimation and/or for real-time measurements. Overall previous studies about LDPC decoding on GPU were based on the implementation of the flooding-based decoding algorithm that provides massive computation parallelism. More efficient layered schedules were proposed in literature because decoder iteration can be split into sub-layer iterations. These schedules seem to badly fit onto GPU devices due to restricted computation parallelism and complex memory access patterns. However, the layered schedules enable the decoding convergence to speed up by two. In this letter, we show that (a) layered schedule can be efficiently implemented onto a GPU device (b) this approach – implemented onto a low-cost GPU device – provides higher throughputs with identical correction performances (BER) compared to previously published results.
(B. Le Gal, C. Jégo and J. Crenne: “An high-throughput efficiency approach for GPU-based LDPC decoding”. IEEE Embedded System Letters, March 2014. [DOI])
We describe a new algorithm for solving the all-pairs shortest-path (APSP) problem for planar graphs and graphs with small separators that exploits the massive on-chip parallelism available in today’s Graphics Processing Units (GPUs). Our algorithm, based on the Floyd-Warshall algorithm, has near optimal complexity in terms of the total number of operations, while its matrix-based structure is regular enough to allow for efficient parallel implementation on the GPUs. By applying a divide-and-conquer approach, we are able to make use of multi-node GPU clusters, resulting in more than an order of magnitude speedup over the fastest known Dijkstra-based GPU implementation and a two-fold speedup over a parallel Dijkstra-based CPU implementation.
(Hristo Djidjev, Sunil Thulasidasan, Guillaume Chapuis, Rumen Andonov and Dominique Lavenier: “Efficient Multi-GPU Computation of All-Pairs Shortest Paths”. To appear in IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2014. [PDF])
A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore there are distortions that have to be corrected. In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A CPU-based component prepares data for the GPU which is multi-threaded to ensure maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system do work in parallel. The imaging tool is designed such that disk I/O and work on CPU and GPUs is done concurrently. Test cases show that the imaging tool performs nearly 100× faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.
(Daniel Muscat: “High-Performance Image Synthesis for Radio Interferometry”. Preprint, 2014. [arXiv])