Since 2011, the most powerful supercomputers systems ranked in the Top500 list have been hybrid systems composed of thousands of nodes that includes CPUs and accelerators, as Xeon Phi and GPUs. Programming and deploying applications on those systems is still a challenge due to complexity of the system and the need to mix several programming interfaces (MPI, CUDA, Intel Xeon Phi) in the same application. This special issue of the *International Journal of Computers & Electrical Engineering* is aimed at exploring the state of the art of developing applications in accelerated massive HPC architectures, including practical issues of hybrid usage models with MPI, OpenMP, and other accelerators programming models. The idea is to publish novel work on the use of available programming interfaces (MPI, CUDA, Intel Xeon Phi) and tools for code development, application performance optimizations, application deployment on accelerated systems, as well as the advantages and limitations of accelerated HPC systems. Experiences with real-world applications, including scientific computing, numerical simulations, healthcare, energy, data-analysis, etc. are also encouraged.

## CfP: Optimization of Parallel Scientific Applications with Accelerated HPC

October 29th, 2014## CfP: GPGPU 2015

October 29th, 2014The goal of this workshop is to provide a forum to discuss new and emerging general-purpose purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s work is particularly interested on new heterogeneous GPU platforms, new forms of concurrency, and novel/irregular applications that can leverage these platforms. Papers are being sought on many aspects of GPUs, including (but not limited to): Read the rest of this entry »

## Approximate TF–IDF based on topic extraction from massive message stream using the GPU

October 16th, 2014Abstract:

The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage the large amounts of data stream, especially in specific domains of application such as critical infrastructure systems, sensor networks, log file analysis, search engines and more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data stream, i.e., process data as it becomes available and provide an accurate response, based solely on the data stream that has already been provided. Data retrieval techniques often require traditional data storage and processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which can evaluate how important a word is in a collection of documents and requires to a priori know the whole dataset.

To address this problem, we propose an approximate version of the TF–IDF measure suitable to work on continuous data stream (such as the exchange of messages, tweets and sensor-based log files). The algorithm for the calculation of this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to face the great computational power required to process massive data stream, we present also a parallel implementation of the approximate TF–IDF calculation using Graphical Processing Units (GPUs).

This implementation of the algorithm was tested on generated and real data stream and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level that is comparable to the solution of the precise TF–IDF measure.

(Ugo Erra, Sabrina Senatore, Fernando Minnella and Giuseppe Caggianese: *“Approximate TF-IDF based on topic extraction from massive message stream using the GPU”*, Information Sciences 292, pp.141-163, Feb. 2015. [DOI])

## CfP: 23rd High Performance Computing Symposium

September 5th, 2014The 23rd High Performance Computing Symposium (April 12-15, 2015 in Alexandria, VA, USA) is devoted to the impact of high performance computing and communications on computer simulations. Topics of interest include:

- GPU for general purpose computations (GPGPU)
- Hybrid system modeling and simulation
- Tools and environments for coupling parallel codes
- Parallel algorithms and architectures
- High performance software tools

Submission deadline for full papers: November 22, 2014. More information can be found at http://hosting.cs.vt.edu/hpc2015.

## Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP

August 20th, 2014Abstract:

In the course of less than a decade, Graphics Processing Units (GPUs) have evolved from narrowly scoped application specific accelerators to general-purpose parallel machines capable of accommodating an ever-growing set of algorithms. At the same time, programming GPUs appears to have become trapped around an attractor characterised by ad-hoc practices, non-portable implementations and inexact, uninformative performance reporting. The purpose of this paper is two-fold, on one hand pursuing an in-depth look at GPU hardware and its characteristics, and on the other demonstrating that portable, generic, mathematically grounded programming of these machines is possible and desirable. An agent-based meta-heuristic, the Max-Min Ant System (MMAS), provides the context. The major contributions brought about by this article are the following: (1) an optimal, portable, generic-algorithm based MMAS implementation is derived; (2) an in-depth analysis of AMD’s Graphics Core Next (GCN) GPU and the C++ AMP programming model is supplied; (3) a more robust approach to performance reporting is presented; (4) novel techniques for raising the abstraction level without sacrificing performance are employed. This represents the first implementation of an algorithm from the Ant Colony Optimisation (ACO) family using C++ AMP, whilst at the same time being one of the first uses of the latter programming environment.

(A. Voicu: *“Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP ”*. International Journal of Computer Applications 100(6):21-30, August 2014. [DOI])

## Call for Contributions: GPU algorithms for image processing and computer vision

July 22nd, 2014*“GPU Algorithms for Image Processing and Computer Vision”*, to be published by Springer, will contain a collection of articles on fundamental image processing and computer vision methods adapted for Graphics Processing Units (GPUs). In recent years, substantial efforts were undertaken to adapt many such algorithms for massively-parallel GPU-based systems. The book is envisioned as a consolidation of such work into a single volume covering widely used methods and techniques. Each chapter will be written by authors working on a specific group of methods. It will provide mathematical background, parallel algorithm, and implementation details leading to reusable, adaptable, and scalable code fragments. The book will serve as a GPU implementation manual for many image processing and analysis algorithms providing valuable insights into parallelization strategies for GPUs as well as ready-to-use code fragments with a broad appeal to both developers and researchers interested in GPU computing. Read the rest of this entry »

## On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications

June 8th, 2014Abstract:

Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, emote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.

(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, *“On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”*. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])

## Improving Cache Locality for GPU-based Volume Rendering

June 8th, 2014Abstract:

We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.

(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: *“Improving Cache Locality for GPU-based Volume Rendering”*. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])

## CfP: GPU in High Energy Physics 2014

June 3rd, 2014The conference focuses on the application of GPUs in High Energy Physics (HEP), expanding on the trend of previous workshops on the topic and pointing to establishing a recurrent series. The emerging paradigm of the use of graphic processors as powerful accelerators in data- and computation-intensive applications found fertile ground in the computing challenges of the HEP community and is currently object of active investigations. This follows a long established trend which sees the increased use of cheap off-the-shelf commercial units to achieve unprecedented performances in parallel data processing, thus leveraging on a very strong commitment of hardware producers to the huge market of computer graphics and games. These hardware advances comes together with the continuous development of proprietary and free software to expose the raw computing power of GPUs for general-purpose applications and scientific computing in particular. All different applications of massively parallel computing in HEP will be addressed, from computational speed-ups in online and offline data selection and analysis to hard real-time applications in low-level triggering, to MonteCarlo simulations for lattice QCD. Both current activities and plans for foreseen experiments and projects will be discussed, together with perspectives on the evolution of the hardware and software.

The conference is held in Pisa (Italy), 10.9.2014 – 12.9.2014. More information: http://www.pi.infn.it/gpu2014

## BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs

May 27th, 2014Abstract:

Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm3 brain template in 4–6 s, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from github: https://github.com/wanderine/BROCCOLI.

(A. Eklund, P. Dufort, M. Villani and S. LaConte: *“BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”*. Front. Neuroinform. 8:24, 2014. [DOI])