March 5th, 2014
March 5th, 2014
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
February 17th, 2014
Petascale supercomputers create new opportunities for the study of the structure and function of large biomolecular complexes such as viruses and photosynthetic organelles, permitting all-atom molecular dynamics simulations of tens to hundreds of millions of atoms. Together with simulation and analysis, visualization provides researchers with a powerful “computational microscope”. Petascale molecular dynamics simulations produce tens to hundreds of terabytes of data that can be impractical to transfer to remote facilities, making it necessary to perform visualization and analysis tasks in-place on the supercomputer where the data are generated. We describe the adaptation of key visualization features of VMD, a widely used molecular visualization and analysis tool, for GPU-accelerated petascale computers. We discuss early experiences adapting ray tracing algorithms for GPUs, and compare rendering performance for recent petascale molecular simulation test cases on Cray XE6 (CPU-only) and XK7 (GPU-accelerated) compute nodes. Finally, we highlight opportunities for further algorithmic improvements and optimizations.
(John E. Stone, Kirby L. Vandivort, and Klaus Schulten: “GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms”. UltraVis’13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013. [DOI])
February 17th, 2014
High-Performance Graphics is the leading international forum for performance-oriented graphics and imaging systems research including innovative algorithms, efficient implementations, languages, parallelism, compilers, parallelism, hardware and architectures for high-performance graphics. High-Performance Graphics was founded in 2009, synthesizing multiple conferences to bring together researchers, engineers, and architects to discuss the complex interactions of parallel hardware, novel programming models, and efficient algorithms in the design of systems for current and future graphics and visual computing applications.
HPC is co-located with EGSR, in Lyon, France, June 23-25, 2014. More information: http://www.highperformancegraphics.org/2014
February 2nd, 2014
High performance of modern Graphics Processing Units may be utilized not only for graphics related application but also for general computing. This computing power has been utilized in new variants of many algorithms from almost every computer science domain. Unfortunately, while other application domains strongly benefit from utilizing the GPUs, databases related applications seem not to get enough attention. The main goal of GPUs in Databases workshop is to fill this gap. This event is devoted to sharing the knowledge related to applying GPUs in Database environments and to discuss possible future development of this application domain.
ADBIS workshop on GPUs In Databases GID 2014, September 7th, 2014, Ohrid, Republic of Macedonia. More information: http://gid.us.to
November 20th, 2013
OpenCLIPP is a library providing processing primitives (image processing primitives in the first version) implemented with OpenCL for fast execution on dedicated computing devices like GPUs. Two interfaces are provided: C (similar to the Intel IPP and NVIDIA NPP libraries) and C++. OpenCLIPP is free for personal and commercial use. It can be downloaded from GitHub.
M. Akhloufi, A. Campagna, “OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications”, Proc. SPIE Electronic Imaging 2014, Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, P. 9025-31, February 2014.
October 19th, 2013
GPGPU-7 (Seventh Workshop on General Purpose Processing Using GPUs) is held in conjunction with ASPLOS in Salt Lake City, Utah, on March 1st, 2014. The goal of this workshop is to provide a forum to discuss new and emerging general-purpose purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms.
This year’s work is particularly interested on new heterogeneous GPU platforms. Read the rest of this entry »
October 19th, 2013
We present a GPU-based streaming algorithm to perform high-resolution and accurate cloth simulation. We map all the components of cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating to GPU-based kernels and data structures. Our algorithm perform intra-object and interobject collisions, handles contacts and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many-core GPUs. In practice, our algorithm can perform high-fidelity simulation on a cloth mesh with 2M triangles using 3GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high-end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement as compared to a single-threaded CPU-based algorithm, and about one order of magnitude improvement over a 16-core CPUbased parallel implementation.
(Min Tang, Roufeng Tong, Rahul Narain, Chang Meng and Dinesh Manocha: “A GPU-based Streaming Algorithm for High-Resolution Cloth Simulation”, in the Proceedings of Pacific Graphics 2013. [WWW])
October 19th, 2013
The computational investigation of a biological system often requires the execution of a large number of simulations to analyze its dynamics, and to derive useful knowledge on its behavior under physiological and perturbed conditions. This analysis usually turns out into very high computational costs when simulations are run on central processing units (CPUs), therefore demanding a shift to the use of high-performance processors. In this work we present a simulator of biological systems, called cupSODA, which exploits the higher memory bandwidth and computational capability of graphics processing units (GPUs). This software allows to execute parallel simulations of the dynamics of biological systems, by first deriving a set of ordinary differential equations from reaction-based mechanistic models defined according to the mass-action kinetics, and then exploiting the numerical integration algorithm LSODA. We show that cupSODA can achieve a 112× speedup on GPUs with respect to equivalent executions of LSODA on CPUs.
(Nobile M.S., Besozzi D., Cazzaniga P., Mauri G., Pescini D.: “cupSODA: a CUDA-Powered Simulator of Mass-action Kinetics”, In 12th International Conference on Parallel Computing Technologies (PaCT), Lecture Notes in Computer Science, volume 7979, pp. 344-357, 2013. [DOI])
October 7th, 2013
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU and GPU. In the new 0.4.0 version, the library provides also a backend for Xeon Phi (MIC). With this new version, various performance benchmarks based on vector-vector routines, sparse matrix-vector multiplication and CG method on different backends have been released: OpenMP/CUDA/OpenCL- NVIDIA GPU, AMD GPU, CPU and Xeon Phi. More information: http://www.paralution.com/benchmarks/
The use of GPUs to accelerate general-purpose scientific and engineering applications is mainstream today, but their adoption in current high-performance computing clusters is impaired primarily by acquisition costs and power consumption. Therefore, the benefits of sharing a reduced number of GPUs among all the nodes of a cluster can be remarkable for many applications. This approach, usually referred to as remote GPU virtualization, aims at reducing the number of GPUs present in a cluster, while increasing their utilization rate. The performance of the interconnection network is key to achieving reasonable performance results by means of remote GPU virtualization. To this end, several networking technologies with throughput comparable to that of PCI Express have appeared recently. In this paper we analyze the influence of InfiniBand FDR on the performance of remote GPU virtualization, comparing its impact on a variety of GPU-accelerated applications with other networking technologies, such as InfiniBand QDR and Gigabit Ethernet. Given the severe limitations of freely available remote GPU virtualization solutions, the rCUDA framework is used as the case study for this analysis. Results show that the new FDR interconnect, featuring higher bandwidth than its predecessors, allows the reduction of the overhead of using GPUs remotely, thus making this approach even more appealing.
(Carlos Reano, Rafael Mayo, Enrique S. Quintana-Ortí, Federico Silla, José Duato and Antonio J. Pena: “Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization”. Proceedings of the IEEE Cluster 2013 Conference, Indianapolis, USA, September 2013. [PDF])
Page 1 of 5512345...102030...»Last »