One of the keys to achieving maximum performance in CUDA is taking advantage of the various memory spaces. Part II of Acceleware’s tutorial has now been published. The tutorial uses a simple encryption kernel to test and compare read-only cache, constant cache and global memory. Read the full tutorial…
2014 International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO 2014)
7-9 April, 2014. Granada (SPAIN). Special Session: High Performance Computing in Bioinformatics
The goal of this special session is to explore the use of emerging parallel computing architectures as well as High Performance Computing systems (Supercomputers, Clusters, Grids) for the simulation of relevant biological systems and for applications in Bioinformatics, Computational Biology and Computational Chemistry. We welcome papers, not submitted elsewhere for review, with a focus on topics of interest including, but not limited to: Read the rest of this entry »
A free webinar on accelerating face-in-the-crowd recognition with GPU technology will be held on November 5th. It teaches how GPUs can be used to accelerate the detection and recognition of faces in a crowd. The presentation will also cover the speakers’ use of the ROS, OpenCV, OpenMP, and Armadillo libraries to develop fast, reliable distributed video processing code. To register, follow the link: https://www2.gotomeeting.com/register/292953058
We present a GPU-based streaming algorithm to perform high-resolution and accurate cloth simulation. We map all the components of the cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating, to GPU-based kernels and data structures. Our algorithm handles intra-object and inter-object collisions, contacts, and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many-core GPUs. In practice, our algorithm can perform high-fidelity simulation on a cloth mesh with 2M triangles using 3GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high-end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement as compared to a single-threaded CPU-based algorithm, and about one order of magnitude improvement over a 16-core CPU-based parallel implementation.
(Min Tang, Ruofeng Tong, Rahul Narain, Chang Meng and Dinesh Manocha: “A GPU-based Streaming Algorithm for High-Resolution Cloth Simulation”, in the Proceedings of Pacific Graphics 2013. [WWW])
The computational investigation of a biological system often requires the execution of a large number of simulations to analyze its dynamics, and to derive useful knowledge on its behavior under physiological and perturbed conditions. This analysis usually incurs very high computational costs when simulations are run on central processing units (CPUs), therefore demanding a shift to the use of high-performance processors. In this work we present a simulator of biological systems, called cupSODA, which exploits the higher memory bandwidth and computational capability of graphics processing units (GPUs). This software executes parallel simulations of the dynamics of biological systems by first deriving a set of ordinary differential equations from reaction-based mechanistic models defined according to mass-action kinetics, and then integrating them with the numerical integration algorithm LSODA. We show that cupSODA can achieve a 112× speedup on GPUs with respect to equivalent executions of LSODA on CPUs.
(Nobile M.S., Besozzi D., Cazzaniga P., Mauri G., Pescini D.: “cupSODA: a CUDA-Powered Simulator of Mass-action Kinetics”, In 12th International Conference on Parallel Computing Technologies (PaCT), Lecture Notes in Computer Science, volume 7979, pp. 344-357, 2013. [DOI])
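The step the abstract describes, turning a reaction-based model into ordinary differential equations under mass-action kinetics and then running many independent simulations, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reversible binding network, its rate constants, and the function names are hypothetical, and a fixed-step fourth-order Runge-Kutta integrator stands in for LSODA, which cupSODA actually uses.

```python
# Sketch: derive mass-action ODEs from a reaction list and integrate them.
# The reaction network, rate constants, and RK4 integrator are illustrative
# assumptions; cupSODA itself relies on LSODA running on the GPU.

def mass_action_rhs(reactions, x):
    """Return dx/dt for concentrations x under mass-action kinetics.

    Each reaction is (rate_constant, reactants, products), where reactants
    and products map species index -> stoichiometric coefficient.
    """
    dxdt = [0.0] * len(x)
    for k, reactants, products in reactions:
        # Reaction flux: k times the product of reactant concentrations
        # raised to their stoichiometric coefficients.
        flux = k
        for species, coeff in reactants.items():
            flux *= x[species] ** coeff
        for species, coeff in reactants.items():
            dxdt[species] -= coeff * flux
        for species, coeff in products.items():
            dxdt[species] += coeff * flux
    return dxdt


def rk4_step(reactions, x, dt):
    """Advance x by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return [b + h * s for b, s in zip(base, slope)]

    k1 = mass_action_rhs(reactions, x)
    k2 = mass_action_rhs(reactions, shifted(x, k1, dt / 2))
    k3 = mass_action_rhs(reactions, shifted(x, k2, dt / 2))
    k4 = mass_action_rhs(reactions, shifted(x, k3, dt))
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(reactions, x0, dt, steps):
    """Integrate from x0 for the given number of fixed-size steps."""
    x = list(x0)
    for _ in range(steps):
        x = rk4_step(reactions, x, dt)
    return x


# Hypothetical reversible binding A + B <-> C (species indices 0, 1, 2).
REACTIONS = [
    (2.0, {0: 1, 1: 1}, {2: 1}),   # A + B -> C
    (1.0, {2: 1}, {0: 1, 1: 1}),   # C -> A + B
]

# Each initial condition is an independent simulation; this is the loop
# that a GPU-based simulator would run in parallel.
initial_conditions = [[1.0, 1.0, 0.0], [2.0, 1.0, 0.0], [0.5, 0.5, 1.0]]
results = [simulate(REACTIONS, x0, dt=0.01, steps=1000)
           for x0 in initial_conditions]
```

Because the simulations share the same equations and differ only in their initial state, they have no data dependencies on one another, which is what makes this workload a natural fit for one-simulation-per-thread execution.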
PARALUTION is a library for sparse iterative methods that can run on various parallel devices, including multi-core CPUs and GPUs. In the new 0.4.0 version, the library also provides a backend for Xeon Phi (MIC). Alongside this version, performance benchmarks for vector-vector routines, sparse matrix-vector multiplication, and the CG method have been released for the different backends (OpenMP, CUDA, OpenCL) on NVIDIA GPUs, AMD GPUs, CPUs, and Xeon Phi. More information: http://www.paralution.com/benchmarks/
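To make the benchmarked building blocks concrete, here is a minimal pure-Python sketch of the conjugate gradient (CG) method composed from a sparse matrix-vector product in compressed sparse row (CSR) form and simple vector-vector routines. It does not use or imitate the PARALUTION API; the function names and the 1D Laplacian test matrix are assumptions chosen for illustration.

```python
# Sketch: conjugate gradient built from a CSR sparse matrix-vector product
# and vector-vector routines. Pure Python for illustration only; this is
# not the PARALUTION API.

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix stored in compressed sparse row form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Returns (x, iterations).
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return x, it
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)   # p = r + beta * p
        rr = rr_new
    return x, max_iter


def laplacian_1d_csr(n):
    """Build the tridiagonal [-1, 2, -1] test matrix in CSR form."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


n = 50
A = laplacian_1d_csr(n)
b = [1.0] * n
x, iterations = conjugate_gradient(*A, b)
residual = axpy(-1.0, csr_matvec(*A, x), b)   # b - A x
```

Each CG iteration costs one sparse matrix-vector product, two dot products, and three vector updates, which is why benchmarks of those primitives are a reasonable proxy for solver performance on a given backend.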
The use of GPUs to accelerate general-purpose scientific and engineering applications is mainstream today, but their adoption in current high-performance computing clusters is impaired primarily by acquisition costs and power consumption. Therefore, the benefits of sharing a reduced number of GPUs among all the nodes of a cluster can be remarkable for many applications. This approach, usually referred to as remote GPU virtualization, aims at reducing the number of GPUs present in a cluster, while increasing their utilization rate. The performance of the interconnection network is key to achieving reasonable performance results by means of remote GPU virtualization. To this end, several networking technologies with throughput comparable to that of PCI Express have appeared recently. In this paper we analyze the influence of InfiniBand FDR on the performance of remote GPU virtualization, comparing its impact on a variety of GPU-accelerated applications with other networking technologies, such as InfiniBand QDR and Gigabit Ethernet. Given the severe limitations of freely available remote GPU virtualization solutions, the rCUDA framework is used as the case study for this analysis. Results show that the new FDR interconnect, featuring higher bandwidth than its predecessors, allows the reduction of the overhead of using GPUs remotely, thus making this approach even more appealing.
(Carlos Reano, Rafael Mayo, Enrique S. Quintana-Ortí, Federico Silla, José Duato and Antonio J. Pena: “Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization”. Proceedings of the IEEE Cluster 2013 Conference, Indianapolis, USA, September 2013. [PDF])
This paper presents an accelerated version of a copy-move image forgery detection scheme on Graphics Processing Units (GPUs). With the replacement of analog cameras by their digital counterparts and the availability of powerful image processing software packages, authentication of digital images has gained importance in the recent past. This paper focuses on improving the performance of a copy-move forgery detection scheme based on radix sort by porting it to GPUs. The scheme is much more efficient than other methods, without degradation of detection results. The CPU version of the radix-sort based detection scheme was developed in Matlab, and critical sections of the CPU version were coded in C using Matlab’s Mex interface to get the maximum performance. The GPU version was developed using the Jacket GPU Engine for Matlab and performs over twelve times faster than its optimized CPU variant. The contribution this paper makes towards blind image forensics is the use of integral images for computing feature vectors of overlapping blocks in the block-matching technique, and acceleration of the entire copy-move forgery detection scheme on GPUs, which are not found in the literature.
(Jaideep Singh and Balasubramanian Raman, “A High Performance Copy-Move Image Forgery Detection Scheme on GPU”, Advances in Intelligent and Soft Computing Volume 131, 2012, pp 239-246, Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011). [DOI])
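The integral-image idea mentioned in the abstract can be illustrated with a short Python sketch: once a summed-area table is built, the sum over any block takes four lookups, so a feature for every overlapping block costs constant time per block. The abstract does not specify the paper's feature vectors, so the block mean used here, the tiny 4×4 image, and the function names are all assumptions made for illustration.

```python
# Sketch: summed-area table (integral image) and constant-time block sums.
# The 4x4 image, the block size, and the use of the block mean as a feature
# are illustrative assumptions; the paper's actual features are not given
# in the abstract.

def integral_image(img):
    """Return table S where S[r][c] is the sum of img rows < r and cols < c."""
    rows, cols = len(img), len(img[0])
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            S[r + 1][c + 1] = S[r][c + 1] + row_sum
    return S


def block_sum(S, top, left, size):
    """Sum of the size x size block with top-left corner (top, left), in O(1)."""
    bottom, right = top + size, left + size
    return S[bottom][right] - S[top][right] - S[bottom][left] + S[top][left]


def block_means(img, size):
    """Mean intensity of every overlapping size x size block, keyed by (top, left)."""
    S = integral_image(img)
    rows, cols = len(img), len(img[0])
    return {
        (top, left): block_sum(S, top, left, size) / (size * size)
        for top in range(rows - size + 1)
        for left in range(cols - size + 1)
    }


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
means = block_means(image, 2)
# The blocks at (0, 0) and (2, 0) have identical content, so their
# features match, which is the signal a copy-move detector looks for.
```

In a block-matching detector, sorting the per-block features (for example with radix sort, as in the scheme described above) places blocks with equal or similar features next to each other, so candidate duplicated regions can be found without comparing every pair of blocks.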
The GPGPU.org editors (Mark Harris and Dominik Goeddeke) were excited to notice that our previous post was the 1000th post on GPGPU.org! GPGPU continues to grow as fast as ever, and we’re excited to see what it brings in the future.
We’d like to use this opportunity to remind you that GPGPU.org relies on our readers to submit news. So please, visit http://gpgpu.org/submit-news/ and tell us about your project, publications, or the work of others you think we should share.
Thanks for an amazing 11 years and hundreds of news submissions!
Numerical Computations with GPUs, to be published by Springer, will contain a collection of articles on core numerical methods adapted for Graphics Processing Units (GPUs). Classical numerical methods (solution of linear equations, FFT, etc.) are at the core of many scientific and engineering computations. In recent years, substantial efforts have been made to adapt these methods to recently emerged GPU-based systems. The book is envisioned as a consolidation of such work into a single volume covering widely used methods and techniques. Each chapter will provide the mathematical background, parallel algorithm, and implementation details leading to reusable, adaptable, and scalable code fragments. Each chapter will be accompanied by basic CUDA or OpenCL source code that readers can use as a starting point for adaptation in their applications. The book will serve as a GPU implementation manual for many numerical algorithms, providing valuable insights into parallelization strategies for GPUs as well as ready-to-use code fragments with a broad appeal to both developers and researchers interested in GPU computing.
Authors interested in contributing to this volume are asked to submit a short proposal via EasyChair (https://www.easychair.org/conferences/?conf=ncgpu14) by October 15, 2013. Authors of the accepted/invited chapters are expected to write and submit to the editor completed chapters by January 31, 2014. For more details see full solicitation (http://www.ncsa.illinois.edu/~kindr/editorial/ncgpu/solicitation.pdf) or contact the Editor at firstname.lastname@example.org.