October 2nd, 2011
Abstract:

In this paper, we propose a fine-grained cycle sharing (FGCS) system capable of exploiting idle graphics processing units (GPUs) for accelerating sequence homology search in local area network environments. Our system exploits short idle periods on GPUs by running small parts of guest programs such that each part can be completed within hundreds of milliseconds. To detect such short idle periods from the pool of registered resources, our system continuously monitors keyboard and mouse activities via event handlers rather than waiting for a screensaver, as is typically deployed in existing systems. Our system also divides guest tasks into small parts according to a performance model that estimates execution times of the parts. This task division strategy minimizes any disruption to the owners of the GPU resources. Experimental results show that our FGCS system running on two non-dedicated GPUs achieves 111-116% of the throughput achieved by a single dedicated GPU. Furthermore, our system provides over two times the throughput of a screensaver-based system. We also show that the idle periods detected by our system constitute half of the system uptime. We believe that the GPUs hidden and often unused in office environments provide a powerful solution to sequence homology search.

(Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara, *“Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs”*, accepted for publication in IEEE Transactions on Parallel and Distributed Systems, Sep. 2011. [DOI])
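
The task division idea in the abstract can be illustrated with a minimal sketch. It assumes a simple linear performance model (a fixed launch overhead plus a constant cost per work item), which is not taken from the paper. The function names, parameters, and the 200 ms budget are all hypothetical and chosen only for illustration.

```python
def estimate_ms(num_items, ms_per_item, overhead_ms):
    """Assumed linear performance model: fixed overhead plus per-item cost."""
    return overhead_ms + num_items * ms_per_item


def split_into_parts(total_items, ms_per_item, overhead_ms, budget_ms):
    """Split a guest task into consecutive (start, count) parts.

    Each part is sized so that its estimated execution time under the
    assumed model does not exceed budget_ms.
    """
    if ms_per_item <= 0:
        raise ValueError("ms_per_item must be positive")
    if budget_ms - overhead_ms < ms_per_item:
        raise ValueError("budget too small to run even one item")

    # Largest item count whose estimated time still fits in the budget.
    max_items = int((budget_ms - overhead_ms) // ms_per_item)

    parts = []
    start = 0
    while start < total_items:
        count = min(max_items, total_items - start)
        parts.append((start, count))
        start += count
    return parts
```

For example, `split_into_parts(1000, 0.5, 5.0, 200.0)` yields three parts of 390, 390, and 220 items, each estimated at no more than 200 ms. A real system would calibrate the model per GPU and per kernel rather than use fixed constants.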

Posted in Research | Tags: Bioinformatics, NVIDIA CUDA, Papers, Sequence Alignment

September 24th, 2011
The second 2-day CUDA programming workshop in Berlin takes place November 5-6. Course details, outline and prices are available at http://cuda.eventbrite.com.

Posted in Business, Events | Tags: Courses, NVIDIA CUDA

September 24th, 2011
The latest release of Symscape’s ofgpu (v0.2) for OpenFOAM® 2.0.x is now available. ofgpu is an open source experimental linear solver library that targets NVIDIA CUDA GPU devices on Windows, Linux, and (untested) Mac OS X. ofgpu now has support for the Cusp preconditioners:

- smoothed_aggregation – equivalent to Algebraic Multi-Grid (AMG)
- scaled_bridson_ainv
- bridson_ainv
- nonsym_bridson_ainv

Also supported is the option to select the GPU device. For more details see http://www.symscape.com/gpu-0-2-openfoam.

Posted in Developer Resources | Tags: Iterative Solvers, NVIDIA CUDA, Open Source, OpenFOAM

September 15th, 2011
AMD has just released Aparapi, a project that started in its JavaLabs team, as open source. Aparapi is an API for expressing data parallel workloads in Java and a runtime component capable of converting the Java bytecode of compatible workloads into OpenCL™ so that it can be executed on a variety of GPU devices. More information can be found in this blog entry.

Posted in Developer Resources | Tags: AMD, Java, Open Source, OpenCL, Tools

September 12th, 2011
Abstract:

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

(Nathan Bell and Jared Hoberock: *“Thrust: A Productivity-Oriented Library for CUDA”*, GPU Computing Gems, Jade Edition, edited by Wen-mei W. Hwu, October 2011)

Posted in Developer Resources, Research | Tags: Libraries, NVIDIA CUDA, Papers, Tools

September 10th, 2011
From the abstract of a GPU market analysis whitepaper by Jon Peddie Research:

Computer graphics is hard work. Behind the images you see in games and movies, or while editing photos or video, some serious processing is taking place. All the processing power you can muster is needed to push and polish pixels. And this task is only going to get more demanding as these applications get more sophisticated. Graphics Processing Units (GPUs), which do the heavy lifting in computer graphics, range greatly in size, price and performance. They span from tiny cores inside an ARM processor (such as Nvidia’s Tegra or Qualcomm’s Snapdragon), to graphics integrated within an X86 processor (such as AMD’s Fusion, Intel’s Sandy Bridge), to a standalone discrete device, or dGPU (such as AMD’s Radeon, or Nvidia’s GeForce).

More information: http://jonpeddie.com/media/presentations/an-analysis-of-the-gpu-market/

Posted in Business | Tags: GPUs, Market

September 8th, 2011
libCL is an open-source parallel algorithm library written in C++ and OpenCL. Rather than a specific domain, libCL intends to encompass a wide range of parallel algorithms and data structures. The goal is to provide a comprehensive repository for high performance visual-centric computing ranging from fundamental primitives such as sorting, searching and algebra to advanced systems of algorithms for computational research and visualization. The current distribution of libCL already contains entirely parallelized implementations of the following algorithms:

- Bounding volume hierarchy construction
- Smoothed particle hydrodynamics
- Radix sort
- Adaptive tone-mapping
- Screen-space ambient occlusion culling
- Bilateral and Recursive Gaussian

libCL emerged out of OpenCL Studio, and as such integrates well with the development environment and its visualization capabilities. libCL is Open Source and released under the Apache license.
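
As an illustration of one of the listed primitives, here is a minimal sequential sketch of least-significant-digit radix sort. It is not libCL's implementation, and the function name and `bits_per_pass` parameter are hypothetical. It only shows the three phases per pass (histogram, exclusive prefix sum, stable scatter) that data-parallel versions typically distribute across threads.

```python
def radix_sort(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers (sequential sketch)."""
    if not keys:
        return []
    if min(keys) < 0:
        raise ValueError("this sketch only handles non-negative integers")

    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(keys)
    max_key = max(out)
    shift = 0

    # One pass per digit, starting from the least significant digit.
    while shift == 0 or (max_key >> shift) > 0:
        # Phase 1: histogram of the current digit.
        counts = [0] * radix
        for k in out:
            counts[(k >> shift) & mask] += 1

        # Phase 2: exclusive prefix sum gives each digit's first output slot.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]

        # Phase 3: stable scatter into the output buffer.
        nxt = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            nxt[offsets[d]] = k
            offsets[d] += 1

        out = nxt
        shift += bits_per_pass
    return out
```

Stability of the scatter phase is what makes the digit-by-digit passes compose into a full sort.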

Posted in Developer Resources | Tags: Open Source, OpenCL

September 4th, 2011
Abstract:

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and *downdate* the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.

(Yuancheng Luo and Ramani Duraiswami, *“Efficient Parallel Non-Negative Least Squares on Multicore Architectures”*, SIAM Journal on Scientific Computing, accepted, Sep. 2011. [PDF] [Source code])

Posted in Research | Tags: Least-squares, Linear Algebra, Numerical Algorithms, NVIDIA CUDA, Papers

September 3rd, 2011
NVIDIA is looking for research posters and speakers for their upcoming events including GTC Express @ SC’11, GTC Asia and GTC U.S. More information about the events, submission procedures and the speaking opportunities can be found here, and the submission system is available at this page.

Posted in Events | Tags: Conferences

September 2nd, 2011
Abstract:

A Helmholtz equation in two dimensions discretized by a second order finite difference scheme is considered. Krylov methods such as Bi-CGSTAB and IDR(s) have been chosen as solvers. Since the convergence of the Krylov solvers deteriorates with increasing wave number, a shifted Laplace multigrid preconditioner is used to improve the convergence. The implementation of the preconditioned solver on CPU (Central Processing Unit) is compared to an implementation on GPU (Graphics Processing Units or graphics card) using CUDA (Compute Unified Device Architecture). The results show that preconditioned Bi-CGSTAB on GPU as well as preconditioned IDR(s) on GPU is about 30 times faster than on CPU for the same stopping criterion.

(H. Knibbe, C.W. Oosterlee and C. Vuik, *“GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method”*, accepted for publication in the Journal of Computational and Applied Mathematics, 2011. [DOI])
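
To make the discretization mentioned in the abstract concrete, the sketch below applies a standard second-order five-point stencil to a grid function. It assumes the sign convention -Δu - k²u, a uniform grid spacing `h`, and zero Dirichlet boundary values. None of these are stated in the abstract, and the function name is hypothetical. The Krylov solvers and the shifted Laplace multigrid preconditioner are not covered here.

```python
def apply_helmholtz(u, h, k):
    """Apply the five-point discretization of -Laplace(u) - k^2 * u.

    u is a list of rows holding interior grid values; values outside the
    grid are treated as zero (assumed Dirichlet boundary condition).
    """
    n_rows = len(u)
    n_cols = len(u[0])

    def at(i, j):
        # Interior value, or zero outside the grid.
        if 0 <= i < n_rows and 0 <= j < n_cols:
            return u[i][j]
        return 0.0

    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Second-order central difference approximation of the Laplacian.
            lap = (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1)
                   - 4.0 * u[i][j]) / (h * h)
            row.append(-lap - k * k * u[i][j])
        result.append(row)
    return result
```

Each output value depends only on a point and its four neighbours, which is why this kind of operator maps naturally onto one GPU thread per grid point.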

Posted in Research | Tags: Multigrid, Numerical Algorithms, NVIDIA CUDA, Papers