Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU

October 30th, 2012


We present new format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on set of 1 600 matrices and we show for what types of matrices our format is profitable.

(Heller M., Oberhuber T.: “Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU”, Proceedings of Algoritmy 2012, 2012, Handlovičová A., Minarechová Z. and Ševčovič D. (ed.), pages 282-290, ISBN 978-80-227-3742-5) [ARXIV preprint]

GPU Simulations of Gravitational Many-body Problem and GPU Octrees

January 20th, 2010

This undergraduate thesis and poster by Kajuki Fujiwara and  Naohito Nakasato from the University of Aizu approach a common problem in astrophysics: the many-body problem, with both brute-force and hierarchical data structures for solving it on ATI GPUs.  Abstracts:

Fast Simulations of Gravitational Many-body Problem on RV770 GPU
Kazuki FujiwaraNaohito Nakasato (University of Aizu)

The gravitational many-body problem is a problem concerning the movement of bodies, which are interacting through gravity. However, solving the gravitational many-body problem with a CPU takes a lot of time due to O(N^2) computational complexity. In this paper, we show how to speed-up the gravitational many-body problem by using GPU. After extensive optimizations, the peak performance obtained so far is about 1 Tflops.

Oct-tree Method on GPU

The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important since computational complexity of these problems are reduced from O(N^2) with a brute force method to O(N log N) with the tree method where N is a number of particles. In this paper, we present a parallel implementation of the tree method running on a graphic processor unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87×10^6 particles took 5.79 hours and executed 1.2×10^13 force evaluations in total. We obtained the sustained computing speed of 21.8 Gflops and the cost per Gflops of 41.6/Gflops that is two and half times better than the previous record in 2006.

Case studies on GPU usage and data structure design

August 11th, 2008


Big improvements in the performance of graphics processing units (GPUs) turned them into a compelling platform for high performance computing. In this thesis, we discuss the usage of NVIDIA’s CUDA in two applications — Einstein@Home, a distributed computing software, and OpenSteer, a game-like application. Our work on Einstein@Home demonstrates that CUDA can be integrated into existing applications with minimal changes, even in programs designed without considering GPU usage. However the existing data structure of Einstein@Home performs poorly when used on the GPU. We demonstrate that using a redesigned data structure improves the performance to about three times as fast as the original CPU version, even though the code executed on the device is not optimized. We further discuss the design of a novel spatial data structure called “dynamic grid” that is optimized for CUDA usage. We measure its performance by integrating it into the Boids scenario of OpenSteer. Our new concept outperforms a uniform grid by a margin of up to 15%, even though the dynamic grid still provides optimization potential.

(Case studies on gpu usage and data structure design. J. Breitbart, Master’s thesis, Universität Kassel, 2008)

CUDA Data-Parallel Primitives Library Released

November 5th, 2007

CUDPP is the CUDA Data Parallel Primitives Library for NVIDIA CUDA. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum (“scan”), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. The first beta release of CUDPP is now available, as is the searchable online documentation.

Ph.D. Dissertation: Glift Generic GPU Data Structures, by Aaron Lefohn

January 18th, 2007

This Ph.D. dissertation by Aaron Lefohn at the University of California, Davis describes the Glift GPU data structure abstraction and its application to both GPU-based data-parallel and interactive rendering algorithms. The applications include octree 3D painting, adaptive shadow maps, resolution matched shadow maps, heat-diffusion depth-of-field, and a GPU-based direct solver for tridiagonal linear systems. While much of this work has been posted previously, this dissertation contains a more in-depth discussion of the Glift data structure library and introduces several GPGPU and rendering algorithms that are not yet published. This dissertation demonstrates that a data structure abstraction for GPUs can simplify the description of new and existing data structures, stimulate development of complex GPU algorithms, and perform equivalently to hand-coded implementations. The dissertation also presents a case that future interactive rendering solutions will be an inseparable mix of general-purpose, data-parallel algorithms and traditional graphics programming. (Aaron Lefohn, “Glift: Generic Data Structures for Graphics Hardware”, Ph.D. dissertation, Computer Science Department, University of California Davis, September 2006.)

SIGGRAPH Poster: GPU Histogram Computation

August 10th, 2006

This SIGGRAPH poster by Oliver Fluck et al. presents an approach to computing histograms in fragment shaders. The proposed method enables iterative and histogram-guided algorithms to run on GPUs and avoids data transfer between the GPU and main memory. The algorithm has been demonstrated using the example of a GPU level set segmentation. (GPU Histogram Computation)

Fast GPU Ray Tracing of Dynamic Meshes using Geometry Images

March 17th, 2006

Using the GPU to accelerate ray tracing may seem like a natural choice due to the highly parallel nature of the problem. However, determining the most versatile GPU data structure for scene storage and traversal is a challenge. In this paper, we introduce a new method for quick intersection of triangular meshes on the GPU. The method uses a threaded bounding volume hierarchy built from a geometry image, which can be efficiently traversed and constructed entirely on the GPU. This acceleration scheme is highly competitive with other GPU ray tracing methods, while allowing for both dynamic geometry and an efficient level of detail scheme at no extra cost. (Fast GPU Ray Tracing of Dynamic Meshes using Geometry Images Nathan A. Carr, Jared Hoberock, Keenan Crane, and John C. Hart. To appear in Proceedings of Graphics Interface 2006)

Glift: Generic, Efficient, Random-Access GPU Data Structures

February 10th, 2006

This paper presents Glift, an abstraction and generic template library for defining complex, random-access graphics processor (GPU) data structures. Like modern CPU data structure libraries, Glift enables GPU programmers to separate algorithms from data structure definitions; thereby greatly simplifying algorithmic development and enabling reusable and interchangeable data structures. We characterize a large body of previously published GPU data structures in terms of our abstraction and present several new GPU data structures. The structures, a stack, quadtree, and octree, are explained using simple Glift concepts and implemented using reusable Glift components. We also describe two applications of these structures not previously demonstrated on GPUs: adaptive shadow maps and octree 3D paint. Lastly, we show that our example Glift data structures perform comparably to handwritten implementations while requiring only a fraction of the programming effort. (Glift: Generic, Efficient, Random-Access GPU Data Structures. Aaron E. Lefohn, Joe Kniss, Robert Strzodka, Shubhabrata Sengupta, John D. Owens. ACM Transactions on Graphics, 25(1), Jan. 2006.)

Stack Implementation on Programmable Graphics Hardware

June 12th, 2005

This paper by Ernst et al. describes a stack implementation for the GPU using textures for storage. For a predefined maximum stack depth, k, either k data textures, or a single large texture with k stack layers side by side are used. Additionally a stack pointer texture is needed. The paper argues that both push and pop can become O(1) operations using fragment program branching. Both push and pop require separate rendering passes. The technique is demonstrated in a kd-tree traversal implementation. (gpu stack bibtex)

DuoDecim – A Structure for Point Scan Compression and Rendering

May 26th, 2005

This paper presents a compression scheme for large point scans including per-point normals. For the encoding of such scans, the paper introduces a type of closest sphere packing grids, the hexagonal close packing (HCP). To compress the data, linear sequences of filled cells in HCP grids are extracted. Point positions and normals in these runs are incrementally encoded. At a grid spacing close to the point sampling distance, the compression scheme only requires slightly more than 3 bits per point position. Incrementally encoded per-point normals are quantized at high fidelity using only 5 bits per normal. The compressed data stream can be decoded in the graphics processing unit (GPU). Decoded point positions are saved in graphics memory, and they are then used on the GPU again to render point primitives. In this way gigantic point scans are rendered from their compressed representation in local GPU memory at interactive frame rates. (