March 14th, 2012
March 12th, 2012
A new format for storing sparse matrices is suggested. It is designed to perform well mainly on GPU devices. Its implementation in CUDA is presented. Its performance is tested on 1600 different types of matrices. This format is compared in detail with a hybrid format, and strong and weak points of both formats are shown.
(Oberhuber T., Suzuki A., Vacata J.: “New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA”, Acta Technica 56: 447-466, 2011 [PDF])
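For readers unfamiliar with the baseline that such formats build on, here is a minimal sketch of a sparse matrix-vector product in plain CSR (compressed sparse row). It is not the Row-grouped CSR layout from the paper, whose details are not given in this summary. The names `csr_matvec`, `values`, `col_idx` and `row_ptr` are illustrative assumptions.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a matrix stored in plain CSR by a dense vector x.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative 3x3 matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

result = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(result)  # [7.0, 6.0, 17.0]
```

In a straightforward GPU port each row would typically be handled by one thread, so rows of very different lengths leave some threads idle. That load imbalance is one common motivation for alternative storage formats.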
March 12th, 2012
We present a hybrid algorithm to compute the convex hull of points in three- and higher-dimensional spaces. Our formulation uses a GPU-based interior point filter to cull away many of the points that do not belong to the boundary. The convex hull of the remaining points is computed on the CPU. The GPU-based filter proceeds in an incremental manner and computes a pseudo-hull that is contained inside the convex hull of the original points. The pseudo-hull computation involves only localized operations and therefore maps well to GPU architectures. Furthermore, the underlying approach extends to high-dimensional point sets and deforming points. In practice, our culling filter can reduce the number of candidate points by two orders of magnitude. We have implemented the hybrid algorithm on commodity GPUs and evaluated its performance on several large point sets. The GPU-based filtering algorithm can cull up to 85M interior points per second on an NVIDIA GeForce GTX 580, and the hybrid algorithm improves the overall performance of convex hull computation by 10-27 times (for static point sets) and 22-46 times (for deforming point sets).
(Min Tang, Jie-yi Zhao, Ruofeng Tong, and Dinesh Manocha: “GPU accelerated Convex Hull Computation”, accepted by SMI’2012. [WWW] [PREPRINT])
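The idea of an interior point filter can be illustrated in 2D with a much simpler, sequential stand-in: the classic extreme-point (Akl-Toussaint style) heuristic. This is not the incremental GPU pseudo-hull from the paper. It only shows that any point strictly inside a polygon spanned by input points cannot be a hull vertex, so it can be discarded before the exact hull is computed. The function names `cross` and `cull_interior` are assumptions made for this sketch.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive if b lies left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def cull_interior(points):
    """Drop points strictly inside the polygon of the four axis-extreme points."""
    left = min(points, key=lambda p: p[0])
    bottom = min(points, key=lambda p: p[1])
    right = max(points, key=lambda p: p[0])
    top = max(points, key=lambda p: p[1])

    # left -> bottom -> right -> top is counter-clockwise; drop repeated vertices.
    poly = []
    for v in (left, bottom, right, top):
        if v not in poly:
            poly.append(v)
    if len(poly) < 3:
        return list(points)  # degenerate input: nothing can be culled safely

    def strictly_inside(p):
        n = len(poly)
        return all(cross(poly[i], poly[(i + 1) % n], p) > 0 for i in range(n))

    return [p for p in points if not strictly_inside(p)]


points = [(-2, 0), (0, -2), (2, 0), (0, 2), (0, 0), (1, 0), (0, 1), (2, 2)]
survivors = cull_interior(points)
print(survivors)  # [(-2, 0), (0, -2), (2, 0), (0, 2), (2, 2)]
```

Only the surviving points need to be passed to an exact convex hull routine. The filter is conservative: points on the polygon boundary are kept, so no hull vertex is ever lost.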
March 6th, 2012
We study the use of a GPU for the numerical approximation of the curvature-dependent flows of graphs: the mean-curvature flow and the Willmore flow. Both problems are often applied in image processing, where fast solvers are required. We approximate these problems using the complementary finite volume method combined with the method of lines. We obtain a system of ordinary differential equations, which we solve with the Runge–Kutta–Merson solver. It is a robust solver with an automatic choice of the integration time step. We implement this solver on the CPU and also on the GPU using the CUDA toolkit. We demonstrate that the mean-curvature flow can be successfully approximated in single-precision arithmetic with a speed-up of almost 17 on the Nvidia GeForce GTX 280 card compared to an Intel Core 2 Quad CPU. On the same card, we obtain a speed-up of 7 in double-precision arithmetic, which is necessary for the fourth-order problem, the Willmore flow of graphs. Both speed-ups were achieved without affecting the accuracy of the approximation. The article is structured in such a way that the reader interested only in the implementation of the Runge–Kutta–Merson solver on the GPU can skip the sections containing the mathematical formulation of the problems.
(Oberhuber T., Suzuki A., Žabka V.: “The CUDA implementation of the method of lines for the curvature dependent flows”, Kybernetika 47(2):251–272, 2011. [PDF])
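The Runge–Kutta–Merson scheme with automatic step selection can be sketched for a scalar ODE as follows. This is a CPU-only illustration, not the paper's CUDA implementation. The step-size rule (safety factor 0.8, exponent 1/5, growth clamped to [0.2, 5]) is a common heuristic assumed here, and the names `merson_step` and `integrate` are hypothetical.

```python
import math


def merson_step(f, t, y, h):
    """One Runge-Kutta-Merson step; returns (new value, local error estimate)."""
    k1 = f(t, y)
    k2 = f(t + h / 3, y + h / 3 * k1)
    k3 = f(t + h / 3, y + h / 6 * (k1 + k2))
    k4 = f(t + h / 2, y + h / 8 * (k1 + 3 * k3))
    k5 = f(t + h, y + h * (0.5 * k1 - 1.5 * k3 + 2 * k4))
    y_new = y + h / 6 * (k1 + 4 * k4 + k5)
    err = abs(h / 30 * (2 * k1 - 9 * k3 + 8 * k4 - k5))
    return y_new, err


def integrate(f, t0, y0, t_end, tol=1e-8, h=0.1):
    """Integrate y' = f(t, y) from t0 to t_end with adaptive steps."""
    t, y = t0, y0
    while t_end - t > 1e-12:
        h = min(h, t_end - t)          # do not step past the end time
        y_new, err = merson_step(f, t, y, h)
        if err <= tol:                 # accept the step
            t += h
            y = y_new
        # Rescale the step from the error estimate (assumed heuristic).
        if err == 0.0:
            factor = 5.0
        else:
            factor = min(5.0, max(0.2, 0.8 * (tol / err) ** 0.2))
        h *= factor
    return y


# Test problem: y' = -y, y(0) = 1, exact solution exp(-t).
approx = integrate(lambda t, y: -y, 0.0, 1.0, 1.0)
print(abs(approx - math.exp(-1.0)) < 1e-6)  # True
```

In a method-of-lines setting, `y` would be the vector of unknowns on the spatial grid and `f` the discretized right-hand side. The error estimate would then be reduced to a single number (for example a maximum norm) before the step size is updated.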
March 2nd, 2012
PORTLAND, Ore., March 5 — The Portland Group, a wholly-owned subsidiary of STMicroelectronics, today announced availability of the 2012 release of the PGI line of high-performance parallelizing compilers and development tools for Linux, OS X and Windows. PGI 2012 is the first general release to include support for the OpenACC directive-based programming model for NVIDIA CUDA-enabled Graphics Processing Units (GPUs). This release is also the first to include the fully feature-enabled PGI CUDA C/C++ compiler for multi-core x64 CPUs from Intel and AMD. In addition, PGI 2012 includes a number of performance and feature enhancements for multi-core x64 processor-based HPC systems.
February 22nd, 2012
Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods, resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shift from single-core to multi-core and many-core processors has led to a gap in the progression of algorithms and programming environments for these platforms: the parallel models for large clusters do not fully utilize the performance capability of multi-core CPUs, and especially of GPUs. The software stack needs to run adequately on the next generation of computing devices in order to exploit the potential of these new systems. Moving numerical software from one platform to another becomes an important task, since every parallel device has its own programming model and language. The greatest challenge is to provide new techniques for solving (non-)linear systems that combine scalability, portability, fine-grained parallelism and flexibility across the assortment of parallel platforms and programming models. The goal of this thesis is to provide new fine-grained parallel algorithms embedded in advanced sparse linear algebra solvers and preconditioners on the emerging multi-core and many-core technologies.
February 22nd, 2012
High Performance Graphics is the leading international forum for performance-oriented graphics systems research including innovative algorithms, efficient implementations, and hardware architecture. The conference brings together researchers, engineers, and architects to discuss the complex interactions of massively parallel hardware, novel programming models, efficient graphics algorithms, and novel applications. HPG2012, which will take place on June 25-27, is co-located with the Eurographics Symposium on Rendering in Paris, France.
Original and innovative performance-oriented contributions from all areas of graphics are cordially invited for both the papers and the posters track. Please refer to the conference website, located at http://www.highperformancegraphics.org, for more details and the full call.
February 13th, 2012
In recent years, using Graphics Processing Units for general-purpose processing has become a very popular approach to building low-cost, high-performance computing applications. Algorithms from many computer science application domains have been adapted to GPUs to increase processing efficiency. Unfortunately, while other application domains benefit strongly from GPUs, database-related applications seem not to get enough attention. The main goal of the GPUs in Databases workshop is to fill this gap. The event is devoted to sharing knowledge about applying GPUs in database environments and to discussing possible future development of this application domain.
The list of topics includes: data compression on GPU, GPUs in databases and data warehouses, data mining using GPUs, stream processing, applications of GPUs in bioinformatics and data oriented GPU primitives.
February 10th, 2012
Chai is a new managed platform for GPGPU. It is a free and open source clean room workalike of the PeakStream platform. While not production-ready, the just-released alpha version is able to compile and run non-trivial PeakStream demo code on AMD and NVIDIA GPUs (e.g. conjugate gradient).
Chai combines an application virtual machine, garbage collection, an auto-tuning JIT compiler, and a high-level array programming language implemented as an embedded domain-specific language in C++. The JIT back-end uses expectation-maximization to auto-tune and generate vectorized OpenCL. The JIT includes auto-tuned model families for GEMM and GEMV. Although originally developed for AMD GPUs, these parameterized kernel families also generalize to NVIDIA GPUs.
February 1st, 2012
We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique based on cascades of sparse matrix-vector multiplication by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve an average speedup of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers can deliver a speedup of 3-5, depending on the element space, compared to simple Jacobi smoothing. This can be increased to a factor of 7 by combining Approximate Inverse-based smoothers with clever sorting of the degrees of freedom. In total, the FE-gMG solver can outperform a simple (multicore-)CPU-based multigrid by a factor of over 40.
(Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac and Stefan Turek: “Towards a complete FEM-based simulation toolkit on GPUs: Unstructured Grid Finite Element Geometric Multigrid solvers with strong smoothers based on Sparse Approximate Inverses”, accepted for publication in Computers and Fluids, 2011. [preprint])
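As a point of reference for the "simple Jacobi smoothing" that the strong smoothers are compared against, a damped Jacobi sweep can be sketched as below. It uses a small dense matrix for brevity, whereas the solver described above works on sparse matrices. The damping factor 0.8 and the name `damped_jacobi` are assumptions made for this illustration.

```python
def damped_jacobi(A, b, x, omega=0.8, sweeps=1):
    """Apply `sweeps` damped Jacobi iterations: x <- x + omega * D^-1 (b - A x)."""
    n = len(b)
    for _ in range(sweeps):
        # Residual r = b - A x, computed from the old iterate only.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Scale by the inverse diagonal and apply the damped correction.
        x = [x[i] + omega * r[i] / A[i][i] for i in range(n)]
    return x


# Small diagonally dominant test system with exact solution (1/11, 7/11).
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]

x = damped_jacobi(A, b, [0.0, 0.0], sweeps=50)
print(abs(x[0] - 1 / 11) < 1e-9 and abs(x[1] - 7 / 11) < 1e-9)  # True
```

Each sweep costs one matrix-vector product plus a diagonal scaling. An approximate-inverse smoother replaces the diagonal scaling `D^-1` with multiplication by a sparse approximation of the full inverse, which keeps the same matrix-vector structure.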
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
(Conti C., Rossinelli D., Koumoutsakos P.: “GPU and APU computations of Finite Time Lyapunov Exponent fields”, Journal of Computational Physics, 231(5):2229–2244, 2012.)
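For orientation, the quantity being computed can be written down for a single point. Given the 2x2 gradient F of the flow map over an integration time T, the FTLE is (1/|T|) ln sqrt(lambda_max(F^T F)). The sketch below evaluates this from a gradient that is supplied directly. In the paper the gradient comes from resampling the flow field, which is the expensive part and is not reproduced here. The name `ftle_2d` is an assumption.

```python
import math


def ftle_2d(F, T):
    """FTLE at one point from the 2x2 flow-map gradient F over time T."""
    (a, b), (c, d) = F
    # Cauchy-Green tensor C = F^T F (symmetric 2x2).
    c11 = a * a + c * c
    c12 = a * b + c * d
    c22 = b * b + d * d
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    mean = 0.5 * (c11 + c22)
    radius = math.hypot(0.5 * (c11 - c22), c12)
    lam_max = mean + radius
    return math.log(math.sqrt(lam_max)) / abs(T)


# Linear saddle flow x' = x, y' = -y: the flow-map gradient over time T
# is diag(exp(T), exp(-T)), so the FTLE equals 1 everywhere.
T = 2.0
F = [[math.exp(T), 0.0],
     [0.0, math.exp(-T)]]
sigma = ftle_2d(F, T)
print(abs(sigma - 1.0) < 1e-12)  # True
```

Evaluating this per grid point is independent across points, which is why the computation suits data-parallel hardware once the flow-map gradients are available.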