The wall shear stress is a quantity of profound importance for clinical diagnosis of artery diseases. The lattice Boltzmann is an easily parallelizable numerical method of solving the flow problems, but it suffers from errors of the velocity field near the boundaries which leads to errors in the wall shear stress and normal vectors computed from the velocity. In this work we present a simple formula to calculate the wall shear stress in the lattice Boltzmann model and propose to compute wall normals, which are necessary to compute the wall shear stress, by taking the weighted mean over boundary facets lying in a vicinity of a wall element. We carry out several tests and observe an increase of accuracy of computed normal vectors over other methods in two and three dimensions. Using the scheme we compute the wall shear stress in an inclined and bent channel fluid flow and show a minor influence of the normal on the numerical error, implying that that the main error arises due to a corrupted velocity field near the staircase boundary. Finally, we calculate the wall shear stress in the human abdominal aorta in steady conditions using our method and compare the results with a standard finite volume solver and experimental data available in the literature. Applications of our ideas in a simplified protocol for data preprocessing in medical applications are discussed.
(Maciej Matyka, Zbigniew Koza, Łukasz Mirosław: “Wall Orientation and Shear Stress in the Lattice Boltzmann Model”, Preprint, 2012. [arXiv])
A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is determined for over 140 sparse matrices on two Fermi-class graphics processing units (GPUs) and the efficiency of the kernel, which peaks at 36 and 25 GFLOPS at single and double precision, respectively, is compared with that of five existing generic algorithms and industrial implementations. The efficiency of the new format is also measured as a function of the mean (mu) and of the standard deviation (sigma) of the number of matrix nonzero elements per row. The largest speedup is found for matrices with mu > 20 and mu > sigma > 1.5 and can be as high as 43%.
(Zbigniew Koza, Maciej Matyka, Sebastian Szkoda, Łukasz Mirosław: “Compressed Multiple-Row Storage Format”, Preprint, 2012. [arXiv])
A new format for storing sparse matrices is suggested. It is designed to perform well mainly on GPU devices. Its implementation in CUDA is presented. Its performance is tested on 1600 different types of matrices. This format is compared in detail with a hybrid format, and strong and weak points of both formats are shown.
(Oberhuber T., Suzuki A., Vacata J.: “New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA”, Acta Technica 56: 447-466, 2011 [PDF])
UKPEW is the leading UK forum for the presentation of all aspects of performance modelling and analysis of computer and telecommunication systems. Original papers are invited on all relevant topics but papers on or related to the subjects listed below are particularly welcome.
Topics of interest include, but are not limited to:
Read the rest of this entry »
We present a hybrid algorithm to compute convex hull of points in three and higher dimensional spaces. Our formulation uses a GPU-based interior point filter to cull away many of the points that do not belong to the boundary. The convex hull of remaining points is computed on the CPU. The GPU-based filter proceeds in an incremental manner and computes a pseudo-hull that is contained inside the convex hull of the original points. The pseudo-hull computation involves only localized operations and therefore, maps well to GPU architectures. Furthermore, the underlying approach extends to high dimensional point sets and deforming points. In practice, our culling filter can reduce the number of candidate points by two orders of magnitude. We have implemented the hybrid algorithm on commodity GPUs, and evaluated its performance on several large point sets. In practice, the GPU-based filtering algorithm can cull up to 85M interior points per second on NVIDIA GeForce GTX 580 and the hybrid algorithm improves the overall performance of convex hull computation by 10-27 times (for static point sets) and 22-46 times (for deforming point sets).
(Min Tang, Jie-yi Zhao, Ruofeng Tong, and Dinesh Manocha: “GPU accelerated Convex Hull Computation”, accepted by SMI’2012. [WWW] [PREPRINT])
We study the use of a GPU for the numerical approximation of the curvature dependent flows of graphs – the mean-curvature flow and the Willmore flow. Both problems are often applied in image processing where fast solvers are required. We approximate these problems using the complementary finite volume method combined with the method of lines. We obtain a system of ordinary differential equations which we solve by the Runge–Kutta–Merson solver. It is a robust solver with an automatic choice of the integration time step. We implement this solver on CPU but also on GPU using the CUDA toolkit. We demonstrate that the mean-curvature flow can be successfully approximated in single precision arithmetic with the speed-up almost 17 on the Nvidia GeForce GTX 280 card compared to Intel Core 2 Quad CPU. On the same card, we obtain the speed-up 7 in double precision arithmetic which is necessary for the fourth order problem – the Willmore flow of graphs. Both speed-ups were achieved without affecting the accuracy of the approximation. The article is structured in such way that the reader interested only in the implementation of the Runge–Kutta–Merson solver on the GPU can skip the sections containing the mathematical formulation of the problems.
(Oberhuber T., Suzuki A., Žabka V.: “The CUDA implementation of the method of lines for the curvature dependent flows”, Kybernetika 47(2):251–272, 2011. [PDF])
PORTLAND, Ore., March 5 — The Portland Group, a wholly-owned subsidiary of STMicroelectronics, today announced availability of the 2012 release of the PGI line of high-performance parallelizing compilers and development tools for Linux, OS X and Windows. PGI 2012 is the first general release to include support for the OpenACC directive-based programming model for NVIDIA CUDA-enabled Graphics Processing Units (GPUs). This release is also the first to include the fully feature-enabled PGI CUDA C/C++ compiler for multi-core x64 CPUs from Intel and AMD. In addition, PGI 2012 includes a number of performance and feature enhancements for multi-core x64 processor-based HPC systems.
Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and many-core processors lead to a gap in the progression of algorithms and programming environments for these platforms — the parallel models for large clusters do not fully utilize the performance capability of the multi-core CPUs and especially of the GPUs. Software stack needs to run adequately on the next generation of computing devices in order to exploit the potential of these new systems. Moving numerical software from one platform to another becomes an important task since every parallel device has its own programming model and language. The greatest challenge is to provide new techniques for solving (non-)linear systems that combine scalability, portability, fine-grained parallelism and flexibility across the assortment of parallel platforms and programming models. The goal of this thesis is to provide new fine-grained parallel algorithms embedded in advanced sparse linear algebra solvers and preconditioners on the emerging multi-core and many-core technologies.
Read the rest of this entry »
This Dr. Dobb’s Article by Rob Farber provides a tutorial on creating application plugins to accelerate Windows and Linux application performance using CUDA in dynamically loaded libraries.
Adding GPU capabilities to existing Windows and Linux apps can be done simply using plugins and the built-in support found in CUDA. This easy form of dynamic loading enables CUDA to be used selectively to hugely accelerate individual tasks within a larger application.
CUDA is maturing to become a natural extension of the emerging CPU/GPU paradigm of high-speed computing to make it, and GPU computing, a candidate for all application development. A recent article in this series tutorial series, Running CUDA Code Natively on x86 Processors, noted recent developments that allow CUDA programs to transparently compile and run on x86 processors. This article focuses on incorporating CUDA into Windows and Linux workflows by exploiting the capabilities of the NVIDIA compiler driver, nvcc, to create native runtime loadable plugins. Source code is provided to create and utilize CUDA plugins and even dynamically compile and link a CUDA source file into a running application (just like the OpenCL). Read the rest of this entry »
Developed in partnership with AMD, this four day course is designed for GPU Programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU.
Delivered by Acceleware’s Developers, who provide real world experience and examples, the training comprises classroom lectures and hands-on tutorials. Each student will be supplied with a laptop equipped with an AMD Fusion APU for the duration of the course. Small class sizes maximize learning and ensure a personal educational experience. Read the rest of this entry »