The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose, i.e. noisy, computed tomography (CT) data. The medical imaging domain has seen a tremendous development during the last decades. It is now possible to collect time resolved volumes, i.e. 4D data, with a number of modalities (e.g. ultrasound (US), CT, magnetic resonance imaging (MRI)). While 3D image denoising previously has been applied to several volumes independently, there has not been much work done on true 4D image denoising, where the algorithm considers several volumes at the same time (and not a single volume at a time). By using all the dimensions, it is for example possible to remove some of the time varying reconstruction artefacts that exist in CT volumes. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset of the resolution 512 x 512 x 445 x 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT based filtering. Fast spatial filtering makes it possible to apply the denoising algorithm to larger datasets (compared to if FFT based filtering is used). The short processing time increases the clinical value of true 4D image denoising significantly.
(Anders Eklund, Mats Andersson, Hans Knutsson: “True 4D Image Denoising on the GPU”, International Journal of Biomedical Imaging, Article ID 952819, 2011 [Youtube Video] [PDF])
In this work, we present an interactive visual clustering approach for the exploration and analysis of vast volumes of data. Our proposed approach is a bio-inspired collective behavioral model to be used in a 3D graphics environment. Our paper illustrates an extension of the behavioral model for clustering and a parallel implementation, using Compute Unified Device Architecture to exploit the computational power of Graphics Processor Units (GPUs). The advantage of our approach is that, as data enters the environment, the user is directly involved in the data mining process. Our experiments illustrate the effectiveness and efficiency provided by our approach when applied to a number of real and synthetic data sets.
(U. Erra, B. Frola, and V. Scarano: “A GPU-based Interactive Bio-inspired Visual Clustering”, Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining. Paris, France. April 11-15, 2011 [PDF] [Video])
Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In such a way, parallelism can be exploited on the block-level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.
We propose a new method for anticipating the fill-in pattern of ILU(p) schemes which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix $A$ subject to a predetermined pattern given by the matrix power $|A|^{p+1}$ and its associated multi-coloring permutation $\pi$. We prove that the obtained sparsity pattern is a superset of our modified ILU(p) factorization applied to $\pi A \pi^{-1}$. As a result, this modified ILU(p) applied to the multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU(p) sweeps.
In addition, we describe the integration of the preconditioners into the HiFlow3 open-source finite element package that provides a portable software solution across diverse hardware platforms. On this basis, we conduct performance analysis across a variety of test problems on multi-core CPUs and GPUs that proves efficiency, scalability and flexibility of our approach. Our preconditioners achieve a solver acceleration by a factor of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are by a factor of up to 4 faster than an OpenMP parallel version on eight cores.
(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: “Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs — The Power(q)-pattern Method”, EMCL Preprint Series, number 08, July 2011 [PDF])
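The multi-coloring idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the HiFlow3 API: the greedy coloring strategy, the dict-of-sets sparsity representation, and the names `greedy_multicoloring`, `permutation_from_colors` and `diagonal_blocks_are_diagonal` are assumptions made for illustration. Rows of the same color share no off-diagonal coupling, so once rows are grouped by color each diagonal block is itself diagonal.

```python
def greedy_multicoloring(pattern):
    """Assign a color to each row so that coupled rows get different colors.

    `pattern` maps a row index to the set of column indices of its nonzeros
    (hypothetical representation; a symmetric sparsity pattern is assumed).
    """
    colors = {}
    for row in sorted(pattern):
        # Colors already used by previously colored neighbours of this row.
        taken = {colors[col] for col in pattern[row]
                 if col != row and col in colors}
        color = 0
        while color in taken:
            color += 1
        colors[row] = color
    return colors


def permutation_from_colors(colors):
    """Order the rows color by color (the multi-coloring permutation)."""
    return sorted(colors, key=lambda row: (colors[row], row))


def diagonal_blocks_are_diagonal(pattern, colors):
    """Check that no two distinct rows of the same color are coupled."""
    return all(colors[row] != colors[col]
               for row in pattern
               for col in pattern[row]
               if col != row)


# Tridiagonal (1D Laplacian-like) pattern with 6 unknowns.
n = 6
pattern = {i: {j for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)}
colors = greedy_multicoloring(pattern)
order = permutation_from_colors(colors)
print(order)  # [0, 2, 4, 1, 3, 5]
```

For the tridiagonal pattern this reduces to the familiar red-black ordering: even rows form one block, odd rows the other, and all rows within a block can be updated in parallel.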
CUDA Template Generator is a Java application that generates CUDA C source file templates based on user input parameters. Features include:
- An algorithm for automatic block and thread definition, depending on array size.
- Automatic memory transfer functions for CPU->GPU->CPU communication.
- Generated C source code function template to use in your application.
Developed by Pavel Kartashev, as part of his Master’s Degree work.
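Choosing block and thread counts from an array size typically comes down to a ceiling division. The Python sketch below shows only that arithmetic; it is not the generator's actual algorithm, and the function name `launch_config` and the default of 256 threads per block are assumptions.

```python
def launch_config(array_size, threads_per_block=256):
    """Return (blocks, threads) for a 1D launch covering `array_size` elements.

    Hypothetical helper: the default of 256 threads per block is an assumption.
    """
    if array_size <= 0 or threads_per_block <= 0:
        raise ValueError("array_size and threads_per_block must be positive")
    # Do not request more threads per block than there are elements.
    threads = min(threads_per_block, array_size)
    # Ceiling division so that blocks * threads >= array_size.
    blocks = (array_size + threads - 1) // threads
    return blocks, threads


print(launch_config(1000))  # (4, 256)
print(launch_config(100))   # (1, 100)
```

Because the last block may contain more threads than remaining elements (4 x 256 = 1024 threads for 1000 elements), a kernel launched this way needs a bounds check on its global index.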
We are pleased to announce a three-day workshop on “Programming of Heterogeneous Systems in Physics”, a workshop to be held on 5-7 October 2011 at Friedrich-Schiller University, Jena, Germany. This workshop will focus on:
- Solving partial differential equations efficiently on heterogeneous computing systems. There is some emphasis on GPU computing, but other accelerators and the efficient use of large multi-core cluster nodes are considered as well.
- Optimization of computational kernels coming from finite differences, spectral methods, and lattice gauge theory on accelerators.
We plan to have a tutorial day, two days of talks and a poster session. We plan for discussion and talks to provide an overview of current work in these areas, and to develop future lines of research and collaborations. The deadline for submission of talks is 15 August 2011.
Please visit http://wwwsfb.tpi.uni-jena.de/Events/Event-PHSP11.shtml for more information. This workshop is organised by G. Zumbusch (Chair, Jena), B. Bruegmann (Jena), A. Weyhausen (Jena), L. Rezzolla (Potsdam) and B. Zink (Tuebingen).
AMD announced a GPGPU coding competition, called AMD OpenCL Coding Competition. The first phase of the competition is an open innovation challenge that requires the use of the AMD APP SDK and OpenCL. The competition is heating up with the highest registration for a TopCoder innovation challenge to date. It’s not too late to sign up and show off your ideas! If you submit your abstract before June 30th, you will get feedback from AMD; otherwise, you will have up until the deadline to submit your OpenCL innovation challenge submission.
Phase two of the competition will be an OpenCL algorithm optimization match that will start later in September. Read more about it in this AMD blog.
Microsoft has announced that the next version of Visual Studio will contain technology labeled C++ Accelerated Massive Parallelism (C++ AMP) to enable C++ developers to take advantage of the GPU for computation purposes. More information is available in the MSDN blog posts here and here.
Intel has announced ispc, The Intel SPMD Program Compiler, now available in source and binary form from http://ispc.github.com.
ispc is a new compiler for “single program, multiple data” (SPMD) programs; the same model that is used for (GP)GPU programming, but here targeted to CPUs. ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs; it frequently provides a 3x or more speedup on CPUs with 4-wide SSE units, without any of the difficulty of writing intrinsics code. There were a few principles and goals behind the design of ispc:
- To build a small C-like language that would deliver excellent performance to performance-oriented programmers who want to run SPMD programs on the CPU.
- To provide a thin abstraction layer between the programmer and the hardware—in particular, to have an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
- To make it possible to harness the computational power of the SIMD vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
- To explore opportunities from close coupling between C/C++ application code and SPMD ispc code running on the same processor—to have lightweight function calls between the two languages, to share data directly via pointers without copying or reformatting, and so forth.
ispc is an open source compiler with a BSD license. It uses the LLVM Compiler Infrastructure for back-end code generation and optimization and is hosted on github. It supports Windows, Mac, and Linux, with both x86 and x86-64 targets. It currently supports the SSE2 and SSE4 instruction sets, though support for AVX should be available soon.
The performance of many math functions has improved with the release of the CUDA 4.0 Toolkit. This presentation includes the performance results of many of the key functions. Results include performance measurements for:
- cuFFT – Fast Fourier Transforms Library
- cuBLAS – Complete BLAS Library
- cuSPARSE – Sparse Matrix Library
- cuRAND – Random Number Generation (RNG) Library
- NPP – Performance Primitives for Image & Video Processing
- Thrust – Templated Parallel Algorithms & Data Structures
- math.h – C99 floating-point Library
A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel. The numerical experiments are also presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods, using the parallel sparse triangular solve algorithm, can achieve on average more than 2x speedup on graphical processing units (GPUs) over their CPU implementation.
(Maxim Naumov: “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”, NVIDIA Technical Report, June 2011. [WWW])
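The two phases described in the abstract above (an analysis phase that groups independent rows into levels, and a solve phase that walks the levels in order) can be sketched in plain Python. This is a sequential illustration under stated assumptions, not the code from the technical report or any library API: the row-wise `{column: value}` storage and the names `analyze_levels` and `solve_lower` are hypothetical.

```python
def analyze_levels(lower_rows):
    """Analysis phase: group rows of a lower-triangular matrix into levels.

    `lower_rows[i]` is a dict {column: value} for row i, diagonal included
    (hypothetical storage format). A row's level is one more than the highest
    level among the rows it depends on.
    """
    n = len(lower_rows)
    level_of = [0] * n
    for i in range(n):
        deps = [level_of[j] for j in lower_rows[i] if j < i]
        level_of[i] = 1 + max(deps) if deps else 0
    levels = [[] for _ in range(max(level_of) + 1)] if n else []
    for i, lvl in enumerate(level_of):
        levels[lvl].append(i)
    return levels


def solve_lower(lower_rows, rhs, levels):
    """Solve phase: process levels in order; rows within a level are independent."""
    x = [0.0] * len(rhs)
    for level in levels:
        # On a GPU, each row of this level could be assigned to its own thread.
        for i in level:
            acc = sum(v * x[j] for j, v in lower_rows[i].items() if j < i)
            x[i] = (rhs[i] - acc) / lower_rows[i][i]
    return x


# Small lower-triangular example system L x = b.
L = [
    {0: 2.0},
    {1: 1.0},
    {0: 1.0, 2: 1.0},
    {1: 2.0, 2: 1.0, 3: 4.0},
]
b = [2.0, 3.0, 4.0, 13.0]
levels = analyze_levels(L)
x = solve_lower(L, b, levels)
print(levels)  # [[0, 1], [2], [3]]
print(x)       # [1.0, 3.0, 3.0, 1.0]
```

The available parallelism depends on the sparsity pattern: rows 0 and 1 here have no dependencies and form one level, whereas a dense triangular matrix would yield one row per level and no parallelism at all.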