High Performance Predictable Histogramming on GPUs

May 29th, 2011


Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values taken from an input image. Histogramming has been mapped on a GPU prior to this work. Although significant research effort has been spent in optimizing the mapping, we show that the performance and performance predictability of existing methods can still be improved.

In this paper, we present two novel histogramming methods, both achieving a higher performance and predictability than existing methods. We discuss performance limitations for both novel methods by exploring algorithm trade-offs.

The first novel method gives an average performance increase of 33% over existing methods for non-synthetic benchmarks. The second novel method gives an average performance increase of 56% over existing methods and guarantees to be fully data independent. While the second method is specifically designed for Fermi GPU architectures, the first method is also suitable for older architectures.

(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Bart Mesman: “High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs”, GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. [DOI] [Paper and Source Code])
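To make the basic operation concrete, here is a minimal CPU-side sketch in Python. It is not either of the paper's two methods; it only illustrates histogramming itself and the common idea of building private sub-histograms that are merged afterwards. The function names and the `num_parts` parameter are assumptions chosen for illustration.

```python
from typing import List, Sequence


def histogram(pixels: Sequence[int], num_bins: int = 256) -> List[int]:
    """Count how often each pixel value occurs (one bin per value)."""
    bins = [0] * num_bins
    for value in pixels:
        bins[value] += 1
    return bins


def histogram_partitioned(pixels: Sequence[int], num_bins: int = 256,
                          num_parts: int = 4) -> List[int]:
    """Build one private sub-histogram per partition, then sum them.

    Each partition stands in for an independent group of workers
    (hypothetical; on a GPU this role is typically played by a thread block).
    """
    # Ceiling division, with a minimum of 1 so the range step is never zero.
    chunk = max(1, -(-len(pixels) // num_parts))
    total = [0] * num_bins
    for start in range(0, len(pixels), chunk):
        partial = histogram(pixels[start:start + chunk], num_bins)
        total = [t + p for t, p in zip(total, partial)]
    return total
```

On parallel hardware, many pixels landing in the same bin usually means many conflicting updates to one memory location, which is one plausible reason why histogram performance can depend on the input data.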


CfP: GPU Solutions to Multiscale Problems in Science and Engineering

May 29th, 2011

You are cordially invited to attend the GPU Solutions to Multiscale Problems in Science and Engineering Workshop 2011 (GPU-SMP’2011). The workshop will be held in Dunhuang, Gansu, China, on July 18–21, 2011 (Monday through Thursday), with a reception on July 18th. It covers GPU solutions to multiscale problems in science and engineering, including high performance computing methods, advanced software realization and construction of computing environments, and investigations of mainstream development trends and key scientific problems of GPUs in computing and visualization technology. Dunhuang is a beautiful city with a long history. During the workshop, attendees will have the opportunity to visit local attractions. Full details at http://gpu-smp2011.csp.escience.cn/


    CfP: Facing the Multicore-Challenge II

    May 11th, 2011

    Facing the Multicore-Challenge II, a conference for young scientists, will be held September 28-30, 2011, at Karlsruhe Institute of Technology (KIT), Germany.

    The conference focuses on topics of multi-/manycore and coprocessor technologies and the impact on computational science, day-to-day work, and for large-scale applications. The goal is to address and discuss current issues including mathematical modeling, numerical methods, design of parallel algorithms, aspects of microprocessor architecture, parallel programming languages, compilers, hardware-aware computing, heterogeneous platforms, emerging architectures, tools, performance tuning, and requirements for large-scale applications.

    The conference places emphasis on the support and advancement of young scientists in an interdisciplinary environment.
    You are cordially invited to submit a paper presenting original, unpublished work. Ongoing research can also be presented in a short talk or poster.


    Alenka – SQL for CUDA

    May 11th, 2011

    Alenka is a columnar SQL-like language for data processing on CUDA hardware. Alenka uses vector-based processing to perform SQL operations such as joins, groups and sorts. The program can process very large data sets that do not fit into GPU or host memory: such sets are partitioned into pieces and processed separately. Get it here: https://sourceforge.net/projects/alenka/files/
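The partition-and-process idea can be sketched in a few lines of Python. This is not Alenka's code or syntax; it is a minimal illustration, under the assumption of a hypothetical two-column table, of how a grouped sum can be computed piece by piece so that only one piece needs to be resident at a time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, float]  # (group key, value): a hypothetical two-column table


def partition(rows: List[Row], piece_size: int) -> Iterator[List[Row]]:
    """Yield consecutive pieces of at most piece_size rows (piece_size > 0)."""
    for start in range(0, len(rows), piece_size):
        yield rows[start:start + piece_size]


def grouped_sum(pieces: Iterable[List[Row]]) -> Dict[str, float]:
    """Aggregate each piece on its own, then merge the partial results."""
    totals: Dict[str, float] = defaultdict(float)
    for piece in pieces:
        partial: Dict[str, float] = defaultdict(float)
        for key, value in piece:          # work done on one resident piece
            partial[key] += value
        for key, subtotal in partial.items():
            totals[key] += subtotal       # small merge step between pieces
    return dict(totals)
```

For example, `grouped_sum(partition(rows, 2))` processes `rows` two at a time. Operations such as sums and counts merge trivially across pieces; joins and sorts need more elaborate merging, which this sketch does not attempt.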

    Mesh-particle interpolations on GPUs and multicore CPUs

    May 4th, 2011


    Particle–mesh interpolations are fundamental operations for particle-in-cell codes, as implemented in vortex methods, plasma dynamics and electrostatics simulations. In these simulations, the mesh is used to solve the field equations and the gradients of the fields are used in order to advance the particles. The time integration of particle trajectories is performed through an extensive resampling of the flow field at the particle locations. The computational performance of this resampling turns out to be limited by the memory bandwidth of the underlying computer architecture. We investigate how mesh–particle interpolation can be efficiently performed on graphics processing units (GPUs) and multicore central processing units (CPUs), and we present two implementation techniques. The single-precision results for the multicore CPU implementation show an acceleration of 45–70×, depending on system size, and an acceleration of 85–155× for the GPU implementation over an efficient single-threaded C++ implementation. In double precision, we observe a performance improvement of 30–40× for the multicore CPU implementation and 20–45× for the GPU implementation. With respect to the 16-threaded standard C++ implementation, the present CPU technique leads to a performance increase of roughly 2.8–3.7× in single precision and 1.7–2.4× in double precision, whereas the GPU technique leads to an improvement of 9× in single precision and 2.2–2.8× in double precision.

    (Diego Rossinelli, Christian Conti and Petros Koumoutsakos: “Mesh−particle interpolations on GPUs and multicore CPUs”, Phil. Trans. R. Soc. A 2011, 369:2164-2175 [doi])
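As a rough illustration of the operation being accelerated, the following Python sketch resamples a one-dimensional mesh field at particle locations using linear weights. It is a deliberately simplified assumption (1D, uniform spacing `h`, linear kernel) and not one of the implementation techniques presented in the paper; the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def mesh_to_particles(field: Sequence[float], positions: Sequence[float],
                      h: float = 1.0) -> List[float]:
    """Linearly interpolate node values field[i], located at x = i * h,
    to the given particle positions.

    Assumes at least two mesh nodes and positions inside the mesh extent.
    """
    last = len(field) - 1
    result = []
    for x in positions:
        s = x / h                                   # position in mesh units
        i = min(max(int(math.floor(s)), 0), last - 1)  # left neighbour node
        w = s - i                                   # weight of right neighbour
        result.append((1.0 - w) * field[i] + w * field[i + 1])
    return result
```

Each particle reads only a couple of mesh values and does very little arithmetic on them, which is consistent with the abstract's observation that the resampling is limited by memory bandwidth rather than by computation.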

    SGC Ruby CUDA 0.1.0 Release

    May 4th, 2011

    SGC Ruby CUDA has been heavily updated. It is now available from the standard Ruby Gems repository. Updates include:

    • Basic CUDA Driver and Runtime API support on CUDA 4.0rc2 with unit tests.
    • Object-Oriented API.
    • Exception classes for CUDA errors.
    • Support for Linux and Mac OS X platforms.
    • Documented with YARD.

    See http://blog.speedgocomputing.com/2011/04/first-release-of-sgc-ruby-cuda.html for more details.

    GPU Linear Solvers for OpenFOAM

    May 4th, 2011

    ofgpu is a free GPL library from Symscape that provides GPU linear solvers for OpenFOAM®. The experimental library targets NVIDIA CUDA devices on Windows, Linux, and (untested) Mac OS X. It uses the Cusp library’s Krylov solvers to produce equivalent GPU (CUDA-based) versions of the standard OpenFOAM linear solvers:

    • PCG – Preconditioned conjugate gradient solver for symmetric matrices (e.g., p)
    • PBiCG – Preconditioned biconjugate gradient solver for asymmetric matrices (e.g., Ux, k)

    ofgpu also has support for the OpenFOAM preconditioners:

    • no
    • diagonal

    For more details see “GPU Linear Solver Library for OpenFOAM”. OpenFOAM is a registered trademark of OpenCFD and is unaffiliated with Symscape.
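For readers unfamiliar with the PCG solver and the diagonal preconditioner listed above, here is a textbook sketch in plain Python on a small dense matrix. It is not code from ofgpu, Cusp, or OpenFOAM; the function names, tolerance, and iteration limit are assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, x: Vector) -> Vector:
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg_diagonal(a: Matrix, b: Vector, tol: float = 1e-10,
                 max_iter: int = 1000) -> Vector:
    """Solve a x = b for a symmetric positive definite matrix a using
    conjugate gradients with a diagonal (Jacobi) preconditioner."""
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]   # preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                    # residual b - a x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)                    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Every iteration is dominated by one matrix-vector product plus a handful of vector updates and dot products, which is the kind of data-parallel work a GPU library can take over.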

    A memory efficient and fast sparse matrix vector product on a GPU

    May 4th, 2011


    This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it has both a low memory footprint and good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of a large sparse and complex-valued system of linear equations arising in computational electromagnetics. Numerical tests have shown that the performance of the new implementation reaches 69 GFLOPS in complex single precision arithmetic. Compared to the optimized six-core Central Processing Unit (CPU) (Intel Xeon 5680), this performance implies a speedup by a factor of six. In terms of speed, the new format is as fast as the best format published so far, and at the same time it does not introduce redundant zero elements which have to be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can be handled using low cost commodity GPUs with a limited amount of on-board memory.

    (A. Dziekonski, A. Lamecki, and M. Mrozowski: “A memory efficient and fast sparse matrix vector product on a GPU”, Progress In Electromagnetics Research, Vol. 116, 49-63, 2011. [PDF])
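The trade-off between padding and regular memory access can be illustrated with a simplified sliced ELLPACK-style layout that also records per-row lengths. This Python sketch is an assumption-laden illustration and not the paper's Sliced ELLR-T format: the layout, the function names, and the `slice_height` parameter are all hypothetical, and unlike the format described in the abstract, this sketch still stores padding within each slice.

```python
from typing import List, Tuple

SparseRow = List[Tuple[int, float]]  # (column index, value) pairs of one row
Slice = Tuple[List[List[int]], List[List[float]], List[int]]


def build_sliced_ell(rows: List[SparseRow], slice_height: int) -> List[Slice]:
    """Group rows into slices and pad each slice to its own widest row."""
    slices = []
    for start in range(0, len(rows), slice_height):
        block = rows[start:start + slice_height]
        width = max((len(r) for r in block), default=0)
        cols = [[c for c, _ in r] + [0] * (width - len(r)) for r in block]
        vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in block]
        lens = [len(r) for r in block]   # row lengths let us skip the padding
        slices.append((cols, vals, lens))
    return slices


def spmv(slices: List[Slice], x: List[float]) -> List[float]:
    """Compute y = A x, touching only the real entries of every row."""
    y = []
    for cols, vals, lens in slices:
        for row_cols, row_vals, n in zip(cols, vals, lens):
            y.append(sum(row_vals[k] * x[row_cols[k]] for k in range(n)))
    return y


def stored_entries(slices: List[Slice]) -> int:
    """Number of stored slots, including padding."""
    return sum(len(row) for cols, _, _ in slices for row in cols)
```

Setting `slice_height` to the total number of rows pads every row to the global maximum, as in plain ELLPACK; smaller slices reduce the padding whenever row lengths vary across the matrix.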

    CfP: GPU and Hybrid Computing at PDP2012

    May 4th, 2011

    A special session on GPU and hybrid computing will be held in conjunction with PDP2012, the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, in February 2012 in Garching, Germany. Submissions are cordially invited including but not limited to the following topics:

    • GPU computing, multi GPU processing, hybrid computing;
    • Programming models, programming frameworks, CUDA, OpenCL, communication libraries;
    • Mechanisms for mapping codes;
    • Task allocation;
    • Fault tolerance;
    • Performance analysis;
    • Applications: image processing, signal processing, linear algebra, numerical simulation, optimization;
    • Domains: computer science, electronics, embedded systems, telecommunication, medical imaging, finance.

    More information, including submission and publication details, is available at http://conf.laas.fr/GPU/.

    KGPU: enabling GPU computing in Linux kernel

    May 4th, 2011

    KGPU is a GPU computing framework for the Linux kernel. It allows the Linux kernel to directly execute CUDA programs running on GPUs. The motivation is to augment systems with GPUs so that, like user-space applications, the operating system itself can benefit from GPU acceleration. It can also offload computationally intensive work from the CPU by using the GPU as an extra computing device.

    The current KGPU release includes a demo task with GPU augmentation: eCryptfs, an encrypted file system on Linux, using a GPU-based AES cipher. The read/write bandwidths are expected to be accelerated by a factor of 1.7 to 2.5 on an NVIDIA GeForce GTX 480 GPU.

    The source code can be obtained from https://github.com/wbsun/kgpu, and news and release information can be found at http://code.google.com/p/kgpu/.