Chai, a new managed platform for GPGPU

February 13th, 2012

Chai is a new managed platform for GPGPU. It is a free and open source clean room workalike of the PeakStream platform. While not production-ready, the just-released alpha version is able to compile and run non-trivial PeakStream demo code (e.g. conjugate gradient) on AMD and NVIDIA GPUs.

Chai combines an application virtual machine, garbage collection, an auto-tuning JIT compiler, and a high-level array programming language implemented as an embedded domain-specific language in C++. The JIT back-end uses expectation-maximization to auto-tune and generate vectorized OpenCL. The JIT includes auto-tuned model families for GEMM and GEMV. Although originally developed for AMD GPUs, these parameterized kernel families also generalize to NVIDIA GPUs.

Towards a complete FEM-based simulation toolkit on GPUs

February 10th, 2012


We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique based on cascades of sparse matrix-vector multiplication by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve an average speedup of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers can deliver a speedup of 3-5 depending on the element space, compared to simple Jacobi smoothing. This can even be enhanced to a factor of 7 when combining the usage of Approximate Inverse-based smoothers with clever sorting of the degrees of freedom. In total, the FE-gMG solver can outperform a simple, (multicore-)CPU-based multigrid by a factor of over 40.

(Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac and Stefan Turek: “Towards a complete FEM-based simulation toolkit on GPUs: Unstructured Grid Finite Element Geometric Multigrid solvers with strong smoothers based on Sparse Approximate Inverses”, accepted for publication in Computers and Fluids, 2011. [preprint])
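
As a rough illustration of a smoother built from sparse matrix-vector products, the Python sketch below applies damped Jacobi sweeps, x <- x + omega * D^-1 * (b - A x), to a small matrix stored in CSR form. It is not the FE-gMG code: the function names (csr_matvec, jacobi_smooth, poisson_1d_csr, residual_norm), the damping factor 0.7 and the 1D Poisson test matrix are all assumptions made for illustration. An approximate-inverse smoother would, in this picture, replace the diagonal scaling D^-1 with a second sparse matrix-vector product by a matrix M that approximates the inverse of A.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def jacobi_smooth(indptr, indices, data, diag, b, x, omega=0.7, sweeps=1):
    """Apply damped Jacobi sweeps: x <- x + omega * D^-1 * (b - A x)."""
    for _ in range(sweeps):
        ax = csr_matvec(indptr, indices, data, x)
        x = [xi + omega * (bi - axi) / di
             for xi, bi, axi, di in zip(x, b, ax, diag)]
    return x


def poisson_1d_csr(n):
    """Hypothetical test problem: tridiag(-1, 2, -1) in CSR form, plus its diagonal."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data, [2.0] * n


def residual_norm(indptr, indices, data, b, x):
    """Maximum norm of the residual b - A x."""
    ax = csr_matvec(indptr, indices, data, x)
    return max(abs(bi - axi) for bi, axi in zip(b, ax))


# Smooth a zero initial guess and compare residuals before and after.
indptr, indices, data, diag = poisson_1d_csr(8)
b = [1.0] * 8
x0 = [0.0] * 8
x1 = jacobi_smooth(indptr, indices, data, diag, b, x0, sweeps=10)
print(residual_norm(indptr, indices, data, b, x0),
      residual_norm(indptr, indices, data, b, x1))
```

Each sweep costs one sparse matrix-vector product plus a vector update, which is why this kind of smoother maps naturally onto an implementation organised around sparse matrix-vector multiplication.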

GPU and APU computations of Finite Time Lyapunov Exponent fields

February 1st, 2012

We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is computationally intensive, because obtaining the sharp ridges associated with Lagrangian Coherent Structures requires extensive resampling of the flow field. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.

(Conti C., Rossinelli D., Koumoutsakos P., GPU and APU computations of Finite Time Lyapunov Exponent fields, Journal of Computational Physics, 231(5):2229–2244, 2012.)
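
For readers unfamiliar with the quantity being computed, the following Python sketch evaluates the standard FTLE definition at a single point: advect neighbouring particles over a time T, form the flow map gradient F by central differences, and take ln(sqrt(lambda_max(F^T F))) / |T|. This is not the method of the paper. The analytic saddle flow u = (x, -y), the RK4 integrator, the step counts and all function names are assumptions chosen so that the example is self-contained. A real computation would sample the velocity from a mesh, which is the resampling cost the abstract refers to.

```python
import math


def velocity(x, y):
    """Hypothetical test flow: a steady linear saddle u = (x, -y)."""
    return x, -y


def advect(x, y, T, steps=200):
    """Advect one particle over time T with classical RK4."""
    dt = T / steps
    for _ in range(steps):
        k1x, k1y = velocity(x, y)
        k2x, k2y = velocity(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
        k3x, k3y = velocity(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
        k4x, k4y = velocity(x + dt * k3x, y + dt * k3y)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return x, y


def ftle(x, y, T, h=1e-3):
    """FTLE at (x, y): ln(sqrt(lambda_max(F^T F))) / |T|."""
    # Flow map gradient F from central differences of advected neighbours.
    xr, yr = advect(x + h, y, T)
    xl, yl = advect(x - h, y, T)
    xu, yu = advect(x, y + h, T)
    xd, yd = advect(x, y - h, T)
    f11, f12 = (xr - xl) / (2 * h), (xu - xd) / (2 * h)
    f21, f22 = (yr - yl) / (2 * h), (yu - yd) / (2 * h)
    # Cauchy-Green tensor C = F^T F (symmetric 2x2).
    c11 = f11 * f11 + f21 * f21
    c12 = f11 * f12 + f21 * f22
    c22 = f12 * f12 + f22 * f22
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    mean = 0.5 * (c11 + c22)
    lam_max = mean + math.sqrt((0.5 * (c11 - c22)) ** 2 + c12 * c12)
    return math.log(math.sqrt(lam_max)) / abs(T)


# For the saddle flow the exact FTLE is 1 everywhere.
value = ftle(0.3, -0.2, T=2.0)
print(value)
```

Every grid point needs its own set of advected neighbours, so the work is independent across points, which is what makes the problem a good fit for data-parallel hardware.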

Submit your poster to GTC 2012 by February 2nd!

January 25th, 2012

Reminder: the deadline to submit a research poster for this year’s GPU Technology Conference is Thursday, February 2, 2012. Selected poster presenters receive a discount to attend GTC. They are required to attend the conference in order to present their work at the GTC Poster Showcase. GTC will be held May 14-17 in San Jose, California. For more information, see the call for participation and call for posters. To submit your poster abstract, visit

PyCOOL: Python Cosmological Object-Oriented Lattice code

January 25th, 2012

PyCOOL (Cosmological Object-Oriented Lattice code) is a fast GPU accelerated program that solves the evolution of interacting scalar fields in an expanding universe with symplectic algorithms. The program has been written with the intention to hit a sweet spot of speed, accuracy and user friendliness. This is achieved by using the Python language with the PyCUDA interface to make a program that is very easy to adapt to different scalar field models. The program is publicly available under the GNU General Public License. See the PyCOOL website for more information.
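
To give a flavour of what "symplectic" means in practice, here is a toy kick-drift-kick leapfrog step for a single field value phi with conjugate momentum pi and a quadratic potential. It is only a sketch: PyCOOL itself evolves lattice fields in an expanding background on the GPU, and nothing below is taken from its code. The function names, the unit mass, the step size and the absence of expansion are all assumptions for illustration.

```python
def leapfrog(phi, pi, dV, dt, steps):
    """Kick-drift-kick leapfrog for d(phi)/dt = pi, d(pi)/dt = -dV(phi)."""
    for _ in range(steps):
        pi -= 0.5 * dt * dV(phi)   # half kick
        phi += dt * pi             # full drift
        pi -= 0.5 * dt * dV(phi)   # half kick
    return phi, pi


def energy(phi, pi, m=1.0):
    """Energy for the hypothetical potential V(phi) = 0.5 * m^2 * phi^2."""
    return 0.5 * pi * pi + 0.5 * m * m * phi * phi


def dV_quadratic(phi, m=1.0):
    """Derivative of the quadratic potential above."""
    return m * m * phi


# Integrate for many oscillation periods and compare the energy.
e_start = energy(1.0, 0.0)
phi_end, pi_end = leapfrog(1.0, 0.0, dV_quadratic, dt=0.01, steps=10000)
e_end = energy(phi_end, pi_end)
print(e_start, e_end)
```

The energy error of a symplectic scheme like this one stays bounded over long runs instead of drifting, which is the property that matters for long field evolutions.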

Using GPUs to Accelerate Installed Antenna Performance Simulations

January 9th, 2012


Savant is an asymptotic ray-tracing computational electromagnetics (CEM) tool used to predict the performance of antennas installed on electrically large platforms, including far-field antenna patterns, near-field distributions, and antenna-to-antenna coupling. Savant is based on the shooting and bouncing rays (SBR) formulation. While asymptotic solvers like Savant have significantly smaller computational and memory requirements for electrically large problems than full-wave techniques, the computation costs still increase significantly with frequency and simulation fidelity, and such solvers benefit greatly from parallelization techniques. Graphics processing units (GPUs) are throughput-oriented processing devices that are well suited for the mathematically intensive workloads found in CEM solvers. Current GPUs contain hundreds of processing units, leverage thousands of threads, and can execute over one trillion floating-point operations per second. A hybrid CPU and GPU parallelization approach has been developed for Savant, providing significant speedups compared to CPU-only implementations. Results from the execution of GPU-accelerated Savant on multiple case studies will be presented.

(T. Courtney, J. E. Stone and R. Kipp, “Using GPUs to Accelerate installed antenna performance simulations,” Proc. Allerton Antenna Symposium, Sept. 2011, Monticello, IL. [PDF])

CFP: High Performance Graphics 2012

January 6th, 2012

High Performance Graphics is the leading international forum for performance-oriented graphics systems research including innovative algorithms, efficient implementations, and hardware architecture. The conference brings together researchers, engineers, and architects to discuss the complex interactions of massively parallel hardware, novel programming models, efficient graphics algorithms, and novel applications. High Performance Graphics was founded in 2009 to synthesize and broaden on two important and well-respected conferences in computer graphics: Graphics Hardware and Interactive Ray Tracing.

HPG 2012 is co-sponsored by Eurographics and ACM SIGGRAPH and will take place on June 25-27, co-located with the Eurographics Symposium on Rendering in Paris, France. We invite original and innovative performance-oriented contributions from all areas of graphics, including hardware architectures, rendering, physics, animation, simulation, and data structures, with topics including (but not limited to): Interactive rendering pipelines (hardware or software); Interactive rendering algorithms (hardware or software); Graphics hardware and systems; Languages and compilation; Parallel computing for graphics; and Mobile graphics. Please see the conference website for the full CFP.

CfP: High Performance Simulation of Biological Systems

January 4th, 2012

This workshop is organized by Horacio Pérez-Sánchez and José M. Cecilia and takes place in conjunction with the International Conference on Modeling & Applied Simulation (MAS 2012). The goal is to explore the use of emerging parallel computing architectures as well as High Performance Computing systems (Supercomputers, Clusters, Grids) for the simulation of relevant biological systems. We welcome papers, not submitted elsewhere for review, on topics of interest including, but not limited to:

  • Parallel stochastic simulation
  • Biological and numerical parallel computing
  • Parallel and distributed architectures
  • Emerging processing architectures (e.g. GPUs, FPGAs, mixed CPU-GPU or CPU-FPGA)
  • Parallel model checking techniques
  • Parallel algorithms for biological analysis
  • Cluster and Grid deployment for systems biology
  • Tools and applications
  • Biologically inspired algorithms

More details, including dates, deadlines and submission instructions, are available on the workshop web page.

HOOMD-blue 0.10.0 release

December 19th, 2011

HOOMD-blue performs general-purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many cores on a fast cluster. Flexible and configurable, HOOMD-blue is currently being used for coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants, dissipative particle dynamics simulations (DPD) of polymers, and crystallization of metals.

HOOMD-blue 0.10.0 adds many new features.

On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures

December 14th, 2011


In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P).

Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.

(Pennycook, S.J., Hammond, S.D., Mudalige, G.R., Wright, S.A. and Jarvis, S.A.: “On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures”, The Computer Journal (in press) [DOI] [PREPRINT])
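
The dependency pattern behind a wavefront computation can be sketched in a few lines of Python. In this toy 2D version, each cell depends on its north and west neighbours, so the grid is processed one anti-diagonal at a time and every cell on a diagonal can be updated independently. The LU benchmark discussed above is three-dimensional and distributed; the function names, the 2D grid and the path-counting update rule here are assumptions picked only to keep the example short and checkable.

```python
def wavefront_sweep(n, m, update):
    """Fill an n x m grid where cell (i, j) depends on (i-1, j) and (i, j-1)."""
    grid = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                  # one anti-diagonal per step
        i_lo, i_hi = max(0, d - m + 1), min(n - 1, d)
        for i in range(i_lo, i_hi + 1):         # cells on a diagonal are independent
            j = d - i
            north = grid[i - 1][j] if i > 0 else 0
            west = grid[i][j - 1] if j > 0 else 0
            grid[i][j] = update(i, j, north, west)
    return grid


def count_paths(i, j, north, west):
    """Hypothetical update rule: number of monotone lattice paths to (i, j)."""
    return 1 if i == 0 and j == 0 else north + west


grid = wavefront_sweep(4, 4, count_paths)
print(grid[3][3])  # prints 20
```

The inner loop is the part that could run in parallel. Its length grows from 1 to min(n, m) and shrinks again, so the available parallelism is low at the start and end of a sweep, which limits how well this class of algorithm can keep a wide device busy.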
