OpenNL 3.0: CUDA sparse linear solvers

February 14th, 2010

OpenNL (Open Numerical Library) is a library for solving sparse linear systems, especially designed for the Computer Graphics community. The goal of OpenNL is to be as small as possible, while offering the subset of functionalities required by this application field. The Makefiles of OpenNL can generate a single .c and .h file that make it very easy to integrate into other projects. The distribution includes an implementation of a Least Squares Conformal Maps parameterization method.  The new version 3.0 of OpenNL includes support for CUDA (with Concurrent Number Cruncher and CUSP ELL formats).

Harnessing Graphics Processors for the Fast Computation of Acoustic Likelihoods in Speech Recognition

February 10th, 2010


In large vocabulary continuous speech recognition (LVCSR) the acoustic model computations often account for the largest processing overhead. Our weighted finite state transducer (WFST) based decoding engine can utilize a commodity graphics processing unit (GPU) to perform the acoustic computations to move this burden off the main processor. In this paper we describe our new GPU scheme that can achieve a very substantial improvement in recognition speed whilst incurring no reduction in recognition accuracy. We evaluate the GPU technique on a large vocabulary spontaneous speech recognition task using a set of acoustic models with varying complexity and the results consistently show by using the GPU it is possible to reduce the recognition time with largest improvements occurring in systems with large numbers of Gaussians. For the systems which achieve the best accuracy we obtained between 2.5 and 3 times speed-ups. The faster decoding times translate to reductions in space, power and hardware costs by only requiring standard hardware that is already widely installed.

(Paul R. Dixon, Tasuku Oonishi, Sadaoki Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition”, Computer Speech & Language, Volume 23, Issue 4, October 2009, Pages 510-526, ISSN 0885-2308, DOI: 10.1016/j.csl.2009.03.005)

CfP: GPU Computing Gems

February 9th, 2010

NVIDIA and Editor-in-Chief  Professor Wen-mei Hwu of the University of Illinois, Urbana-Champaign invite you to submit articles for GPU Computing Gems, a contribution-based book that will focus on practical techniques for GPU computing. This is a continuation of the popular GPU Gems series.

The full Call for Participation is available here.

CfP: High performance computational systems Biology

February 8th, 2010

The HiBi workshop establishes a forum to link researchers in the areas of parallel computing and computational systems biology. One of the main limitations in managing models of biological systems comes from the fundamental difference between the high parallelism evident in biochemical reactions and the sequential environments employed for the analysis of these reactions. Such limitations affect all varieties of continuous, deterministic, discrete and stochastic models; undermining the applicability of simulation techniques and analysis of biological models. The goal of HiBi is therefore to bring together researchers in the fields of high performance computing and computational systems biology. Experts from around the world will present their current work, discuss
profound challenges, new ideas, results, applications and their experience relating to key aspects of high performance computing in biology.

Topics of interest include, but are not limited to:

  • Parallel stochastic simulation
  • Biological and Numerical parallel computing
  • Parallel and distributed architectures
  • Emerging processing architecture: Cell processors, GPUs, mixed CPU-FPGA, etc.
  • Parallel model checking techniques
  • Parallel parameter estimation
  • Parallel algorithms for biological analysis
  • Application of concurrency theory to biology
  • Parallel visualization algorithms
  • Web-services and Internet computing for e-Science
  • Tools and applications

More Information:

CfP: Symposium on chemical computations on GP-GPUs

February 8th, 2010

The symposium will provide technical presentations from the companies advancing the development of GPUs, discussions of the challenges involved in effectively programming GPUs, and presentations on the use of GPUs in a range of chemical applications.

The deadline for submissions is 04/05/2010, and more information can be found at

CfP: High Performance Graphics 2010

February 7th, 2010

High-Performance Graphics 2010 continues last year’s success at synthesizing two important and cutting-edge topics in computer graphics, the previous Graphics Hardware and Interactive Ray Tracing conferences. The scope of the conference is the overarching field of performance-oriented graphics systems, covering innovative algorithms, efficient implementations, and hardware architecture. This broader focus offers a common forum bringing together researchers, engineers, and architects to discuss the complex interactions of massively parallel hardware, novel programming models, efficient graphics algorithms, and innovative applications.

The program features three days of paper and industry presentations, with ample time for discussions during breaks, lunches, and the  conference banquet. The conference, which will take place on June 25-27, is co-located with Eurographics Rendering Symposium on the campus of the Max-Planck  Institut Informatik, Saarland University, Saarbrucken, Germany.

Original and innovative performance-oriented contributions are invited from all areas of graphics, including hardware architectures, rendering, physics, animation, AI, simulation, data structures, with topics including (but not limited to):

  • New graphics hardware architectures
  • Rendering architectures and algorithms
  • Parallel computing for graphics (including GPU Computing)
  • Algorithmic foundations
  • Languages and compilation

The conference website with additional information is located at

Triangular matrix inversion on Graphics Processing Unit

February 6th, 2010


Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.

(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)

HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

February 2nd, 2010


We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI’s libraries, we achieve a two-fold speedup over straight forward C++ code using HONEI’s SSE backend, and additional 3–4 and 4–16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.

(Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock, Dominik Göddeke and Carsten Gutwenger: HONEI: A collection of libraries for numerical computations targeting multiple processor architectures. Computer Physics Communications 180(12), pp. 2534-2543, December 2009. DOI 10.1016/j.cpc.2009.04.018)

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

February 2nd, 2010


As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: 10.1109/SASP.2009.5226333)

Molecular Workshop Series at Stanford

February 2nd, 2010

Molecular Workshop Series – Running and Developing MD Algorithms on GPUs with OpenMM and PyOpenMM + Intro to MD and Trajectory Analysis

Simbios is excited to announce its upcoming Molecular Dynamics (MD) Workshop Series, highlighting new capabilities within the recently released OpenMM 1.0 and introducing PyOpenMM for rapid MD code development with high performance:

Day 1: Running and Developing MD Algorithms on GPUs with OpenMM
Day 2: Introduction to MD and Trajectory Analysis with Markov State Models

When: March 1-2, 2010 (sign up for one or two days)
Where: Stanford University

Registration is free but required and spaces are limited. Please visit for the workshop agenda and to register.

Page 28 of 58« First...1020...2627282930...4050...Last »