PyViennaCL: Python wrapper for GPU-accelerated linear algebra

February 26th, 2014

The new free open-source PyViennaCL 1.0.0 release provides the Python bindings for the ViennaCL linear algebra and numerical computation library for GPGPU and heterogeneous systems. ViennaCL itself is a header-only C++ library, so these bindings make available to Python programmers ViennaCL’s fast OpenCL and CUDA algorithms, in a way that is idiomatic and compatible with the Python community’s most popular scientific packages, NumPy and SciPy. Support through the Google Summer of Code 2013 for the primary developer Toby St Clere Smithe is greatly appreciated.

More information and download: PyViennaCL Home

Linear Algebra Library ViennaCL 1.5.0 released

December 23rd, 2013

The latest release 1.5.0 of the free open source linear algebra library ViennaCL is now available for download. The library provides a high-level C++ API similar to Boost.ublas and aims at providing the performance of accelerators at a high level of convenience without having to deal with hardware details. Some of the highlights from the ChangeLog are as follows: Vectors and matrices of integers are now supported, multiple OpenCL contexts can be used in a fully multi-threaded manner, products of sparse and dense matrices are now available, and certain BLAS functionality is also provided through a shared library for use with programming languages other than C++, e.g. C, Fortran, or Python.

VexCL 1.0.0 released with CUDA support

November 20th, 2013

VexCL is a modern C++ library created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides a convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector multiplication, etc. The source code is available under the permissive MIT license. As of v1.0.0, VexCL provides two backends: OpenCL and CUDA. Users may choose either of those at compile time with a preprocessor macro definition. More information is available at the GitHub project page and release notes page.

Thrust v1.7 Released

July 4th, 2013

The Thrust team is pleased to announce the release of Thrust v1.7, an open-source C++ library for developing high-performance parallel applications. Modeled after the C++ Standard Template Library, Thrust brings a familiar abstraction layer to the realm of parallel computing

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well stencil versions of partitioning algorithms. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance.

Thrust is open-source software distributed under the OSI-approved Apache License 2.0.

PARALUTION – A fast, user-friendly library for sparse iterative methods on CPUs and GPUs

February 25th, 2013

PARALUTION is a library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to expolit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under GPL. Key features are:

  • OpenMP, CUDA and OpenCL support
  • No special hardware/library requirement
  • Portable code and results across all hardware
  • Many sparse matrix formats
  • Various iterative solvers/preconditioners
  • Generic and robust design
  • Plug-in for the finite element package Deal.II
  • Documentation: user manual (pdf), reports, doxygen

More information, including documentation and case studies, is available at

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows to execute existing GPU-accelerated applications by leveraging remote GPUs within a cluster (both via sharing and/or aggregating GPUs) with a negligible overhead. The new version is available free of charge ar, along with examples, manuals and additional information.

ViennaCL 1.4.0 with CUDA, OpenCL and OpenMP support

December 3rd, 2012

The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:

  • Two computing backends in addition to OpenCL: CUDA and OpenMP
  • Improved performance for (Block-) ILU0/ILUT preconditioners
  • Optional level scheduling for ILU substitutions on GPUs
  • Mixed-precision CG solver
  • Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)

Any contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available at

Policy-based Tuning for Performance Portability and Library Co-optimization

July 22nd, 2012


Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.

(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])

SnuCL – OpenCL heterogeneous cluster computing

June 27th, 2012

SnuCL is an OpenCL framework and freely available, open-source software developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes. They are connected by an interconnection network, such as Gigabit and InfiniBand switches. The host node contains multiple CPU cores and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL provides an illusion of a single heterogeneous system for the programmer. A GPU or a set of CPU cores becomes an OpenCL compute device. SnuCL allows the application to utilize compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.

SnuCL consists of SnuCL runtime and compiler. The SnuCL compiler is based on the OpenCL C compiler in SNU-SAMSUNG OpenCL framework. Currently, the SnuCL compiler supports x86, ARM, and PowerPC CPUs, AMD GPUs, and NVIDIA GPUs.

VexCL: Vector expression template library for OpenCL

May 30th, 2012

VexCL is vector expression template library for OpenCL developed by the Supercomputer Center of Russian academy of sciences. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under MIT license.

Main features:

  • Selection and initialization of compute devices according to extensible set of device filters.
  • Transparent allocation of device vectors spanning multiple devices.
  • Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices.
  • Appropriate kernels for vector expressions are generated automatically first time an expression is used.

Doxygen-generated documentation is available at

Page 1 of 612345...Last »