CUDPP 2.2 released

September 4th, 2014

CUDPP release 2.2 is a feature release that adds a new parallel primitive and improves some existing primitives. We have added cudppSuffixArray, a parallel skew algorithm (SA) implementation that computes the suffix array of a string. This suffix array primitive is now used in burrowsWheelerTransform, delivering better performance than CUDPP 2.1’s use of cudppStringSort. The new BWT is in turn used in cudppCompress, which is now faster than the original parallel compression and supports compression of text containing all possible unsigned char values. Some bugs in cudppMoveToFrontTransform and cudppStringSort have also been fixed. OS X users might also be interested in how we added support for the clang compiler on OS X Mavericks (10.9).
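To illustrate why a suffix array is a convenient building block for the Burrows-Wheeler transform, here is a minimal serial sketch in Python. It is not CUDPP code: the function names are hypothetical, the suffix array is built by naive sorting rather than by the skew algorithm, and the appended `$` sentinel is an assumption made for simplicity (the release notes do not say how CUDPP handles the end of the string).

```python
def suffix_array(text):
    """Return the start indices of all suffixes of text, in sorted suffix order.

    Naive construction by sorting the suffixes directly; this is for
    illustration only and is not the skew algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sentinel="$"):
    """Compute the Burrows-Wheeler transform of text via its suffix array.

    Assumption: a sentinel that does not occur in text is appended, so that
    sorting suffixes is equivalent to sorting rotations.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # The BWT output is the character that precedes each sorted suffix.
    # For the suffix starting at index 0, s[-1] wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array(s))


print(bwt_from_suffix_array("banana"))
```

Once the suffix array is known, the transform is a single gather over the input, which is why a faster suffix array primitive translates directly into a faster BWT.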

Boost.Compute v0.3 Released

July 21st, 2014

Boost.Compute is a header-only C++ library for GPGPU and parallel computing based on OpenCL. It provides a low-level C++ wrapper over OpenCL and a high-level STL-like API with containers and algorithms for the GPU. It is available on GitHub and instructions for getting started can be found in the documentation. See the full announcement here:

BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs

May 27th, 2014


Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm³ brain template in 4–6 s, and run a second-level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from GitHub:

(A. Eklund, P. Dufort, M. Villani and S. LaConte: “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”. Front. Neuroinform. 8:24, 2014. [DOI])
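For readers unfamiliar with permutation tests, the following Python sketch shows the basic idea on a single scalar measure for two groups. It is a generic illustration, not BROCCOLI's implementation: the function name, the mean-difference test statistic and the fixed random seed are all assumptions, and a real fMRI analysis repeats this kind of computation across a whole brain volume, which is where the computational burden comes from.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    The null distribution is estimated by repeatedly shuffling the pooled
    values and recomputing the statistic for the relabelled groups.
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        statistic = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if statistic >= observed:
            at_least_as_extreme += 1

    # Count the observed labelling as one permutation so p is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


p = permutation_p_value([10, 11, 12, 13, 14, 15], [0, 1, 2, 3, 4, 5])
print(p)
```

Each permutation is independent of the others, so the loop parallelizes naturally, which is one reason this kind of test suits GPUs.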

PARALUTION 0.7.0 released

May 27th, 2014

PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:

  • Windows support – full Windows support for all backends (CUDA, OpenCL, OpenMP)
  • Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
  • Direct (dense) solvers (for very small problems)
  • (Restricted) Additive Schwarz preconditioners
  • MATLAB/Octave plug-in

To avoid OpenMP overhead for small problems, the library computes serially if the size of the matrix or vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified to simplify cross-platform compilation.

Boost.Compute v0.2 Released

May 15th, 2014

Boost.Compute v0.2 has been released! Boost.Compute is a header-only C++ library for GPGPU and parallel computing based on OpenCL. It is available on GitHub and instructions for getting started can be found in the documentation. Since version 0.1 (released almost two months ago), new algorithms including unique(), search() and find_end() have been added, along with several bug fixes. See the project page on GitHub for more information:

PyViennaCL: Python wrapper for GPU-accelerated linear algebra

February 26th, 2014

The new free open-source PyViennaCL 1.0.0 release provides Python bindings for the ViennaCL linear algebra and numerical computation library for GPGPU and heterogeneous systems. ViennaCL itself is a header-only C++ library, so these bindings make ViennaCL’s fast OpenCL and CUDA algorithms available to Python programmers in a way that is idiomatic and compatible with the Python community’s most popular scientific packages, NumPy and SciPy. Support through the Google Summer of Code 2013 for the primary developer Toby St Clere Smithe is greatly appreciated.

More information and download: PyViennaCL Home

Linear Algebra Library ViennaCL 1.5.0 released

December 23rd, 2013

The latest release 1.5.0 of the free open-source linear algebra library ViennaCL is now available for download. The library provides a high-level C++ API similar to Boost.ublas and aims at providing the performance of accelerators at a high level of convenience without having to deal with hardware details. Some of the highlights from the ChangeLog are as follows: vectors and matrices of integers are now supported, multiple OpenCL contexts can be used in a fully multi-threaded manner, products of sparse and dense matrices are now available, and certain BLAS functionality is also provided through a shared library for use with programming languages other than C++, e.g. C, Fortran, or Python.

VexCL 1.0.0 released with CUDA support

November 20th, 2013

VexCL is a modern C++ library created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides a convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector multiplication, etc. The source code is available under the permissive MIT license. As of v1.0.0, VexCL provides two backends: OpenCL and CUDA. Users may choose either of those at compile time with a preprocessor macro definition. More information is available at the GitHub project page and release notes page.

Thrust v1.7 Released

July 4th, 2013

The Thrust team is pleased to announce the release of Thrust v1.7, an open-source C++ library for developing high-performance parallel applications. Modeled after the C++ Standard Template Library, Thrust brings a familiar abstraction layer to the realm of parallel computing.

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance, while a new CUDA comparison sort provides 1.3-4x faster performance.
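As a rough illustration of what a key/value merge computes, the serial Python sketch below merges two key-sorted sequences while carrying each key's associated value along. It is not Thrust code and says nothing about how the CUDA implementation works; the function name mirrors the idea only, and taking ties from the first input is an assumption modeled on a stable merge.

```python
def merge_by_key(keys_a, values_a, keys_b, values_b):
    """Merge two key-sorted sequences, moving values together with their keys.

    Both key sequences must already be sorted in ascending order.
    On equal keys, the element from the first input is emitted first.
    """
    i = j = 0
    out_keys, out_values = [], []
    while i < len(keys_a) and j < len(keys_b):
        if keys_b[j] < keys_a[i]:
            out_keys.append(keys_b[j])
            out_values.append(values_b[j])
            j += 1
        else:
            out_keys.append(keys_a[i])
            out_values.append(values_a[i])
            i += 1
    # At most one of the inputs still has elements left; append them as-is.
    out_keys.extend(keys_a[i:])
    out_values.extend(values_a[i:])
    out_keys.extend(keys_b[j:])
    out_values.extend(values_b[j:])
    return out_keys, out_values


keys, values = merge_by_key([1, 3, 5], ["a", "b", "c"], [2, 3, 6], ["x", "y", "z"])
print(keys, values)
```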

Thrust is open-source software distributed under the OSI-approved Apache License 2.0.

PARALUTION – A fast, user-friendly library for sparse iterative methods on CPUs and GPUs

February 25th, 2013

PARALUTION is a library for sparse iterative methods with a special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to exploit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under the GPL. Key features are:

  • OpenMP, CUDA and OpenCL support
  • No special hardware/library requirement
  • Portable code and results across all hardware
  • Many sparse matrix formats
  • Various iterative solvers/preconditioners
  • Generic and robust design
  • Plug-in for the finite element package Deal.II
  • Documentation: user manual (pdf), reports, doxygen

More information, including documentation and case studies, is available at
