PARALUTION – A fast, user-friendly library for sparse iterative methods on CPUs and GPUs

February 25th, 2013

PARALUTION is a library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to expolit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under GPL. Key features are:

  • OpenMP, CUDA and OpenCL support
  • No special hardware/library requirement
  • Portable code and results across all hardware
  • Many sparse matrix formats
  • Various iterative solvers/preconditioners
  • Generic and robust design
  • Plug-in for the finite element package Deal.II
  • Documentation: user manual (pdf), reports, doxygen

More information, including documentation and case studies, is available at http://www.paralution.com.

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows to execute existing GPU-accelerated applications by leveraging remote GPUs within a cluster (both via sharing and/or aggregating GPUs) with a negligible overhead. The new version is available free of charge ar www.rCUDA.net, along with examples, manuals and additional information.

ViennaCL 1.4.0 with CUDA, OpenCL and OpenMP support

December 3rd, 2012

The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:

  • Two computing backends in addition to OpenCL: CUDA and OpenMP
  • Improved performance for (Block-) ILU0/ILUT preconditioners
  • Optional level scheduling for ILU substitutions on GPUs
  • Mixed-precision CG solver
  • Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)

Any contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available at http://viennacl.sourceforge.net.

Policy-based Tuning for Performance Portability and Library Co-optimization

July 22nd, 2012

Abstract:

Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.

(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012. [WWW])

SnuCL – OpenCL heterogeneous cluster computing

June 27th, 2012

SnuCL is an OpenCL framework and freely available, open-source software developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes. They are connected by an interconnection network, such as Gigabit and InfiniBand switches. The host node contains multiple CPU cores and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL provides an illusion of a single heterogeneous system for the programmer. A GPU or a set of CPU cores becomes an OpenCL compute device. SnuCL allows the application to utilize compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.

SnuCL consists of SnuCL runtime and compiler. The SnuCL compiler is based on the OpenCL C compiler in SNU-SAMSUNG OpenCL framework. Currently, the SnuCL compiler supports x86, ARM, and PowerPC CPUs, AMD GPUs, and NVIDIA GPUs.

VexCL: Vector expression template library for OpenCL

May 30th, 2012

VexCL is vector expression template library for OpenCL developed by the Supercomputer Center of Russian academy of sciences. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under MIT license.

Main features:

  • Selection and initialization of compute devices according to extensible set of device filters.
  • Transparent allocation of device vectors spanning multiple devices.
  • Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices.
  • Appropriate kernels for vector expressions are generated automatically first time an expression is used.

Doxygen-generated documentation is available at http://ddemidov.github.com/vexcl/index.html.

CUVILib v1.2 released

May 17th, 2012

TunaCode has released CUVILib v1.2, a library to accelerate imaging and computer vision applications. CUVILib adds acceleration to Imaging applications from Medical, Industrial and Defense domains. It delivers very high performance and supports both CUDA and OpenCL. Modules include color operations (demosaic, conversions, correction etc), linear/non-linear filtering, feature extraction & tracking, motion estimation, image transforms and image statistics.

More information, including a free trial version: http://www.cuvilib.com/

New rCUDA version beta testing

April 18th, 2012

The rCUDA Team is proud to announce a new version of the rCUDA framework which will include many new functionalities as well as boosted performance. This new version, cooked for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta teting program has been started. See the complete text at http://www.rcuda.net/index.php/news/19-new-revolutionary-version-of-rcuda-to-be-launched.html.

SpeedIT 2.0 released

February 24th, 2012

SpeedIT 2.0 and the SpeedIT plugin to OpenFOAM have been released. New features include:

  • One of the fastest Sparse Matrix Vector Multiplication worldwide.
  • Faster Conjugate Gradient and BiConjugate Gradient solvers.
  • State-of-the-art CMRS format for storing sparse matrices. The format requires less memory than CRS or HYB (from CUSPARSE and CUSP).
  • Faster acceleration in OpenFOAM (Computational Fluid Dynamics).

More information is available at http://speed-it.vratis.com.

New CLOGS library with sort and scan primitives for OpenCL

February 5th, 2012

CLOGS is a library for higher-level operations on top of the OpenCL C++ API. It is designed to integrate with other OpenCL code, including synchronization using OpenCL events. Currently only two operations are supported: radix sorting and exclusive scan. Radix sort supports all the unsigned integral types as keys, and all the built-in scalar and vector types suitable for storage in buffers as values. Scan supports all the integral types. It also supports vector types, which allows for limited multi-scan capabilities.

Version 1.0 of the library has just been released. The home page is http://clogs.sourceforge.net/

Page 1 of 512345