VexCL is a modern C++ library designed to simplify GPGPU development. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides a convenient and intuitive notation for vector arithmetic, reductions, sparse matrix-vector multiplication, and more. The source code is available under the permissive MIT license. As of v1.0.0, VexCL provides two backends, OpenCL and CUDA; users choose one at compile time with a preprocessor macro definition. More information is available at the GitHub project page and release notes page.
The Thrust team is pleased to announce the release of Thrust v1.7, an open-source C++ library for developing high-performance parallel applications. Modeled after the C++ Standard Template Library, Thrust brings a familiar abstraction layer to the realm of parallel computing.
Thrust 1.7.0 introduces a new interface for controlling algorithm execution, as well as several new algorithms and performance improvements. With this new interface, users can directly control how algorithms execute, including details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. For 32-bit types, the new CUDA merge and set operations are 2-15x faster, while a new CUDA comparison sort is 1.3-4x faster.
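To make the idea of a key/value merge concrete: given two sequences already sorted by key, the output is ordered by key alone and each value is carried along with its key. The sketch below expresses that semantics on the CPU with the standard library's `std::merge` and a key-only comparator. It is not Thrust code (Thrust's `thrust::merge_by_key` takes keys and values as separate ranges, while this sketch stores them as pairs for brevity), and `merge_by_key_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: merge two sequences of (key, value) pairs that are
// already sorted by key. Only keys are compared; each value travels with its
// key. std::merge is stable, so for equal keys the element from `a` comes
// before the element from `b`.
template <typename K, typename V>
std::vector<std::pair<K, V>> merge_by_key_sketch(
    const std::vector<std::pair<K, V>>& a,
    const std::vector<std::pair<K, V>>& b) {
    std::vector<std::pair<K, V>> out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out),
               [](const std::pair<K, V>& x, const std::pair<K, V>& y) {
                   return x.first < y.first;  // compare keys only
               });
    return out;
}
```

Merging {(1,a), (3,b), (5,c)} with {(2,x), (3,y), (6,z)} yields keys 1, 2, 3, 3, 5, 6 with values a, x, b, y, c, z; the tie on key 3 keeps the first input's value first.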
Thrust is open-source software distributed under the OSI-approved Apache License 2.0.
PARALUTION is a library for sparse iterative methods with a special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to exploit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface that allow seamless integration with other scientific software packages. The library is open source and released under the GPL. Key features are:
- OpenMP, CUDA and OpenCL support
- No special hardware/library requirement
- Portable code and results across all hardware
- Many sparse matrix formats
- Various iterative solvers/preconditioners
- Generic and robust design
- Plug-in for the finite element package Deal.II
- Documentation: user manual (pdf), reports, doxygen
More information, including documentation and case studies, is available at http://www.paralution.com.
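As a rough illustration of what "iterative solver plus preconditioner" involves, here is a minimal serial sketch of the conjugate gradient method with a Jacobi (diagonal) preconditioner on a matrix stored in CSR, one common sparse format. It assumes a symmetric positive definite matrix whose diagonal entries are stored explicitly. The names `CsrMatrix`, `spmv`, `dot`, and `pcg_jacobi` are hypothetical; this is not PARALUTION's API, and it runs on a single CPU core.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical compressed sparse row (CSR) matrix, square, n x n.
struct CsrMatrix {
    std::size_t n;
    std::vector<std::size_t> row_ptr;  // size n + 1
    std::vector<std::size_t> col_idx;  // column of each nonzero
    std::vector<double> val;           // value of each nonzero
};

// y = A * x
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi-preconditioned CG for A x = b (A assumed SPD); x holds the initial
// guess on entry and the solution on exit. Returns the iterations used.
int pcg_jacobi(const CsrMatrix& A, const std::vector<double>& b,
               std::vector<double>& x, double tol = 1e-10, int max_iter = 1000) {
    const std::size_t n = A.n;
    std::vector<double> inv_diag(n, 1.0), r(n), z(n), p(n), Ap(n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (A.col_idx[k] == i) inv_diag[i] = 1.0 / A.val[k];

    spmv(A, x, Ap);
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = b[i] - Ap[i];
        z[i] = inv_diag[i] * r[i];  // apply preconditioner: z = M^{-1} r
        p[i] = z[i];
    }
    double rz = dot(r, z);
    const double b_norm = std::sqrt(dot(b, b));
    for (int it = 0; it < max_iter; ++it) {
        if (std::sqrt(dot(r, r)) <= tol * b_norm) return it;  // converged
        spmv(A, p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            z[i] = inv_diag[i] * r[i];
        }
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return max_iter;
}
```

For the 3-by-3 one-dimensional Laplacian tridiag(-1, 2, -1) with b = (1, 0, 1), the iteration converges to x = (1, 1, 1). A production library would swap in other formats, preconditioners, and parallel backends behind the same solver loop.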
rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (there is no need to modify the application source code or recompile the program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This release allows existing GPU-accelerated applications to use remote GPUs within a cluster, by sharing and/or aggregating GPUs, with negligible overhead. The new version is available free of charge at www.rCUDA.net, along with examples, manuals and additional information.
The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:
- Two computing backends in addition to OpenCL: CUDA and OpenMP
- Improved performance for (Block-) ILU0/ILUT preconditioners
- Optional level scheduling for ILU substitutions on GPUs
- Mixed-precision CG solver
- Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)
Any contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available at http://viennacl.sourceforge.net.
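Level scheduling, mentioned above for ILU substitutions, is a general technique for exposing parallelism in sparse triangular solves: rows are grouped into levels so that every row in a level depends only on rows in earlier levels. The serial C++ sketch below builds the levels for a lower-triangular CSR matrix and then performs forward substitution level by level. It illustrates the technique in general and is not ViennaCL's implementation; `CsrMatrix`, `build_levels`, and `solve_lower_by_levels` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for a square lower-triangular matrix with the
// diagonal stored explicitly.
struct CsrMatrix {
    std::size_t n;
    std::vector<std::size_t> row_ptr;  // size n + 1
    std::vector<std::size_t> col_idx;
    std::vector<double> val;
};

// A row's level is one more than the highest level among the earlier rows it
// depends on, so rows that share a level are mutually independent.
std::vector<std::vector<std::size_t>> build_levels(const CsrMatrix& L) {
    std::vector<std::size_t> level(L.n, 0);
    std::size_t max_level = 0;
    for (std::size_t i = 0; i < L.n; ++i) {
        for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            const std::size_t j = L.col_idx[k];
            if (j < i) level[i] = std::max(level[i], level[j] + 1);
        }
        max_level = std::max(max_level, level[i]);
    }
    std::vector<std::vector<std::size_t>> levels(L.n == 0 ? 0 : max_level + 1);
    for (std::size_t i = 0; i < L.n; ++i) levels[level[i]].push_back(i);
    return levels;
}

// Forward substitution L x = b, one level at a time. The loop over the rows
// of one level has no cross-row dependencies, so that is the loop a GPU could
// run in parallel (e.g. one thread per row); here it simply runs serially.
std::vector<double> solve_lower_by_levels(
    const CsrMatrix& L, const std::vector<double>& b,
    const std::vector<std::vector<std::size_t>>& levels) {
    std::vector<double> x(L.n, 0.0);
    for (const auto& rows : levels) {
        for (std::size_t i : rows) {
            double sum = b[i];
            double diag = 1.0;
            for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
                const std::size_t j = L.col_idx[k];
                if (j < i) sum -= L.val[k] * x[j];
                else if (j == i) diag = L.val[k];
            }
            x[i] = sum / diag;
        }
    }
    return x;
}
```

The fewer levels a matrix has, the more rows can be solved concurrently in each step, which is why the option pays off mainly for matrices with short dependency chains.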
Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with universally reasonable performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
(Duane Merrill, Michael Garland and Andrew Grimshaw, “Policy-based Tuning for Performance Portability and Library Co-optimization”, Innovative Parallel Computing 2012.)
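The policy-based idea can be sketched in plain C++: a reusable component is written once as a template over a tuning policy, and the policy's compile-time constants (block width, items per thread) are chosen per target. In the hypothetical sketch below, `ReducePolicy`, `BlockReduce`, and `tiled_reduce` are illustrative names, not the authors' code, and GPU threads are simulated serially on the CPU. Each "thread" first folds `ITEMS_PER_THREAD` inputs sequentially (granularity coarsening); only the final tree combine involves communication, and its step count depends on the block width rather than on the problem size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical tuning policy: compile-time knobs that could be tuned per
// processor, problem size and data type.
template <int BLOCK_THREADS_, int ITEMS_PER_THREAD_>
struct ReducePolicy {
    static constexpr int BLOCK_THREADS = BLOCK_THREADS_;
    static constexpr int ITEMS_PER_THREAD = ITEMS_PER_THREAD_;
    static constexpr int TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD;
};

// Reusable block-level reduction component, specialized by a policy.
// The "threads" of a block are simulated serially.
template <typename Policy, typename T, typename Op>
struct BlockReduce {
    static_assert((Policy::BLOCK_THREADS & (Policy::BLOCK_THREADS - 1)) == 0,
                  "tree reduction below assumes a power-of-two block width");

    static T reduce_tile(const T* tile, int valid_items, T identity, Op op) {
        T partial[Policy::BLOCK_THREADS];
        // Phase 1 (coarsened): each thread folds ITEMS_PER_THREAD inputs
        // sequentially, with no communication between threads.
        for (int t = 0; t < Policy::BLOCK_THREADS; ++t) {
            T acc = identity;
            for (int i = 0; i < Policy::ITEMS_PER_THREAD; ++i) {
                const int idx = i * Policy::BLOCK_THREADS + t;  // strided layout
                if (idx < valid_items) acc = op(acc, tile[idx]);
            }
            partial[t] = acc;
        }
        // Phase 2 (communication): log2(BLOCK_THREADS) tree steps, whose
        // count depends on the block width, not on ITEMS_PER_THREAD.
        for (int stride = Policy::BLOCK_THREADS / 2; stride > 0; stride /= 2)
            for (int t = 0; t < stride; ++t)
                partial[t] = op(partial[t], partial[t + stride]);
        return partial[0];
    }
};

// Whole-array reduction built from the block component, one tile at a time.
template <typename Policy, typename T, typename Op>
T tiled_reduce(const std::vector<T>& data, T identity, Op op) {
    T total = identity;
    for (std::size_t base = 0; base < data.size(); base += Policy::TILE_ITEMS) {
        const int valid = static_cast<int>(
            std::min<std::size_t>(Policy::TILE_ITEMS, data.size() - base));
        total = op(total, BlockReduce<Policy, T, Op>::reduce_tile(
                              data.data() + base, valid, identity, op));
    }
    return total;
}
```

Swapping `ReducePolicy<128, 4>` for `ReducePolicy<32, 16>` changes only the tuning, not the component; both sum the integers 1 to 1000 to 500500.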
SnuCL is a freely available, open-source OpenCL framework developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand switches. The host node contains multiple CPU cores, and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL provides the programmer with the illusion of a single heterogeneous system. A GPU or a set of CPU cores becomes an OpenCL compute device. SnuCL allows the application to use compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.
SnuCL consists of the SnuCL runtime and the SnuCL compiler. The SnuCL compiler is based on the OpenCL C compiler in the SNU-SAMSUNG OpenCL framework. Currently, it supports x86, ARM, and PowerPC CPUs, as well as AMD and NVIDIA GPUs.
VexCL is a vector expression template library for OpenCL developed by the Supercomputer Center of the Russian Academy of Sciences. It has been created for ease of C++-based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under the MIT license.
- Selection and initialization of compute devices according to an extensible set of device filters.
- Transparent allocation of device vectors spanning multiple devices.
- Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices.
- Appropriate kernels for vector expressions are generated automatically the first time an expression is used.
Doxygen-generated documentation is available at http://ddemidov.github.com/vexcl/index.html.
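The expression-template technique behind notation like `x = 2 * y + z` can be shown in a CPU-only toy: building `b + 2.0 * c` creates lightweight expression nodes instead of temporary vectors, and the whole expression is evaluated in one fused loop on assignment. VexCL applies the same idea to `vex::vector` objects but generates an OpenCL kernel from the expression; the toy below only loops on the host, and `Expr`, `Vec`, `Sum`, and `Scaled` are hypothetical names, not VexCL's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base marking a type as a vector expression (hypothetical names).
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Concrete vector that owns its data.
class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning any expression runs one fused loop; no temporary vectors.
    // Operand sizes are assumed equal (not checked in this toy).
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

// Node for l + r. It stores references, which stay valid until the end of
// the full expression; that is enough for "a = b + 2.0 * c;".
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

// Node for s * e, computed element by element on demand.
template <typename E>
struct Scaled : Expr<Scaled<E>> {
    double s;
    const E& e;
    Scaled(double s_, const E& e_) : s(s_), e(e_) {}
    double operator[](std::size_t i) const { return s * e[i]; }
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

template <typename E>
Scaled<E> operator*(double s, const Expr<E>& e) {
    return Scaled<E>(s, e.self());
}
```

With `Vec a(4), b(4, 1.0), c(4, 3.0);`, the statement `a = b + 2.0 * c;` fills `a` with 7.0 in a single loop. The expression's type (`Sum<Vec, Scaled<Vec>>`) encodes its structure at compile time, which is the information a library can use to generate a matching kernel the first time the expression is used.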
TunaCode has released CUVILib v1.2, a library to accelerate imaging and computer vision applications. CUVILib adds acceleration to imaging applications from the medical, industrial and defense domains. It delivers very high performance and supports both CUDA and OpenCL. Modules include color operations (demosaicing, conversions, correction, etc.), linear/non-linear filtering, feature extraction and tracking, motion estimation, image transforms and image statistics.
More information, including a free trial version: http://www.cuvilib.com/
The rCUDA Team is proud to announce a new version of the rCUDA framework, which will include many new features as well as improved performance. This new version, in development for over a year, will incorporate pipelined transfers, full multi-thread and multi-node capabilities, CUDA 4.1 support, global scheduler integration, support for CUDA C extensions, and native InfiniBand support. A closed beta testing program has been started. See the complete announcement at http://www.rcuda.net/index.php/news/19-new-revolutionary-version-of-rcuda-to-be-launched.html.