OpenFOAM SpeedIT plugin 1.1 released

November 27th, 2010

The OpenFOAM SpeedIT plugin version 1.1 has been released under the GPL License. The most important new features are:

  • Multi-GPU support
  • Tested on Fermi architecture (GTX460 and Tesla C2050)
  • Automated submission of the domain to the GPU cards (using decomposePar from OpenFOAM)
  • Optimized submission of computational tasks to the best GPU card in the system for any number of computational threads
  • Plugin picks the most powerful GPU card for single-thread cases

The OpenFOAM SpeedIT plugin is available at http://speedit.vratis.com.

A Fast GEMM Implementation on a Cypress GPU

October 12th, 2010

Abstract:

We present benchmark results of optimized dense matrix multiplication kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show 73% and 87% of the theoretical performance of the GPU, respectively. Currently, our SGEMM and DGEMM kernels are fastest with one GPU chip to our knowledge. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s. This performance in DDP is more than 200 times faster than the performance in DDP on single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with main focus on the SGEMM implementation since all GEMM kernels share common programming and optimization techniques. While a conventional wisdom of GPU programming recommends us to heavily use shared memory on GPUs, we show that texture cache is very effective on the Cypress architecture.

(N. Nakasato: “A Fast GEMM Implementation on a Cypress GPU”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10) November 2010. A sample program is available at http://github.com/dadeba/dgemm_cypress)
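To make the "double-double" format mentioned in the abstract concrete, here is a minimal sketch of double-double addition in plain Python. It assumes the common representation of a value as an unevaluated sum of two doubles (hi, lo) and uses Knuth's TwoSum; the function names are illustrative and are not taken from the paper or from mpack.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) where s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x, y):
    """Add two double-double numbers, each given as a (hi, lo) pair of floats."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    # Renormalize so that hi carries the leading bits and lo the remainder.
    hi = s + e
    lo = e - (hi - s)
    return hi, lo


# A plain double loses the tiny term; the double-double pair keeps it in lo.
total = dd_add((1.0, 0.0), (1e-20, 0.0))
residual = dd_add(total, (-1.0, 0.0))
```

Each double-double operation costs several ordinary floating-point operations, which is consistent with the DDP rate quoted in the abstract being far below the SP and DP rates.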

PyCULA: Python Bindings for CULA GPGPU LAPACK

September 30th, 2010

PyCULA, by Louis Theran and Garrett Wright of Temple University, is a module providing transparent PyCUDA- and ctypes-based Python bindings for CULAtools LAPACK. It supports mixing PyCUDA-style kernel code with CULA device functions and also has a complete set of ctypes wrappers for CULA.

Key Features Include:

  • Reduce Memory Leaks by using Automatic Memory Management (via PyCUDA)
  • Utilize both simple Numpy style and GPUArray manual device style interfaces.
  • Supports mixing LAPACK via CULA with your Custom Kernels.
  • Combine seamlessly with handy Python modules like SQL, gzip, SciPy, R, etc.
  • Develop, Debug, Optimize, and Get Help right at the interactive command line.

The PyCULA 0.9a4 alpha release is available at http://pypi.python.org/pypi/PyCULA/0.9a4. PyCULA was developed as part of the ASU/Temple Zeolite Project, which is supported by CDI-I grant DMR 0835586 to Igor Rivin and M. M. J. Treacy.

SMVM on GPU

July 29th, 2010

From the paper’s abstract:

A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.

From the paper’s conclusion:

We have introduced several efficient techniques to accelerate the execution of the sparse matrix vector multiplication (SMVM) on NVIDIA graphic processing units. The proposed methods increased the performance of the SMVM kernel on GT 8800 up to 18.8 times compared to the quad core CPU and 3 times compared to previous work by Bell and Garland on accelerating SMVM for GPUs.

(M. Mehri Dehnavi, D. Fernandez and D. Giannacopoulos: “Finite element sparse matrix vector multiplication on GPUs”. IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, August 2010. DOI 10.1109/TMAG.2010.2043511)
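For readers unfamiliar with the kernel, the following is a minimal reference sketch of SMVM using the widely used compressed sparse row (CSR) layout. It is a sequential, plain-Python illustration only and is not the algorithm proposed in the paper; the function and argument names are assumptions chosen for this example.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in values
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# The 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The rows are independent of each other, so GPU implementations typically assign each row to a thread or a group of threads. The irregular, indirect reads of x through col_idx are one reason the kernel tends to be limited by memory bandwidth.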

CULA 2.0 released

July 11th, 2010

EM Photonics announced today the general availability of CULA 2.0, its GPU-accelerated linear algebra library. The new version provides support for NVIDIA GPUs based on the latest “Fermi” architecture.

CULA contains a LAPACK interface comprising over 150 mathematical routines from LAPACK, the industry standard for computational linear algebra. EM Photonics’ CULA library includes many popular routines, including system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions. CULA offers performance up to an order of magnitude faster than highly optimized CPU-based linear algebra solvers. A variety of interfaces is available for integrating CULA directly into existing code: programmers can call GPU-accelerated CULA from their C/C++, FORTRAN, MATLAB, or Python codes without any GPU programming experience. CULA is available for every system equipped with GPUs based on the NVIDIA CUDA architecture. This includes 32- and 64-bit versions of Linux, Windows, and OS X.

More information is available at www.culatools.com.

ViennaCL: Linear Algebra on GPUs using OpenCL

June 15th, 2010

The Vienna Computing Library (ViennaCL) is a scientific computing library written in C++ and based on OpenCL. It allows simple, high-level access to the vast computing resources available on parallel architectures such as GPUs and is primarily focused on common linear algebra operations (BLAS level 1 and 2) and the solution of large systems of equations by means of iterative methods. The following iterative solvers are implemented:

  • Conjugate Gradient (CG)
  • Stabilized BiConjugate Gradient (BiCGStab)
  • Generalized Minimum Residual (GMRES)
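As an illustration of the first solver in the list, here is a minimal sketch of the unpreconditioned conjugate gradient method in plain Python with dense lists. It shows the textbook iteration for a symmetric positive definite matrix only; it does not use or reproduce ViennaCL's C++ interface, and all names in it are chosen for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (list of rows)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small SPD system whose exact solution is (1/11, 7/11).
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration consists of one matrix-vector product plus a few dot products and vector updates, which are exactly the BLAS level 1 and 2 operations the library focuses on.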


SpeedIT Toolkit 0.9.1 released

March 26th, 2010

The SpeedIT Toolkit provides a set of accelerated solvers for sparse linear systems of equations. Substantial speedups, e.g. more than an order of magnitude, are achieved on a single, reasonably priced CUDA-capable NVIDIA Graphics Processing Unit (GPU) by means of proprietary optimization techniques. The library can be used in a wide range of domains with underlying 2D and 3D geometry, such as computational fluid dynamics, electromagnetics, thermodynamics, materials, acoustics, computer vision and graphics, robotics, semiconductor devices and structural engineering. It can also be used for problems without a defined geometry, such as quantum chemistry, statistics, power networks and other graphs, and chemical process simulation. All computations are performed in single or double floating-point precision. Two versions of the SpeedIT Toolkit have been released: the classic version provides a conjugate gradient solver, and the extreme edition provides optimized CG and BiCGSTAB solvers, a diagonal preconditioner, memory management, and heuristic-based analysis of input matrices.

Triangular matrix inversion on Graphics Processing Unit

February 6th, 2010

Abstract:

Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.

(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)
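The divide-and-conquer idea can be sketched with the standard block identity for a lower-triangular matrix L = [[A, 0], [C, B]], whose inverse is [[A^-1, 0], [-B^-1 C A^-1, B^-1]]. The plain-Python version below is a sequential illustration of that textbook recursion under the assumption of a nonsingular input; it is not the authors' GPU implementation, and the function names are chosen for this example.

```python
def matmul(X, Y):
    """Dense product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def invert_lower(L):
    """Recursively invert a nonsingular lower-triangular matrix."""
    n = len(L)
    if n == 1:
        return [[1.0 / L[0][0]]]
    h = n // 2
    A = [row[:h] for row in L[:h]]   # top-left triangular block
    C = [row[:h] for row in L[h:]]   # bottom-left rectangular block
    B = [row[h:] for row in L[h:]]   # bottom-right triangular block
    A_inv = invert_lower(A)          # the two half-size inversions
    B_inv = invert_lower(B)          # do not depend on each other
    T = matmul(matmul(B_inv, C), A_inv)
    top = [A_inv[i] + [0.0] * (n - h) for i in range(h)]
    bottom = [[-t for t in T[i]] + B_inv[i] for i in range(n - h)]
    return top + bottom


L = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, 5.0, 8.0]]
L_inv = invert_lower(L)
product = matmul(L, L_inv)   # should be close to the identity matrix
```

The two recursive calls are independent and the remaining work is matrix multiplication, which is why this formulation maps well onto hardware that is fast at dense matrix products.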

CULAtools: GPU-accelerated LAPACK

August 23rd, 2009

EM Photonics has recently released a preview beta edition of their CULAtools, an implementation of LAPACK for CUDA-enabled GPUs. This version comprises single precision LU decomposition, QR factorization, singular value decomposition and least squares. The full library, scheduled for release at NVIDIA GTC ’09, will contain much more functionality and in particular single- and double-precision computations. Please refer to the website culatools.com for details, licenses and downloads.

MAGMA: LAPACK for GPUs and Multicore architectures

August 6th, 2009

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current “Multicore+GPU” systems.

The MAGMA research is based on the idea that, to address the complex challenges of the emerging hybrid environments, optimal software solutions will themselves have to be hybridized, combining the strengths of different algorithms within a single framework. Building on this idea, the MAGMA group aims to design linear algebra algorithms and frameworks for hybrid manycore and GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers.

MAGMA v0.1 runs on CUDA-capable GPUs and multicore CPUs, and is available now.
