MATLAB Adds GPU Support

October 13th, 2010

Michael Feldman of HPCWire writes:

MATLAB users with a taste for GPU computing now have a perfect reason to move up to the latest version. Release R2010b adds native GPGPU support that allows users to harness NVIDIA graphics processors for engineering and scientific computing. The new capability is provided within the Parallel Computing Toolbox and MATLAB Distributed Computing Server.

Full details of MATLAB Release R2010b are available on the MathWorks site. Information on other numerical packages accelerated using NVIDIA CUDA is available on NVIDIA’s site.

[Editor’s Note: as pointed out in the comments by John Melonakos (of AccelerEyes), it may be worth checking out how MATLAB R2010b GPU support currently compares to AccelerEyes Jacket.]

SpeedIT Toolkit 0.9.1 released

March 26th, 2010

The SpeedIT Tools library provides a set of accelerated solvers for sparse linear systems of equations. Substantial acceleration, often more than an order of magnitude, is achieved on a single, reasonably priced CUDA-capable NVIDIA Graphics Processing Unit (GPU) using proprietary optimization techniques. The library can be used in a wide spectrum of domains arising from problems with underlying 2D and 3D geometry, such as computational fluid dynamics, electromagnetics, thermodynamics, materials, acoustics, computer vision and graphics, robotics, semiconductor devices and structural engineering. It can also be used for problems without a defined geometry, such as quantum chemistry, statistics, power networks and other graphs, and chemical process simulation. All computations are performed in single- or double-precision floating point. Two versions of the SpeedIT toolkit have been released: the Classic version provides a conjugate gradient (CG) solver, while the Extreme edition provides optimized CG and BiCGSTAB solvers, a diagonal preconditioner, memory management, and heuristic-based analysis of input matrices.
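As background for the solvers named above, here is a textbook conjugate gradient iteration in plain NumPy/SciPy. It is an illustrative sketch of the algorithm only, not SpeedIT’s API; the toolkit’s point is to run such iterations, heavily optimized, on the GPU.

```python
import numpy as np
from scipy.sparse import diags

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Plain CG for a symmetric positive definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x              # initial residual
    p = r.copy()               # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: 1D Poisson matrix, a typical sparse SPD test case
n = 1000
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = conjugate_gradient(A, b)
```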

PASCO 2010: Call for Papers

March 9th, 2010

The International Workshop on Parallel and Symbolic Computation (PASCO) is a series of workshops dedicated to the promotion and advancement of parallel algorithms and software in all areas of symbolic mathematical computation. The ubiquity of parallel architectures and deep memory hierarchies has led to a new quest for parallel mathematical algorithms and software capable of exploiting all available levels of parallelism, from hardware acceleration technologies (multi-core and multi-processor systems-on-chip, GPGPUs, FPGAs) to cluster and global computing platforms. To push the limits of symbolic and algebraic computation beyond optimization of the application itself, the effective use of a large number of resources (memory as well as general-purpose or specialized computing units) is expected to improve performance against multiple criteria: time, energy consumption, resource usage, and reliability. In this context, the design and implementation of mathematical algorithms with provable and adaptive performance is a major challenge.

The workshop PASCO 2010 will be a three-day event including invited presentations and tutorials, contributed research papers and posters, and a programming contest. Specific topics include, but are not limited to, those listed in the full call for papers.

Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid

March 3rd, 2010

Abstract:

We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single-precision GPUs serving as accelerators to the general-purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: we demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother, we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, which previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.

(Dominik Göddeke and Robert Strzodka: “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid”, accepted in IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, March 2010. Link.)
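The core idea behind such schemes can be illustrated with mixed precision iterative refinement: perform the expensive inner solve in fast single precision and accumulate corrections in double. The NumPy sketch below is a CPU stand-in for that pattern under the assumption of a dense, well-conditioned system; it is not the authors’ GPU multigrid implementation.

```python
import numpy as np

def mixed_precision_refine(A, b, tol=1e-12, max_iter=50):
    """Iterative refinement: single-precision inner solves,
    double-precision residuals and correction accumulation."""
    A32 = A.astype(np.float32)          # low-precision copy (the "GPU side")
    x = np.zeros_like(b)
    for _ in range(max_iter):
        r = b - A @ x                   # defect computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap inner solve
        x += d.astype(np.float64)       # correction accumulated in double
    return x

# Toy usage on a well-conditioned SPD system
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100.0 * np.eye(100)
b = rng.standard_normal(100)
x = mixed_precision_refine(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # near double roundoff,
                                                      # despite float32 inner solves
```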

OpenCurrent v1.0 released: CUDA-accelerated PDE solver

September 28th, 2009

OpenCurrent is an open source C++ library for solving Partial Differential Equations (PDEs) over regular grids using the CUDA platform from NVIDIA. It breaks a PDE solver down into three basic object types: “Grids,” “Solvers,” and “Equations.” “Grid” data structures efficiently implement regular 1D, 2D, and 3D arrays in both double and single precision. Grids support operations like computing linear combinations, managing host-device memory transfers, interpolating values at non-grid points, and performing array-wide reductions. “Solvers” use these data structures to calculate terms arising from discretizations of PDEs, such as finite-difference-based advection and diffusion schemes, and a multigrid solver for Poisson equations. These computational building blocks can be assembled into complete “Equation” objects that solve time-dependent PDEs. One such Equation solver is an incompressible Navier-Stokes solver that uses a second-order Boussinesq model. This equation solver is fully validated and has been used to study Rayleigh-Benard convection under a variety of different regimes. Benchmarks show it to perform about 8 times faster than an equivalent Fortran code running on an 8-core Xeon.
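To make the kind of regular-grid building block described above concrete, the sketch below implements one explicit finite-difference diffusion step on a 2D grid in NumPy. The function name and structure are illustrative assumptions for exposition, not OpenCurrent’s actual C++/CUDA classes.

```python
import numpy as np

def diffuse_step(u, nu, dx, dt):
    """One explicit step of du/dt = nu * Laplacian(u) on a regular 2D grid.
    Boundary values are held fixed (Dirichlet); stable for dt*nu/dx**2 <= 0.25."""
    lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
           - 4.0 * u[1:-1, 1:-1]) / dx**2
    u_next = u.copy()
    u_next[1:-1, 1:-1] += dt * nu * lap
    return u_next

# Toy usage: a hot spot spreading over a 64x64 grid
u = np.zeros((64, 64))
u[32, 32] = 1.0
for _ in range(100):
    u = diffuse_step(u, nu=0.1, dx=1.0, dt=1.0)
```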

Efficient multiplication of polynomials on graphics hardware

August 31st, 2009

Abstract:

We present an algorithm to efficiently multiply univariate polynomials with integer coefficients using the Number Theoretic Transform (NTT) on Graphics Processing Units (GPUs). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits the fused multiply-add capabilities of the graphics hardware. NTT multiplications are executed in parallel for a set of distinct primes, followed by reconstruction using the Chinese Remainder Theorem (CRT) on the GPU. Our benchmarks show NTT multiplication performance of up to 77 GMul/s. We compare our approach with CPU-based implementations of polynomial and large-integer multiplication provided by the NTL and GMP libraries.

(Pavel Emeliyanenko: “Efficient multiplication of polynomials on graphics hardware”, Proceedings of the 8th International Conference on Advanced Parallel Processing Technologies (APPT 2009), August 24-25, 2009, Rapperswil, Switzerland. DOI: 10.1007/978-3-642-03644-6_11)
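The sketch below shows the single-prime core of the technique in Python: a radix-2 NTT over Z_p used to multiply two polynomials. The paper’s GPU version runs this transform for several distinct primes in parallel and recombines the results with the CRT; the prime 998244353 and generator 3 here are conventional textbook choices for illustration, not the paper’s parameters.

```python
def ntt(a, p, g, invert=False):
    """Iterative radix-2 Number Theoretic Transform over Z_p.
    len(a) must be a power of two dividing p - 1; g is a primitive root mod p."""
    n = len(a)
    a = list(a)
    j = 0                                  # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                     # Cooley-Tukey butterflies
        w = pow(g, (p - 1) // length, p)
        if invert:
            w = pow(w, p - 2, p)           # inverse root via Fermat's little theorem
        for start in range(0, n, length):
            wn = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * wn % p
                a[start + k] = (u + v) % p
                a[start + k + length // 2] = (u - v) % p
                wn = wn * w % p
        length <<= 1
    if invert:
        n_inv = pow(n, p - 2, p)           # scale by 1/n for the inverse transform
        a = [x * n_inv % p for x in a]
    return a

def poly_mul(f, h, p=998244353, g=3):
    """Multiply polynomials f and h (coefficient lists, lowest degree first) mod p."""
    n = 1
    while n < len(f) + len(h) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)), p, g)
    fb = ntt(h + [0] * (n - len(h)), p, g)
    prod = [x * y % p for x, y in zip(fa, fb)]   # pointwise product in NTT domain
    return ntt(prod, p, g, invert=True)[:len(f) + len(h) - 1]

print(poly_mul([1, 1], [1, 2, 1]))  # (1+x)(1+2x+x^2) -> [1, 3, 3, 1]
```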

CULAtools: GPU-accelerated LAPACK

August 23rd, 2009

EM Photonics has recently released a preview beta edition of their CULAtools, an implementation of LAPACK for CUDA-enabled GPUs. This version comprises single-precision LU decomposition, QR factorization, singular value decomposition, and least squares. The full library, scheduled for release at NVIDIA GTC ’09, will contain much more functionality, in particular both single- and double-precision computations. Please refer to the website culatools.com for details, licenses and downloads.

Numerical Precision: How Much is Enough?

June 30th, 2009

A ScientificComputing.com article by Rob Farber explores the topic of numerical precision in the context of future exascale computing, asking the question “how do we know that anything we compute is correct?” The discussion centers on processors such as GPUs, which provide both single- and double-precision computation but at different throughput levels. “Taking a multi-precision approach can enhance the accuracy of a calculation and justify the use of mainly single-precision arithmetic (for performance) along with the occasional use of double-precision (64-bit) arithmetic for precision-sensitive operations,” writes Farber. (Rob Farber: “Numerical Precision: How Much is Enough?”, ScientificComputing.com. Accessed July 1, 2009.)

SIGGRAPH Poster: Extended-Precision Floating-Point Numbers for GPU Computation

August 10th, 2006

Using unevaluated sums of paired or quadrupled single-precision (f32) values, double-float (df64) and quad-float (qf128) numeric types can be implemented on current GPUs and used efficiently and effectively for extended-precision real and complex arithmetic. These numeric types provide 48 and 96 bits of precision, respectively, at f32 exponent ranges for computer graphics and general-purpose (GPGPU) programming. Double- and quad-floats may be useful not only for extending available precision but also for computing accurately with single-precision floats that are only partially IEEE-compliant. The poster and demos presented at ACM SIGGRAPH 2006 discussed the implementation and application of these numbers in the Cg language for real and complex GPU programming. The df64 library includes math routines for exponential, logarithmic, and trigonometric functions. The poster can be downloaded from Andrew Thall’s website. Technical details will be available shortly, and the code itself will be made available for distribution given sufficient interest.
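A flavor of how the df64 type works, sketched in NumPy float32 rather than the poster’s Cg: a value is an unevaluated (hi, lo) pair, and error-free transformations recover the rounding error of each operation. The function names follow the standard double-double literature (Knuth/Dekker) and are assumptions here, not the library’s actual routine names.

```python
import math
import numpy as np
f32 = np.float32

def two_sum(a, b):
    """Error-free addition (Knuth): returns (s, e) with s + e == a + b exactly."""
    s = f32(a + b)
    bv = f32(s - a)
    av = f32(s - bv)
    return s, f32(f32(a - av) + f32(b - bv))

def quick_two_sum(a, b):
    """Faster variant, valid when |a| >= |b|."""
    s = f32(a + b)
    return s, f32(b - f32(s - a))

def df64_add(x, y):
    """Add two double-floats (hi, lo); ~48 bits of precision from two f32 values."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))        # fold in the low-order parts
    return quick_two_sum(s, e)           # renormalize the pair

# Toy usage: pi as a (hi, lo) pair carries more precision than one float32
pi_hi = f32(math.pi)
pi_lo = f32(math.pi - float(pi_hi))
print(df64_add((pi_hi, pi_lo), (pi_hi, pi_lo)))  # ~2*pi to roughly 48 bits
```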

Implementation of float-float operators on graphics hardware

July 22nd, 2006

Abstract:

The Graphics Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphics processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations. However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results.

(G. Da Graca and D. Defour: “Implementation of float-float operators on graphics hardware”, 7th Conference on Real Numbers and Computers (RNC7), Nancy, France, July 2006.)
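Multiplication is the other half of a float-float format. Without a fused multiply-add, the classic route is Dekker’s splitting, sketched below in NumPy float32; this is a generic textbook formulation, not the operators from the paper.

```python
import numpy as np
f32 = np.float32
SPLITTER = f32(4097.0)   # 2**12 + 1 for binary32 (24-bit significand)

def split(a):
    """Dekker split: a == hi + lo, each half fitting in ~12 significand bits."""
    t = f32(SPLITTER * a)
    hi = f32(t - f32(t - a))
    return hi, f32(a - hi)

def two_prod(a, b):
    """Error-free product (Dekker, no FMA): returns (p, e) with p + e == a * b."""
    p = f32(a * b)
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    # accumulate the rounding error of p term by term
    e = f32(f32(f32(f32(a_hi * b_hi) - p) + f32(a_hi * b_lo)) + f32(a_lo * b_hi))
    return p, f32(e + f32(a_lo * b_lo))

# Toy usage: recover the rounding error of a single-precision product
p, e = two_prod(f32(1.0) / f32(3.0), f32(3.0))
print(p, e)   # p is the rounded product, e the part float32 lost
```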