Free GPU Computing Workshop in Adelaide, South Australia

July 29th, 2010

eResearch SA, XENON Systems and NVIDIA invite you to attend a free workshop on GPU computing with CUDA. The workshop will be held at 1:00PM on Tuesday 10 August 2010 at Mawson Lakes, in the Mawson Centre Lecture Theatre MC1-02.

Register now by visiting: http://nvidia.eventbrite.com


A complete modular resultant algorithm targeted for realization on graphics hardware

July 29th, 2010

Abstract:

This paper presents a complete modular approach to computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials, the algorithm first maps them to a prime field for sufficiently many primes, and then processes each modular image individually. We evaluate each polynomial at several points and compute a set of univariate resultants for each prime in parallel on the GPU. The remaining “combine” stage of the algorithm, comprising polynomial interpolation and Chinese remaindering, is also executed on the graphics processor. The GPU algorithm returns the coefficients of the resultant as a set of Mixed Radix (MR) digits. Finally, the large integer coefficients are recovered from the MR representation on the host machine. Using the displacement structure approach and efficient modular arithmetic, we achieve a speed-up of more than 100x over a CPU-based resultant algorithm from Maple 13.

(Pavel Emeliyanenko: “A complete modular resultant algorithm targeted for realization on graphics hardware”, Proceedings of the 4th International Workshop on Parallel and Symbolic Computation (PASCO2010), pages 35-43, Grenoble, France, July 2010.)
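The abstract says the GPU returns each resultant coefficient as Mixed Radix (MR) digits, which the host then turns back into a large integer. A minimal Python sketch of that final step follows, assuming plain Garner-style conversion from residues modulo a few small primes; the function names and the primes are illustrative choices, and this is not the paper's GPU code. Values are mapped into a symmetric range so that negative coefficients survive.

```python
import math

def mixed_radix_digits(residues, primes):
    """Garner's algorithm: residues x mod p_i -> MR digits a_i with
    x = a_0 + a_1*p_0 + a_2*p_0*p_1 + ...  and 0 <= a_i < p_i."""
    digits = []
    for i, p in enumerate(primes):
        t = residues[i] % p
        # Peel off each known digit, then divide by the prime it was scaled by
        # (division mod p = multiplication by the modular inverse).
        for j in range(i):
            t = (t - digits[j]) * pow(primes[j], -1, p) % p
        digits.append(t)
    return digits

def from_mixed_radix(digits, primes):
    """Evaluate the MR representation Horner-style, highest digit first."""
    x = 0
    for a, p in zip(reversed(digits), reversed(primes)):
        x = x * p + a
    return x

def recover_signed(residues, primes):
    """Reconstruct x and map it into the symmetric range around zero."""
    m = math.prod(primes)
    x = from_mixed_radix(mixed_radix_digits(residues, primes), primes)
    return x - m if x > m // 2 else x
```

Each coefficient is converted independently, so the digit computation parallelizes across coefficients; only the final Horner-style evaluation needs arbitrary-precision integers, which Python provides natively.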

Swarm-NG: integration of an ensemble of N-body systems

July 29th, 2010

The Swarm-NG package helps scientists and engineers harness the power of GPUs. In the early releases, Swarm-NG will focus on the integration of an ensemble of N-body systems evolving under Newtonian gravity. Swarm-NG does not replicate existing libraries that calculate forces for large-N systems on GPUs, but rather focuses on integrating an ensemble of many systems where N is small. This is of particular interest for astronomers who study the chaotic evolution of planetary systems. In the long term, we hope Swarm-NG will allow for the efficient parallel integration of user-defined systems of ordinary differential equations.
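The key idea here is ensemble-level parallelism: many independent small-N systems are advanced in lockstep instead of one large-N system. A minimal Python sketch of that structure follows, assuming a kick-drift-kick leapfrog integrator, G = 1 code units, and a dictionary layout per system; all of these (including the hypothetical helper `make_binary`) are illustrative choices, not Swarm-NG's API or integrator. The outer loop over systems is the part a GPU implementation would spread across threads.

```python
import math

G = 1.0  # gravitational constant in code units (assumption)

def make_binary(m2, r=1.0, m1=1.0):
    """Hypothetical helper: a circular two-body orbit in its centre-of-mass frame."""
    mtot = m1 + m2
    v = math.sqrt(G * mtot / r)  # relative speed of a circular orbit
    return {
        "mass": [m1, m2],
        "pos": [[-m2 / mtot * r, 0.0, 0.0], [m1 / mtot * r, 0.0, 0.0]],
        "vel": [[0.0, -m2 / mtot * v, 0.0], [0.0, m1 / mtot * v, 0.0]],
    }

def accelerations(pos, mass):
    """Direct-summation Newtonian accelerations; cheap because N is small."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = [pos[j][k] - pos[i][k] for k in range(3)]
                r = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
                for k in range(3):
                    acc[i][k] += G * mass[j] * d[k] / r ** 3
    return acc

def leapfrog_step(system, dt):
    """One kick-drift-kick step, updating the system's lists in place."""
    pos, vel, mass = system["pos"], system["vel"], system["mass"]
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
            pos[i][k] += dt * vel[i][k]        # drift
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick

def integrate_ensemble(ensemble, dt, steps):
    # Systems never interact, so each could live on its own GPU thread/block;
    # here they are simply visited one after another.
    for _ in range(steps):
        for system in ensemble:
            leapfrog_step(system, dt)

def energy(system):
    """Total (kinetic + potential) energy of one system."""
    pos, vel, mass = system["pos"], system["vel"], system["mass"]
    kin = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    pot = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            pot -= G * mass[i] * mass[j] / math.dist(pos[i], pos[j])
    return kin + pot
```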

QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters

July 29th, 2010

Abstract:

We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU accelerated.

(Alexander Moore and Alice C. Quillen: “QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters”. Preprint on arXiv; code available.)
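The abstract describes a switching step: particles in close approach are moved from the symplectic integrator to a Hermite integrator. A minimal Python sketch of only that partitioning step follows, assuming a Hill-radius criterion with a hypothetical threshold factor of 3 and using heliocentric distance as a stand-in for semi-major axis; QYMSYM's actual criterion may differ, and neither integrator is shown.

```python
import math
from itertools import combinations

def hill_radius(a, m, m_star=1.0):
    """Hill radius r_H = a * (m / (3 M_*))**(1/3)."""
    return a * (m / (3.0 * m_star)) ** (1.0 / 3.0)

def close_pairs(bodies, factor=3.0, m_star=1.0):
    """Pairs (i, j) closer than `factor` times the larger of their Hill radii.
    Heliocentric distance stands in for the semi-major axis (assumption)."""
    pairs = []
    for (i, bi), (j, bj) in combinations(enumerate(bodies), 2):
        rh = max(hill_radius(math.hypot(*bi["pos"]), bi["mass"], m_star),
                 hill_radius(math.hypot(*bj["pos"]), bj["mass"], m_star))
        if math.dist(bi["pos"], bj["pos"]) < factor * rh:
            pairs.append((i, j))
    return pairs

def partition(bodies, factor=3.0, m_star=1.0):
    """Split body indices into (Hermite set, symplectic set)."""
    flagged = {k for pair in close_pairs(bodies, factor, m_star) for k in pair}
    hermite = sorted(flagged)
    symplectic = [k for k in range(len(bodies)) if k not in flagged]
    return hermite, symplectic
```

For example, two Earth-mass bodies (m = 3e-6 solar masses) at 1 AU, separated by 0.001 AU, fall well inside three Hill radii (about 0.03 AU) and would both be handed to the Hermite integrator.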

SMVM on GPU

July 29th, 2010

From the paper’s abstract:

A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphics processing units (GPUs) in performance, architecture and programmability makes them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphics processing units.

From the paper’s conclusion:

We have introduced several efficient techniques to accelerate the execution of the sparse matrix vector multiplication (SMVM) on NVIDIA graphics processing units. The proposed methods increased the performance of the SMVM kernel on a GeForce 8800 GT up to 18.8 times compared to a quad-core CPU, and 3 times compared to previous work by Bell and Garland on accelerating SMVM for GPUs.

(M. Mehri Dehnavi, D. Fernandez and D. Giannacopoulos: “Finite element sparse matrix vector multiplication on GPUs”. IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, August 2010. DOI 10.1109/TMAG.2010.2043511)
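For readers new to the kernel, a plain compressed sparse row (CSR) SMVM is sketched below in Python. It is a reference baseline only, not the paper's new algorithm, and the helper names are hypothetical. Every output row is independent, which is what GPU SMVM kernels exploit: in the simplest "scalar" CSR kernel discussed by Bell and Garland, one thread computes one row, so irregular row lengths lead to load imbalance and poorly coalesced memory accesses.

```python
def csr_from_dense(dense):
    """Build CSR arrays (values, column indices, row pointers) from a dense matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's slice in values/col_idx
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A x for A in CSR form; each y[i] depends only on row i of A."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):  # in a scalar CSR GPU kernel: one thread per row
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the tridiagonal matrix [[4, -1, 0], [-1, 4, -1], [0, -1, 4]] and x = [1, 2, 3], the row pointers are [0, 2, 5, 7] and the product is [2.0, 4.0, 10.0].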

Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems

July 29th, 2010

Abstract:

Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications.

This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

This paper identifies several key areas of research and open problems for optimizing the performance of data parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86. The complete implementation of Ocelot is available open-source under the new BSD license at http://code.google.com/p/gpuocelot. Ongoing work involves translating PTX to AMD’s IL allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.

(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: “Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems”. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT2010), September 2010).
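Running a CUDA kernel on a multicore CPU requires serializing the threads of each block without breaking the semantics of `__syncthreads()`. One well-known way to do this (used, for example, by MCUDA) is to split the kernel at each barrier and wrap every barrier-free region in a loop over thread indices. The Python sketch below applies that idea to a hypothetical block-level sum reduction; it illustrates the general technique only, not Ocelot's PTX-to-LLVM translator.

```python
def run_block_reduce(data, block_dim):
    """Emulate a grid of CUDA-style blocks, each summing block_dim elements.
    Every barrier-delimited region becomes a loop over all thread indices, so
    all threads finish one region before any thread starts the next."""
    assert block_dim & (block_dim - 1) == 0, "block_dim must be a power of two"
    grid_dim = (len(data) + block_dim - 1) // block_dim
    out = [0] * grid_dim
    for block_idx in range(grid_dim):        # blocks are independent
        shared = [0] * block_dim             # per-block "shared memory"
        # Region 0: each thread loads one element (0 past the end of data).
        for tid in range(block_dim):
            gid = block_idx * block_dim + tid
            shared[tid] = data[gid] if gid < len(data) else 0
        # __syncthreads(): implied by the loop above having completed.
        s = block_dim // 2
        while s > 0:
            # One region: threads tid < s write shared[tid] and read
            # shared[tid + s] >= s, which no thread writes in this region,
            # so serial order gives the same result as parallel execution.
            for tid in range(block_dim):
                if tid < s:
                    shared[tid] += shared[tid + s]
            s //= 2                          # barrier between regions
        out[block_idx] = shared[0]           # thread 0 writes the block result
    return out
```

With `data = [1, ..., 10]` and `block_dim = 4`, the three blocks produce partial sums 10, 26 and 19.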

NVIDIA Parallel Nsight Now Shipping

July 21st, 2010

NVIDIA today announced the release of NVIDIA Parallel Nsight software, the industry’s first development environment for GPU-accelerated applications that works with Microsoft Visual Studio. “By adding functionality specifically for GPU Computing developers, Parallel Nsight makes the power of the GPU more accessible than ever before,” said Sanford Russell, GM of GPU Computing at NVIDIA. NVIDIA Parallel Nsight features a CUDA C/C++ debugger and application performance analyzer, as well as a graphics debugger and inspector. NVIDIA Parallel Nsight supports Windows HPC Server 2008, Windows 7 and Windows Vista.

OpenMM 2.0 Now Available to Accelerate Molecular Dynamics on NVIDIA and ATI GPUs

July 18th, 2010

Simbios, the NIH Center for Biomedical Computation at Stanford University, is excited to announce the release of OpenMM 2.0.

OpenMM was designed to enhance the performance of almost any molecular dynamics simulation package (MD package) by allowing the code to be executed on high performance computer architectures, in particular Graphics Processing Units (GPUs). Most molecular dynamics packages can be modified to call OpenMM, resulting in significant acceleration on such high performance architectures, without changing the way users interact with the MD package.

A Real-Time Multigrid Finite Hexahedra Method for Elasticity

July 11th, 2010

Figure: A 30,000-hexahedron FEM model.

Abstract:

In this paper we present a GPU-based multigrid approach for simulating elastic deformable objects in real time. Our method is based on a finite element discretization of the deformable object using hexahedra. It draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral regime, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence and to support memory access patterns which enable the hardware to coalesce multiple memory accesses into single memory transactions. This makes it possible to exploit the GPU’s parallel processing units and high memory bandwidth effectively via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 12 compared to a highly optimized CPU implementation. By using our approach, physics-based simulation at an object resolution of 64³ is achieved at interactive rates.

(Christian Dick, Joachim Georgii and Rüdiger Westermann: “A Real-Time Multigrid Finite Hexahedra Method for Elasticity”. http://wwwcg.in.tum.de/Research/Publications/CompMechanics)
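The abstract relies on matrix-free formulations of all multigrid steps. The paper's solver works on 3D hexahedral elasticity; as a heavily simplified stand-in, the Python sketch below shows the same ingredients (operator applied without storing a matrix, damped-Jacobi smoothing, restriction, prolongation and a recursive V-cycle) on a 1D Poisson problem -u'' = f with zero boundary values. The toy problem, function names and parameters are assumptions for illustration, not the authors' method.

```python
def apply_A(u, h):
    """Matrix-free 1D Laplacian (-u'') with zero Dirichlet boundaries."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out[i] = (2.0 * u[i] - left - right) / (h * h)
    return out

def residual(u, f, h):
    return [fi - ai for fi, ai in zip(f, apply_A(u, h))]

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi: u <- u + omega * D^-1 (f - A u), with D = 2 / h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full weighting: coarse point i sits at fine index 2i + 1."""
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
            for i in range((len(r) - 1) // 2)]

def prolong(e):
    """Linear interpolation from m coarse points to 2m + 1 fine points."""
    m = len(e)
    fine = [0.0] * (2 * m + 1)
    for i in range(m):
        fine[2 * i + 1] = e[i]
    for i in range(m + 1):
        left = e[i - 1] if i > 0 else 0.0
        right = e[i] if i < m else 0.0
        fine[2 * i] = 0.5 * (left + right)
    return fine

def v_cycle(u, f, h, pre=2, post=2):
    """One V-cycle; len(u) must be 2**k - 1 so the grids nest down to 1 point."""
    if len(u) == 1:
        return [f[0] * h * h / 2.0]  # coarsest grid: solve 2u/h^2 = f exactly
    u = smooth(u, f, h, pre)
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)
    u = [ui + ei for ui, ei in zip(u, prolong(ec))]
    return smooth(u, f, h, post)
```

With f = 1 on 63 interior points (h = 1/64), repeated V-cycles drive the residual to round-off level; because the central difference is exact for quadratics, the converged values match x(1 - x)/2 at the grid points.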

CULA 2.0 released

July 11th, 2010

EM Photonics announced today the general availability of CULA 2.0, its GPU-accelerated linear algebra library. The new version provides support for NVIDIA GPUs based on the latest “Fermi” architecture.

CULA contains a LAPACK interface comprising over 150 mathematical routines from LAPACK, the industry standard for computational linear algebra. EM Photonics’ CULA library includes many popular routines, including system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions. CULA offers performance up to an order of magnitude faster than highly optimized CPU-based linear algebra solvers. A variety of interfaces is available for integrating CULA directly into existing code: programmers can easily call GPU-accelerated CULA from their C/C++, FORTRAN, MATLAB, or Python codes, with no GPU programming experience required. CULA is available for every system equipped with GPUs based on the NVIDIA CUDA architecture, including 32- and 64-bit versions of Linux, Windows, and OS X.

More information is available at www.culatools.com.
