CUDA 4.1 Released

January 26th, 2012

Today NVIDIA released CUDA 4.1, including a new CUDA Toolkit, SDK, Visual Profiler, Parallel Nsight IDE and NVIDIA device driver.

CUDA 4.1 makes it easier to accelerate scientific research with GPUs with key features including

  • a redesigned Visual Profiler with automated performance analysis and expert guidance;
  • a new LLVM-based compiler that generates up to 10% faster code; and
  • 1000+ new imaging and signal processing functions in the NPP library.

The CuSparse library included with CUDA 4.1 has a new tridiagonal solver and 2x faster sparse matrix-vector multiplication using the ELL hybrid format, and the CuRand library included with CUDA 4.1 has two new random number generators. Read the rest of this entry »

PyCOOL: Python Cosmological Object-Oriented Lattice code

January 25th, 2012

PyCOOL (Cosmological Object-Oriented Lattice code) is a fast GPU accelerated program that solves the evolution of interacting scalar fields in an expanding universe with symplectic algorithms. The program has been written with the intention to hit a sweet spot of speed, accuracy and user friendliness. This is achieved by using the Python language with the  PyCUDA interface to make a program that is very easy to adapt to different scalar field models.  The program is publicly available under GNU General Public License at. See the PyCOOL website for more information.

Performance of SpMV in CUSPARSE, CUSP and SpeedIT

January 14th, 2012

The SpeedIt team recently compared and benchmarked the SpMV performance of CUSPARSE 4.0, CUSP 0.2.0 and SpeedIT 2.0 on 23 randomly chosen matrices from University Florida Matrix Collection. Comparisons were done on a Tesla C2050 in single and double precision. The full report is available at http://wp.me/p1ZihD-1.

Acceleware 4 Day CUDA Course

January 6th, 2012

Partnering with NVIDIA and Microsoft, this four day course is designed for Programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU.

Delivered by Acceleware’s Developers, who provide real world experience and examples, the training comprises classroom lectures and hands-on tutorials. Each student will be supplied with a laptop equipped with NVIDIA GPUs for the duration of the course. Small class sizes maximize learning and ensure a personal educational experience.

Register before January 13 and receive $250 off your course fee!
Enter promotional code AXTEB2012

HOOMD-blue 0.10.0 release

December 19th, 2011

HOOMD-blue performs general-purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many cores on a fast cluster. Flexible and configurable, HOOMD-blue is currently being used for coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants, dissipative particle dynamics simulations (DPD) of polymers, and crystallization of metals.

HOOMD-blue 0.10.0 adds many new features. Highlights include: Read the rest of this entry »

On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures

December 14th, 2011

Abstract:

In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P).

Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.

(Pennycook, S.J., Hammond, S.D., Mudalige, G.R., Wright, S.A. and Jarvis, S.A.: “On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures”,  The Computer Journal (in press) [DOI] [PREPRINT])

CUDA 4.1 RC2 Released

December 6th, 2011

The NVIDIA CUDA Toolkit 4.1 RC2 is now available for anyone to download. The key features of this release are:

  • A new LLVM based compiler
  • Over 1000 additional image processing function in the NPP library
  • A Visual profiler

There is also a new version of Parallel Nsight 2.1 RC2 with support for CUDA 4.1. To download and to find out more follow: http://bit.ly/sRpQvr

Introduction to Generic Accelerated Computing with Libra SDK

November 30th, 2011

Libra SDK is a sophisticated runtime including API, sample programs and documentation for massively accelerating software computations. This introduction tutorial provides an overview and usage examples of the powerful Libra API & math libraries executing on x86/x64, OpenCL, OpenGL and CUDA technology. Libra API enables generic and portable CPU/GPU computing within software development without the need to create multiple, specific and optimized code paths to support x86, OpenCL, OpenGL or CUDA devices. Link to PDF: www.gpusystems.com/doc/LibraGenericComputing.pdf

Alenka – A GPU database engine including compression

November 28th, 2011

Support for several types of compression has been added to the GPU-based database engine ålenkå . Supported algorithms include FOR (frame of reference), FOR-DELTA and dictionary compression. All compression algorithms run on the GPU achieving gigabytes per second compression and decompression speed. The use of compression allows to significantly reduce or eliminate I/O bottlenecks in analytical queries as shown by ålenkå’s results in the Star Schema and TPC-H benchmarks.

Integrating CUDA and GNU Autotools

November 17th, 2011

ClusterChimps.org has released a step by step guide to integrating CUDA with GNU Autotools. The guide covers building stand alone CUDA binaries, static CUDA libraries, shared CUDA libraries and comes with an example tarball. For more information go to http://www.clusterchimps.org/autotools.php

Page 1 of 2412345...1020...Last »