ViennaCL 1.2.0 released

January 2nd, 2012

Version 1.2.0 of the OpenCL-based C++ linear algebra library ViennaCL is now available for download! It features a high-level interface compatible with Boost.uBLAS, which allows for compact code and high productivity. Highlights of the new release are the following features (all experimental):

  • Several algebraic multigrid preconditioners
  • Sparse approximate inverse preconditioners
  • Fast Fourier transform
  • Structured dense matrices (circulant, Hankel, Toeplitz, Vandermonde)
  • Reordering algorithms (Cuthill-McKee, Gibbs-Poole-Stockmeyer)
  • Proxies for manipulating subvectors and submatrices

The features are expected to reach maturity in the 1.2.x branch. More information about the library including download links is available at http://viennacl.sourceforge.net.
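
As a flavor of the uBLAS-style interface, the following is a minimal sketch of assembling a sparse system on the host, copying it to the GPU, and solving it with the built-in conjugate gradient solver. It assumes the documented viennacl::copy and viennacl::linalg::solve entry points; the 1D Poisson matrix is only a stand-in for a real application matrix.

    #include <cstddef>
    #include <map>
    #include <vector>

    #include "viennacl/vector.hpp"
    #include "viennacl/compressed_matrix.hpp"
    #include "viennacl/linalg/cg.hpp"

    int main()
    {
      std::size_t const n = 1000;

      // Assemble a simple 1D Poisson system on the host as a stand-in problem.
      std::vector<std::map<unsigned int, double> > host_A(n);
      std::vector<double> host_rhs(n, 1.0);
      for (unsigned int i = 0; i < n; ++i)
      {
        host_A[i][i] = 2.0;
        if (i > 0)     host_A[i][i - 1] = -1.0;
        if (i + 1 < n) host_A[i][i + 1] = -1.0;
      }

      // Copy the system to the OpenCL device (double precision requires GPU support for it).
      viennacl::compressed_matrix<double> A(n, n);
      viennacl::vector<double> rhs(n);
      viennacl::copy(host_A, A);
      viennacl::copy(host_rhs.begin(), host_rhs.end(), rhs.begin());

      // Solve A x = rhs on the GPU with the conjugate gradient solver (tolerance, max. iterations).
      viennacl::vector<double> x =
          viennacl::linalg::solve(A, rhs, viennacl::linalg::cg_tag(1e-8, 500));

      // Copy the result back to the host.
      std::vector<double> host_x(n);
      viennacl::copy(x.begin(), x.end(), host_x.begin());
      return 0;
    }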

Introduction to Generic Accelerated Computing with Libra SDK

November 30th, 2011

Libra SDK is a runtime comprising an API, sample programs and documentation for massively accelerating software computations. This introductory tutorial provides an overview and usage examples of the Libra API and math libraries executing on x86/x64, OpenCL, OpenGL and CUDA devices. The Libra API enables generic and portable CPU/GPU computing within software development without the need to create multiple, specific and optimized code paths to support x86, OpenCL, OpenGL or CUDA devices. Link to PDF: www.gpusystems.com/doc/LibraGenericComputing.pdf

CfP: 20th High Performance Computing Symposium 2012

October 7th, 2011

The 2012 Spring Simulation Multi-conference will feature the 20th High Performance Computing Symposium (HPC 2012), devoted to the impact of high performance computing and communications on computer simulations. Topics of interest include:

  • high-performance/large-scale application case studies,
  • GPUs for general-purpose computation (GPGPU),
  • multicore and many-core computing,
  • power-aware computing,
  • large-scale visualization and data management,
  • tools and environments for coupling parallel codes,
  • parallel algorithms and architectures,
  • high-performance software tools,
  • component technologies for high-performance computing.

Important dates: Paper submission due: December 2, 2011; Notification of acceptance: January 13, 2012; Revised manuscript due: January 27, 2012; Symposium: March 26–29, 2012.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs

July 29th, 2011

Abstract:

Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics while maintaining a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution – especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid, which offers high flexibility and adapts well to the exigencies of modern computing platforms.

In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes, our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
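
The common thread in these smoothers is that a multi-coloring of the unknowns decouples the updates: rows of the same color do not reference each other, so one forward sweep decomposes into a short sequence of fully parallel passes, one per color. The following is a minimal CPU sketch of such a multi-colored Gauss-Seidel sweep for a CSR matrix; it is a generic illustration with hypothetical argument names, not code from the lmpLAtoolbox or HiFlow3 packages.

    #include <cstddef>
    #include <vector>

    // One multi-colored Gauss-Seidel sweep for A x = b, with A stored in CSR format.
    // "color_offsets" partitions the indices in "rows_by_color" into color classes;
    // rows of the same color share no off-diagonal couplings, so the inner loop
    // can run in parallel (e.g. with OpenMP, or as one GPU kernel launch per color).
    void colored_gauss_seidel_sweep(const std::vector<std::size_t>& row_ptr,
                                    const std::vector<std::size_t>& col_idx,
                                    const std::vector<double>&      values,
                                    const std::vector<double>&      b,
                                    std::vector<double>&            x,
                                    const std::vector<std::size_t>& rows_by_color,
                                    const std::vector<std::size_t>& color_offsets)
    {
      for (std::size_t c = 0; c + 1 < color_offsets.size(); ++c)
      {
        #pragma omp parallel for
        for (std::ptrdiff_t k = color_offsets[c];
             k < static_cast<std::ptrdiff_t>(color_offsets[c + 1]); ++k)
        {
          const std::size_t i = rows_by_color[k];
          double sigma = 0.0, diag = 1.0;
          for (std::size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
          {
            if (col_idx[j] == i) diag  = values[j];
            else                 sigma += values[j] * x[col_idx[j]];
          }
          x[i] = (b[i] - sigma) / diag;  // independent of the other rows of this color
        }
      }
    }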

V. Heuveline, D. Lukarski, N. Trost and J.-P. Weiss. Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs. EMCL Preprint Series No. 9. 2011.

GPIUTMD 0.9.6 released

June 26th, 2011

GPIUTMD stands for Graphic Processors at Isfahan University of Technology for Many-particle Dynamics. It performs general-purpose many-particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to thousands of cores on a fast cluster. Flexible and configurable, GPIUTMD is currently being used for all-atom and coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants; dissipative particle dynamics (DPD) simulations of polymers; and crystallization of metals using EAM potentials. GPIUTMD 0.9.6 adds many new features. Highlights include:

  • Morse bond potential (see the formula after this list)
  • Constant acceleration applied to a group of particles (useful for modeling gravity effects)
  • Computation of the full virial stress tensor (useful for mechanical characterization of materials)
  • Long-ranged electrostatics via PPPM
  • Support for CUDA 3.2
  • Theory manual
  • Up to a twenty percent performance boost in simulations
  • and more
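
For reference, the Morse bond potential listed above has the standard textbook form (the symbols below are the usual ones and need not match GPIUTMD's input parameter names):

    V(r) = D_0 \left[ 1 - e^{-\alpha (r - r_0)} \right]^2

where D_0 is the depth of the potential well, \alpha controls its width, and r_0 is the equilibrium bond length.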

A demo version of GPIUTMD 0.9.6 will be available soon for download under an open source license. Check out the quick start tutorial to get started, or check out the full documentation to see everything it can do.


GPU computing in medical physics: A review

May 29th, 2011

Abstract:

The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing.

(Guillem Pratx & Lei Xing: “GPU computing in medical physics: A review”, Med. Phys., vol 38(5), pp. 2685-2698, May 2011. [DOI])

A memory efficient and fast sparse matrix vector product on a GPU

May 4th, 2011

Abstract:

This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix-vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it has both a low memory footprint and good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of a large sparse and complex-valued system of linear equations arising in computational electromagnetics. Numerical tests have shown that the performance of the new implementation reaches 69 GFLOPS in complex single precision arithmetic. Compared to an optimized implementation on a six-core Central Processing Unit (CPU, Intel Xeon 5680), this performance implies a speedup by a factor of six. In terms of speed, the new format is as fast as the best format published so far, and at the same time it does not introduce redundant zero elements which have to be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can be handled using low-cost commodity GPUs with a limited amount of on-board memory.
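
The memory-footprint argument is easiest to see on the plain ELLPACK-R layout that Sliced ELLR-T refines: values and column indices are padded per row and stored column-major, and a per-row length array keeps the kernel from ever touching the padding. The sketch below is a simplified CPU illustration of that idea with hypothetical names, not the paper's GPU kernel; Sliced ELLR-T additionally pads only within slices of rows and assigns T threads to each row.

    #include <cstddef>
    #include <vector>

    // Simplified ELLPACK-R sparse matrix-vector product y = A * x.
    // values/col_idx are stored column-major with leading dimension num_rows,
    // i.e. entry j of row i lives at index j * num_rows + i (coalesced on a GPU).
    // row_len[i] holds the actual number of nonzeros of row i, so the padded
    // entries beyond row_len[i] are never read.
    void ellr_spmv(std::size_t                     num_rows,
                   const std::vector<double>&      values,
                   const std::vector<std::size_t>& col_idx,
                   const std::vector<std::size_t>& row_len,
                   const std::vector<double>&      x,
                   std::vector<double>&            y)
    {
      for (std::size_t i = 0; i < num_rows; ++i)  // one GPU thread per row in the real kernel
      {
        double sum = 0.0;
        for (std::size_t j = 0; j < row_len[i]; ++j)
          sum += values[j * num_rows + i] * x[col_idx[j * num_rows + i]];
        y[i] = sum;
      }
    }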

(A. Dziekonski, A. Lamecki, and M. Mrozowski: “A memory efficient and fast sparse matrix vector product on a GPU“, Progress In Electromagnetics Research, Vol. 116, 49-63, 2011. [PDF])

High Throughput Parallel Molecular Dynamics for GPUs

April 6th, 2011

The North Carolina Renaissance Computing Institute (RENCI) is running Amber PMEMD on the Open Science Grid, the high throughput computing (HTC) fabric used by the Large Hadron Collider (LHC). This approach is likely to be helpful to researchers facing any of these challenges:

  1. Being constrained by limited computing resources, including access to GPGPUs
  2. Manually executing the same simulation repeatedly with different parameters
  3. Making simulations easier to understand, share, scale and re-use across compute resources

For more information see these two blog posts: High Throughput Parallel Molecular Dynamics and CUDA/Tesla Accelerated PMEMD on OSG. Contact Steve Cox (scox@renci.org) if you’d like to discuss further and determine if your application is a fit. If it is, RENCI can provide access to the grid as well as tools for executing and managing simulations.

IMPETUS Afea Solver: A novel Finite Element code adapted to GPU technology

October 16th, 2010

IMPETUS Afea is proud to announce the launch of IMPETUS Afea Solver (version 1.0).

The IMPETUS Afea Solver is a non-linear explicit finite element tool. It is developed to predict large deformations of structures and components exposed to extreme loading conditions. The tool is applicable to transient dynamics and quasi-static loading conditions. The primary focus of the IMPETUS Afea Solver is accuracy, robustness and simplicity for the user. The number of purely numerical parameters that the user has to provide as input is kept to a minimum. The IMPETUS Afea Solver is adapted to GPU technology: utilizing the computational power of a potent graphics card can considerably speed up calculations.

IMPETUS Afea Solver Video on YouTube

For more information or requests please contact sales@impetus-afea.com

High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

June 23rd, 2010

Abstract:

We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting, for instance, from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current-generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing highly optimized implementation in the C language with MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communication across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements; depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.
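
The mesh coloring mentioned in the abstract is what makes the summation over degrees of freedom race-free: elements are grouped into colors such that no two elements of one color share a degree of freedom, so each per-color pass can update the global vector without atomic operations. The following is a minimal CPU sketch of that idea with a hypothetical data layout, not the authors' CUDA code.

    #include <cstddef>
    #include <vector>

    // Accumulate element contributions into a global vector, color by color.
    // Elements within one color share no global degree of freedom, so the inner
    // loop is free of write conflicts and can run in parallel (one GPU thread
    // per element in a real implementation).
    void assemble_by_color(const std::vector<std::vector<std::size_t> >& elements_by_color,
                           const std::vector<std::size_t>& element_dofs,    // dofs_per_element entries per element
                           std::size_t                     dofs_per_element,
                           const std::vector<double>&      element_contrib, // same layout as element_dofs
                           std::vector<double>&            global_vector)
    {
      for (std::size_t c = 0; c < elements_by_color.size(); ++c)
      {
        #pragma omp parallel for
        for (std::ptrdiff_t k = 0;
             k < static_cast<std::ptrdiff_t>(elements_by_color[c].size()); ++k)
        {
          const std::size_t e = elements_by_color[c][k];
          for (std::size_t l = 0; l < dofs_per_element; ++l)
          {
            const std::size_t idx = e * dofs_per_element + l;
            global_vector[element_dofs[idx]] += element_contrib[idx];  // no conflict within this color
          }
        }
      }
    }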

(Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke and David Michéa: “High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster”, accepted for publication in Journal of Computational Physics, June 2010. [PDF preprint] [DOI])
