amgcl: an accelerated algebraic multigrid for C++

December 21st, 2012

amgcl is a simple and generic algebraic multigrid (AMG) hierarchy builder. Supported coarsening methods are classical Ruge-Stüben coarsening and either plain or smoothed aggregation. The constructed hierarchy is stored and used with the help of one of the supported backends, which include VexCL, ViennaCL, and CUSPARSE/Thrust.

With the help of amgcl, the solution of a large sparse system of linear equations can easily be accelerated through OpenCL, CUDA, or OpenMP technologies. The source code of the library is publicly available under the MIT license.
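To make the idea of plain aggregation concrete, here is a minimal sketch of one coarsening step. It is not amgcl's API: the function names, the fixed pairing of neighbouring nodes, and the dense list-of-lists matrix are all assumptions chosen for brevity (a real library would use sparse storage and a strength-of-connection criterion). Each fine node is assigned to one aggregate, and the coarse operator is the Galerkin product A_c = P^T A P for the piecewise-constant prolongation P.

```python
def laplacian_1d(n):
    """Dense 1D Laplacian stencil (-1, 2, -1), used only as test input."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def aggregate_pairs(n):
    """Hypothetical aggregation rule: node i joins aggregate i // 2."""
    return [i // 2 for i in range(n)]


def galerkin_coarse(A, agg):
    """Return A_c = P^T A P, where P[i][agg[i]] = 1 and P is 0 elsewhere.

    With such a P, the product reduces to summing every fine entry
    A[i][j] into the coarse entry (agg[i], agg[j]).
    """
    nc = max(agg) + 1
    Ac = [[0.0] * nc for _ in range(nc)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            Ac[agg[i]][agg[j]] += a
    return Ac


A = laplacian_1d(4)
agg = aggregate_pairs(4)
print(galerkin_coarse(A, agg))  # [[2.0, -1.0], [-1.0, 2.0]]
```

Applying the same step repeatedly to each coarse matrix yields a hierarchy of levels. Smoothed aggregation would additionally apply a smoothing step to P before forming the product, which is not shown here.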

rCUDA 4.0 released

December 18th, 2012

rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA makes it possible to run existing GPU-accelerated applications on remote GPUs within a cluster (by sharing GPUs, aggregating them, or both) with negligible overhead. The new version is available free of charge, along with examples, manuals, and additional information.

Alea.cuBase – GPU computing in .NET

December 17th, 2012

Alea.cuBase makes it possible to create GPU-accelerated applications at all levels of sophistication, from simple GPU kernels up to complex GPU algorithms using textures, shared memory, and other advanced GPU programming techniques, fully integrated into .NET. The GPU kernels are developed in the functional language F# and are callable from any other .NET language. No additional wrappers or assembly translation processes are required. Alea.cuBase also allows GPU code to be created dynamically at run time, which opens up new possibilities for GPU-accelerated applications. Trial versions are available.

Registration Now Open for GPU Technology Conference 2013

December 12th, 2012

From a recent announcement:

GTC is the largest conference dedicated to heterogeneous parallel computing to solve the most complex computational challenges and features 300+ sessions over four days.

Whether you’re a commercial developer responsible for getting applications or products to market quickly or a researcher whose results are tied to important funding sources, GTC 2013 offers exceptional opportunities to learn directly from some of the foremost thinkers and practitioners in parallel computing. Immerse yourself in the best practices, solutions, and techniques that can help you enhance your skills, improve your workflow, and accelerate time to your all-important results.

Register today.

Attended GTC before? You’re entitled to a 15% discount off a Full Conference or One-Day Conference pass. Please use discount code GMALUM15 when you register.

Final CFP: Third Workshop on Parallel Computing and Optimization, PCO’13, Boston, USA

December 3rd, 2012

The Third Workshop on Parallel Computing and Optimization (PCO’13) will be held in conjunction with the IEEE IPDPS symposium in Boston, USA, on May 24, 2013. The paper submission deadline is January 4, 2013.

The Workshop on Parallel Computing and Optimization aims to provide a forum for scientific researchers and engineers on recent advances in parallel or distributed computing for difficult combinatorial optimization problems, such as 0-1 multidimensional knapsack problems and cutting stock problems, large-scale linear programming problems, nonlinear optimization problems, and global optimization problems. Emphasis will be placed on new techniques for solving these difficult problems, such as cooperative methods for integer programming problems and polynomial optimization methods. Aspects related to Combinatorial Scientific Computing (CSC) will also be treated. Finally, the use of new approaches in parallel computing, such as GPU or hybrid computing, peer-to-peer computing, and cloud computing, will be considered, as will applications to planning, logistics, manufacturing, finance, telecommunications, and computational biology.

Please refer to the workshop webpage for more details and for submission instructions.

ViennaCL 1.4.0 with CUDA, OpenCL and OpenMP support

December 3rd, 2012

The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:

  • Two computing backends in addition to OpenCL: CUDA and OpenMP
  • Improved performance for (Block-) ILU0/ILUT preconditioners
  • Optional level scheduling for ILU substitutions on GPUs
  • Mixed-precision CG solver
  • Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)

Contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available online.

Parallel Nonbinary LDPC Decoding on GPU

December 3rd, 2012


Nonbinary Low-Density Parity-Check (LDPC) codes are a class of error-correcting codes constructed over the Galois field GF(q) for q > 2. As extensions of binary LDPC codes, nonbinary LDPC codes can provide better error-correcting performance when the code length is short or moderate, but at the cost of higher decoding complexity. This paper proposes a massively parallel implementation of a nonbinary LDPC decoding accelerator based on a graphics processing unit (GPU) to achieve both great flexibility and scalability. The implementation maps the Min-Max decoding algorithm to the GPU’s massively parallel architecture. We highlight the methodology to partition the decoding task across a heterogeneous platform consisting of the CPU and GPU. The experimental results show that our GPU-based implementation can achieve high throughput while still providing great flexibility and scalability.

(Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, and Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012. [PDF])

Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures

November 14th, 2012


SPECFEM3D is a widely used community code which simulates seismic wave propagation in earth-science applications. It can be run either on multi-core CPUs only or together with many-core GPU devices on large GPU clusters. The new implementation is optimally fine-tuned and achieves excellent performance results. Mesh coloring enables an efficient accumulation of border nodes in the assembly process over an unstructured mesh on the GPU, while asynchronous GPU-CPU memory transfers and non-blocking MPI are used to overlap communication and computation, effectively hiding synchronizations. To demonstrate the performance of the inversion, we present two case studies run on the Cray XE6 and XK6 architectures on up to 896 nodes: (1) focusing on the most commonly used forward simulations, we simulate wave propagation generated by earthquakes in Turkey, and (2) testing the most complex simulation type of the package, we use ambient seismic noise to image 3D crust and mantle structure beneath western Europe.

(Max Rietmann, Peter Messmer, Tarje Nissen-Meyer, Daniel Peter, Piero Basini, Dimitri Komatitsch, Olaf Schenk, Jeroen Tromp, Lapo Boschi and Domenico Giardini: “Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures”, Proceedings of the 2012 ACM/IEEE conference on Supercomputing, Nov. 2012. [WWW])
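The mesh coloring mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which coloring algorithm SPECFEM3D uses, so the greedy strategy below is an assumption, and the function names and the tiny example mesh are hypothetical. The goal is that no two elements of the same color share a node, so all elements of one color can add their contributions to the shared nodes in parallel without write conflicts.

```python
def color_elements(elements):
    """Greedily give each element the smallest color that is not yet
    used by an already-colored element sharing one of its nodes."""
    node_colors = {}  # node id -> colors of elements touching that node
    colors = []
    for elem in elements:
        used = set()
        for node in elem:
            used |= node_colors.get(node, set())
        color = 0
        while color in used:
            color += 1
        colors.append(color)
        for node in elem:
            node_colors.setdefault(node, set()).add(color)
    return colors


def is_conflict_free(elements, colors):
    """Check that no node is touched by two elements of the same color."""
    seen = set()
    for elem, color in zip(elements, colors):
        for node in elem:
            if (node, color) in seen:
                return False
            seen.add((node, color))
    return True


# A chain of four line elements; neighbours share one node each.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
colors = color_elements(chain)
print(colors)                           # [0, 1, 0, 1]
print(is_conflict_free(chain, colors))  # True
```

In an assembly loop, the colors would be processed one after another, with all elements of the current color handled concurrently.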

A (ir)regularity-aware task scheduler for heterogeneous platforms

November 10th, 2012


This paper addresses the design, implementation and validation of an effective scheduling scheme for both regular and irregular applications on heterogeneous platforms. The scheduler uses an empirical performance model to dynamically schedule the workload, organized into a given number of chunks, and follows the Heterogeneous Earliest Finish Time (HEFT) scheduling algorithm, which ranks the tasks based on both their computation and communication costs. The evaluation of the proposed approach is based on three case studies – the SAXPY, the FFT and the Barnes-Hut algorithms – two regular and one irregular application. The scheduler was evaluated on a heterogeneous platform with one quad-core CPU-chip accelerated by one or two GPU devices, embedded in the GAMA framework. The evaluation runs measured the effectiveness, the efficiency and the scalability of the proposed method. Results show that the proposed model was effective in addressing both regular and irregular applications, on heterogeneous platforms, while achieving ideal (>=100%) levels of efficiency in the irregular Barnes-Hut algorithm.

(Artur Mariano, Ricardo Alves, Joao Barbosa, Luis Paulo Santos and Alberto Proenca: “A (ir)regularity-aware task scheduler for heterogeneous platforms”, Proceedings of the 2nd International Conference on High Performance Computing, Kiev, October 2012, pp. 45-56. [PDF])
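The ranking step of HEFT referred to in the abstract can be sketched as follows. This is the textbook upward rank, not the paper's scheduler: the task graph, the per-device costs, and the communication costs are made-up numbers, whereas the paper derives its costs from an empirical performance model. A task's upward rank is its cost averaged over all devices plus the largest value of (communication cost + upward rank) over its successors.

```python
from functools import lru_cache


def average_costs(per_device_cost):
    """Average each task's computation cost over all devices."""
    return {t: sum(c) / len(c) for t, c in per_device_cost.items()}


def upward_ranks(avg_cost, successors, comm_cost):
    """rank(t) = avg_cost[t] + max over successors s of
    (comm_cost[(t, s)] + rank(s)); exit tasks have no second term."""
    @lru_cache(maxsize=None)
    def rank(task):
        tail = max(
            (comm_cost[(task, s)] + rank(s) for s in successors.get(task, ())),
            default=0.0,
        )
        return avg_cost[task] + tail

    return {t: rank(t) for t in avg_cost}


# Hypothetical diamond-shaped task graph: A -> B, A -> C, B -> D, C -> D.
# Each tuple holds illustrative (CPU, GPU) execution costs.
per_device_cost = {"A": (3.0, 1.0), "B": (4.0, 2.0),
                   "C": (1.5, 0.5), "D": (3.0, 1.0)}
successors = {"A": ("B", "C"), "B": ("D",), "C": ("D",)}
comm_cost = {("A", "B"): 1.0, ("A", "C"): 2.0,
             ("B", "D"): 1.0, ("C", "D"): 1.0}

ranks = upward_ranks(average_costs(per_device_cost), successors, comm_cost)
order = sorted(ranks, key=ranks.get, reverse=True)
print(order)  # ['A', 'B', 'C', 'D']
```

HEFT would then take the tasks in this order and place each one on the device that gives it the earliest finish time; that assignment step is omitted here.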

GPU Technology Theater @ SC12

November 8th, 2012

Supercomputing luminaries and experts such as Jack Dongarra and Takayuki Aoki will be presenting in NVIDIA’s GPU Technology Theater at SC12. Talks will take place every 30 minutes and will also be webcast live with interactive Q&A on NVIDIA’s website. The complete lineup of science and developer talks is available online. SC12 takes place Nov. 10-16 in Salt Lake City, Utah.
