Parallel Nonbinary LDPC Decoding on GPU

December 3rd, 2012


Nonbinary Low-Density Parity-Check (LDPC) codes are a class of error-correcting codes constructed over the Galois field GF(q) for q > 2. As extensions of binary LDPC codes, nonbinary LDPC codes can provide better error-correcting performance when the code length is short or moderate, but at the cost of higher decoding complexity. This paper proposes a massively parallel implementation of a nonbinary LDPC decoding accelerator based on a graphics processing unit (GPU) to achieve both great flexibility and scalability. The implementation maps the Min-Max decoding algorithm to the GPU’s massively parallel architecture. We highlight the methodology to partition the decoding task across a heterogeneous platform consisting of the CPU and GPU. The experimental results show that our GPU-based implementation can achieve high throughput while still providing great flexibility and scalability.

(Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, and Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012. [PDF])
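
As a rough illustration of the Min-Max check-node step named in the abstract, here is a minimal sequential Python sketch. It is not the authors' GPU implementation. It assumes a field GF(2^p) whose elements are stored as integers, so that field addition is bitwise XOR, and it assumes every parity-check coefficient equals 1 (non-unit coefficients would additionally require permuting each message). The function names and the message values are hypothetical.

```python
from functools import reduce


def combine(m1, m2):
    """Elementary min-max step for two messages over GF(2^p).

    out[a] = min over b of max(m1[b], m2[a XOR b]),
    where XOR is addition in GF(2^p) (assumed representation).
    """
    q = len(m1)
    return [min(max(m1[b], m2[a ^ b]) for b in range(q)) for a in range(q)]


def check_node_update(incoming):
    """Return one outgoing message per edge, each built from all other edges."""
    outgoing = []
    for j in range(len(incoming)):
        others = incoming[:j] + incoming[j + 1:]
        outgoing.append(reduce(combine, others))
    return outgoing


# Hypothetical messages for a degree-3 check node over GF(4);
# smaller values mean "more likely".
messages = [
    [0, 2, 3, 5],
    [1, 0, 4, 2],
    [0, 3, 1, 2],
]
outgoing = check_node_update(messages)
```

Each outgoing entry can be computed independently of the others, which is the kind of fine-grained parallelism a GPU mapping can exploit.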

Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures

November 14th, 2012


SPECFEM3D is a widely used community code which simulates seismic wave propagation in earth-science applications. It can be run either on multi-core CPUs only or together with many-core GPU devices on large GPU clusters. The new implementation is optimally fine-tuned and achieves excellent performance results. Mesh coloring enables an efficient accumulation of border nodes in the assembly process over an unstructured mesh on the GPU, and asynchronous GPU-CPU memory transfers and non-blocking MPI are used to overlap communication and computation, effectively hiding synchronizations. To demonstrate the performance of the inversion, we present two case studies run on the Cray XE6 and XK6 architectures on up to 896 nodes: (1) focusing on the most commonly used forward simulations, we simulate wave propagation generated by earthquakes in Turkey, and (2) testing the most complex simulation type of the package, we use ambient seismic noise to image 3D crust and mantle structure beneath western Europe.

(Max Rietmann, Peter Messmer, Tarje Nissen-Meyer, Daniel Peter, Piero Basini, Dimitri Komatitsch, Olaf Schenk, Jeroen Tromp, Lapo Boschi and Domenico Giardini: “Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures”, Proceedings of the 2012 ACM/IEEE conference on Supercomputing, Nov. 2012. [WWW])

A (ir)regularity-aware task scheduler for heterogeneous platforms

November 10th, 2012


This paper addresses the design, implementation and validation of an effective scheduling scheme for both regular and irregular applications on heterogeneous platforms. The scheduler uses an empirical performance model to dynamically schedule the workload, organized into a given number of chunks, and follows the Heterogeneous Earliest Finish Time (HEFT) scheduling algorithm, which ranks the tasks based on both their computation and communication costs. The evaluation of the proposed approach is based on three case studies – the SAXPY, the FFT and the Barnes-Hut algorithms – two regular and one irregular application. The scheduler was evaluated on a heterogeneous platform with one quad-core CPU-chip accelerated by one or two GPU devices, embedded in the GAMA framework. The evaluation runs measured the effectiveness, the efficiency and the scalability of the proposed method. Results show that the proposed model was effective in addressing both regular and irregular applications, on heterogeneous platforms, while achieving ideal (>=100%) levels of efficiency in the irregular Barnes-Hut algorithm.

(Artur Mariano, Ricardo Alves, Joao Barbosa, Luis Paulo Santos and Alberto Proenca: “A (ir)regularity-aware task scheduler for heterogeneous platforms”, Proceedings of the 2nd International Conference on High Performance Computing, Kiev, October 2012, pp. 45-56. [PDF])
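
To make the earliest-finish-time idea concrete, the following Python sketch assigns independent chunks to devices greedily. It is a simplified illustration in the spirit of HEFT, not the paper's scheduler or the GAMA framework: it ignores task dependencies, and the function name, the cost tables, and the two-device setup (device 0 as the CPU, device 1 as a GPU) are assumptions made for the example.

```python
def schedule_chunks(costs, transfer):
    """Greedy earliest-finish-time assignment of independent chunks.

    costs[i][d]    : estimated compute time of chunk i on device d
    transfer[i][d] : estimated time to move chunk i's data to device d
    Returns (assignment, makespan).
    """
    n_devices = len(costs[0])
    ready = [0.0] * n_devices  # time at which each device becomes free
    # Rank chunks by average compute + transfer cost, most expensive first.
    order = sorted(
        range(len(costs)),
        key=lambda i: -sum(c + t for c, t in zip(costs[i], transfer[i])) / n_devices,
    )
    assignment = {}
    for i in order:
        # Finish time of chunk i on every device, given the current load.
        finish = [ready[d] + transfer[i][d] + costs[i][d] for d in range(n_devices)]
        best = min(range(n_devices), key=finish.__getitem__)
        assignment[i] = best
        ready[best] = finish[best]
    return assignment, max(ready)


# Hypothetical estimates: the GPU computes faster but pays a transfer cost.
costs = [[8.0, 2.0], [8.0, 2.0], [4.0, 1.0], [4.0, 1.0]]
transfer = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.5], [0.0, 0.5]]
assignment, makespan = schedule_chunks(costs, transfer)
```

With these made-up numbers the GPU receives three of the four chunks and the CPU picks up one, because waiting for the GPU would finish later than running that chunk on the idle CPU.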

GPU Technology Theater @ SC12

November 8th, 2012

Supercomputing luminaries and experts like Jack Dongarra and Takayuki Aoki will be presenting in NVIDIA’s GPU Technology Theater at SC12. Talks will happen every 30 minutes and will also be webcast live with interactive Q&A on NVIDIA’s website. The complete lineup of science and developer talks is available online. SC12 takes place Nov. 10-16 in Salt Lake City, Utah.

Generating Efficient Quantum Chemistry Codes for Novel Architectures

November 8th, 2012


We describe an extension of our graphics processing unit (GPU) electronic structure program TeraChem to include atom-centered Gaussian basis sets with d angular momentum functions. This was made possible by a “meta-programming” strategy that leverages computer algebra systems for the derivation of equations and their transformation to correct code. We generate a multitude of code fragments that are formally mathematically equivalent, but differ in their memory and floating-point operation footprints. We then select between different code fragments using empirical testing to find the highest performing code variant. This leads to an optimal balance of floating-point operations and memory bandwidth for a given target architecture without laborious manual tuning. We show that this approach is capable of similar performance compared to our hand-tuned GPU kernels for basis sets with s and p angular momenta. We also demonstrate that mixed precision schemes (using both single and double precision) remain stable and accurate for molecules with d functions. We provide benchmarks of the execution time of entire self-consistent field (SCF) calculations using our GPU code and compare to mature CPU based codes, showing the benefits of the GPU architecture for electronic structure theory with appropriately redesigned algorithms. We suggest that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.

(Alexey V. Titov, Ivan S. Ufimtsev, Nathan Luehr and Todd J. Martínez: “Generating Efficient Quantum Chemistry Codes for Novel Architectures”, accepted for publication in the Journal of Chemical Theory and Computation, 2012. [DOI])
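
The empirical selection step described in the abstract can be pictured with a toy Python sketch: several mathematically equivalent variants are timed on the target machine and the fastest is kept. This is only an analogy and not TeraChem's generator, which derives GPU kernels with computer algebra systems; the two polynomial evaluators and the helper `pick_fastest` are hypothetical stand-ins for generated code fragments.

```python
import timeit


def poly_naive(coeffs, x):
    # Evaluates sum(c_k * x**k) with an explicit power per term.
    return sum(c * x ** k for k, c in enumerate(coeffs))


def poly_horner(coeffs, x):
    # Same polynomial via Horner's rule, which needs fewer multiplications.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def pick_fastest(variants, args, repeats=5, number=200):
    """Time each candidate on this machine and return the quickest one."""
    timings = {}
    for fn in variants:
        runs = timeit.repeat(lambda: fn(*args), repeat=repeats, number=number)
        timings[fn.__name__] = min(runs)  # best of several runs reduces noise
    best = min(variants, key=lambda fn: timings[fn.__name__])
    return best, timings


coeffs = [1.0, -2.0, 0.5, 3.0]
best, timings = pick_fastest([poly_naive, poly_horner], (coeffs, 1.5))
```

Which variant wins depends on the machine and interpreter, which is the point of measuring rather than predicting.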

CfP: High Performance Computing Symposium

November 8th, 2012

The 21st High Performance Computing Symposium (HPC 2013) is devoted to the impact of high performance computing and communications on computer simulations. Advances in multicore and many-core architectures, networking, high end computers, large data stores, and middleware capabilities are ushering in a new era of high performance parallel and distributed simulations. Along with these new capabilities come new challenges in computing and system modeling. The goal of HPC 2013 is to encourage innovation in high performance computing and communication technologies and to promote synergistic advances in modeling methodologies and simulation. It will promote the exchange of ideas and information between universities, industry, and national laboratories about new developments in system modeling, high performance computing and communication, and scientific computing and simulation.

Call For Papers: Sixth Workshop on General Purpose Processing Using GPUs

November 6th, 2012

The Sixth Workshop on General Purpose Processing Using GPUs (GPGPU6) is held in conjunction with ASPLOS XVIII, Houston, TX, March 17, 2013.

Overview: The goal of this workshop is to provide a forum to discuss new and emerging general-purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s workshop is particularly interested in new heterogeneous GPU platforms. Papers are being sought on many aspects of GPUs, including (but not limited to):

  • GPU applications + GPU compilation
  • GPU programming environments + GPU power/efficiency
  • GPU architectures + GPU benchmarking/measurements
  • Multi-GPU systems + Heterogeneous GPU platforms

Submission Information: Authors should submit their papers using the ACM SIG Proceedings format in double-column style, following the directions on the conference website. Submitted papers will be evaluated based on originality, significance to topics, technical soundness, and presentation quality. At least one author must register and attend GPGPU to present the work. Accepted papers will be included in preliminary proceedings and distributed at the event. All papers will be made available at the workshop and will also be published in the ACM Conference Proceedings Series.

GPU Technology Conference 2013 Call for Posters is Open

November 6th, 2012

We’re looking for novel or interesting research topics in GPU computing, computer graphics, cloud graphics, game development, and applications of GPUs. We strongly encourage international attendees to submit early in order to receive notifications in time for US visa deadlines.

OpenCL CodeBench Eclipse Code Creation Tools

November 3rd, 2012

OpenCL CodeBench is a code creation and productivity tools suite designed to accelerate and simplify OpenCL software development. OpenCL CodeBench provides developers with automation tools for host code and unit test bench generation. Kernel code development on OpenCL is accelerated and enhanced through a language-aware editor delivering advanced incremental code analysis features. Programmers new to OpenCL can choose to be guided through an Eclipse wizard, while power users can leverage the command line interface with XML-based configuration files. OpenCL CodeBench Beta is now available for Linux and Windows operating systems.

Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU

October 30th, 2012


We present a new format for storing sparse matrices on the GPU. We compare it with several other formats, including CUSPARSE, which is today probably the best choice for processing sparse matrices on the GPU in CUDA. Contrary to CUSPARSE, which works with the common CSR format, our new format requires conversion. However, multiplication of a sparse matrix and a vector is significantly faster for many matrices. We demonstrate this on a set of 1,600 matrices and show for what types of matrices our format is profitable.

(Heller M., Oberhuber T.: “Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU”, Proceedings of Algoritmy 2012, 2012, Handlovičová A., Minarechová Z. and Ševčovič D. (ed.), pages 282-290, ISBN 978-80-227-3742-5) [ARXIV preprint]
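
For readers unfamiliar with the common CSR (compressed sparse row) format mentioned in the abstract, here is a minimal Python sketch of sparse matrix-vector multiplication over CSR arrays. It shows plain CSR only, not the authors' row-grouped format and not CUSPARSE code; the function name, the array names `values`, `col_idx` and `row_ptr`, and the sample matrix are illustrative assumptions.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x (sequential sketch)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 matrix:
# [[4, 0, 1],
#  [0, 0, 0],
#  [2, 3, 0]]
values = [4.0, 1.0, 2.0, 3.0]   # nonzero entries, row by row
col_idx = [0, 2, 0, 1]          # column of each nonzero
row_ptr = [0, 2, 2, 4]          # start offset of each row, plus a final end marker
result = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

On a GPU the outer loop is typically distributed across threads, so rows with very different numbers of nonzeros can leave some threads idle; that imbalance is a common motivation for alternative storage formats.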
