rCUDA (remote CUDA) v4.0 has just been released. It provides full binary compatibility with CUDA applications (no need to modify the application source code or recompile your program), native InfiniBand support, enhanced data transfers, and CUDA 5.0 API support (excluding graphics interoperability). This new release of rCUDA allows to execute existing GPU-accelerated applications by leveraging remote GPUs within a cluster (both via sharing and/or aggregating GPUs) with a negligible overhead. The new version is available free of charge ar www.rCUDA.net, along with examples, manuals and additional information.
Alea.cuBase allows to create GPU accelerated applications at all levels of sophistication, from simple GPU kernels up to complex GPU algorithms using textures, shared memory and other advanced GPU programming techniques, fully integrated into .NET. The GPU kernels are developed in functional language F# and are callable from any other .NET language. No additional wrappers or assembly translation processes are required. Alea.cuBase allows dynamic creation of GPU code at run time, thereby opening completely new dimensions for GPU accelerated applications. Trial versions are available at http://www.quantalea.net/products.
From a recent announcement:
GTC is the largest conference dedicated to heterogeneous parallel computing to solve the most complex computational challenges and features 300+ sessions over four days.
Whether you’re a commercial developer responsible for getting applications or products to market quickly or a researcher whose results are tied to important funding sources, GTC 2013 offers exceptional opportunities to learn directly from some of the foremost thinkers and practitioners in parallel computing. Immerse yourself in the best practices, solutions, and techniques that can help you enhance your skills, improve your workflow, and accelerate time to your all-important results.
Register today at http://www.gputechconf.com/page/registration-pricing.html.
Attended GTC before? You’re entitled to a 15% discount off a Full Conference or One-Day Conference pass. Please use discount code GMALUM15 when you register.
The Third Workshop on Parallel Computing and Optimization (PCO13) is held in conjunction with the IEEE IPDPS symposium, Boston, USA, May 24, 2013. Paper submission deadline is January 4, 2013.
The workshop on Parallel Computing and Optimization aims at providing a forum for scientific researchers and engineers on recent advances in the field of parallel or distributed computing for difficult combinatorial optimization problems, like 0-1 multidimensional knapsack problems and cutting stock problems, large scale linear programming problems, nonlinear optimization problems and global optimization problems. Emphasis will be placed on new techniques for the solution of these difficult problems like cooperative methods for integer programming problems and polynomial optimization methods. Aspects related to Combinatorial Scientific Computing (CSC) will also be treated. Finally, the use of new approaches in parallel computing like GPU or hybrid computing, peer to peer computing and cloud computing will be considered. Application to planning, logistics, manufacturing, finance, telecommunications and computational biology will be considered.
Please refer to the workshop webpage at http://conf.laas.fr/PCO13 for more details, and for submission instructions.
The latest release 1.4.0 of the free open-source linear algebra library ViennaCL features the following highlights:
- Two computing backends in addition to OpenCL: CUDA and OpenMP
- Improved performance for (Block-) ILU0/ILUT preconditioners
- Optional level scheduling for ILU substitutions on GPUs
- Mixed-precision CG solver
- Initializer types from Boost.uBLAS (unit_vector, zero_vector, etc.)
Any contributions of fast CUDA or OpenCL computing kernels for future releases of ViennaCL are welcome! More information is available at http://viennacl.sourceforge.net.
Nonbinary Low-Density Parity-Check (LDPC) codes are a class of error-correcting codes constructed over the Galois field GF(q) for q > 2. As extensions of binary LDPC codes, nonbinary LDPC codes can provide better error-correcting performance when the code length is short or moderate, but at a cost of higher decoding complexity. This paper proposes a massively parallel implementation of a nonbinary LDPC decoding accelerator based on a graphics processing unit (GPU) to achieve both great flexibility and scalability. The implementation maps the Min-Max decoding algorithm to GPU’s massively parallel architecture. We highlight the methodology to partition the decoding task to a heterogeneous platform consisting of the CPU and GPU. The experimental results show that our GPUbased implementation can achieve high throughput while still providing great flexibility and scalability.
(Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, and Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012. [PDF])
Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU ArchitecturesNovember 14th, 2012
SPECFEM3D is a widely used community code which simulates seismic wave propagation in earth-science applications. It can be run either on multi-core CPUs only or together with many-core GPU devices on large GPU clusters. The new implementation is optimally fine-tuned and achieves excellent performance results. Mesh coloring enables an efficient accumulation of border nodes in the assembly process over an unstructured mesh on the GPU and asynchronous GPU-CPU memory transfers and non-blocking MPI are used to overlap communication and computation, effectively hiding synchronizations. To demonstrate the performance of the inversion, we present two case studies run on the Cray XE6 and XK6 architectures up to 896 nodes: (1) focusing on most commonly used forward simulations, we simulate wave propagation generated by earthquakes in Turkey, and (2) testing the most complex simulation type of the package, we use ambient seismic noise to image 3D crust and mantle structure beneath western Europe.
(Max Rietmann, Peter Messmer, Tarje Nissen-Meyer, Daniel Peter, Piero Basini, Dimitri Komatitsch, Olaf Schenk, Jeroen Tromp, Lapo Boschi and Domenico Giardini, “Forward and Adjoint Simulations of Seismic Wave Propagation on Emerging Large-Scale GPU Architectures”, Proceedings of the 2012 ACM/IEEE conference on Supercomputing, Nov. 2012. [WWW])
This paper addresses the design, implementation and validation of an effective scheduling scheme for both regular and irregular applications on heterogeneous platforms. The scheduler uses an empirical performance model to dynamically schedule the workload, organized into a given number of chunks, and follows the Heterogeneous Earliest Finish Time (HEFT) scheduling algorithm, which ranks the tasks based on both their computation and communication costs. The evaluation of the proposed approach is based on three case studies – the SAXPY, the FFT and the Barnes-Hut algorithms – two regular and one irregular application. The scheduler was evaluated on a heterogeneous platform with one quad-core CPU-chip accelerated by one or two GPU devices, embedded in the GAMA framework. The evaluation runs measured the effectiveness, the efficiency and the scalability of the proposed method. Results show that the proposed model was effective in addressing both regular and irregular applications, on heterogeneous platforms, while achieving ideal (>=100%) levels of efficiency in the irregular Barnes-Hut algorithm.
(Artur Mariano, Ricardo Alves, Joao Barbosa, Luis Paulo Santos and Alberto Proenca: “A (ir)regularity-aware task scheduler for heterogeneous platforms”, Proceedings of the 2nd International Conference on High Performance Computing, Kiev, October 2012, pp 45-56,. [PDF])
Supercomputing luminaries and experts like Jack Dongarra and Takayuki Aoki will be presenting in NVIDIA’s GPU Technology Theater at SC12. Talks will happen every 30 minutes and will also be webcast live with interactive Q&A on NVIDIA’s website. For the complete lineup of science and developer talks visit http://www.nvidia.com/object/sc12-technology-theater.html. SC12 takes place Nov. 10-16 in Salt Lake City, Utah.
We describe an extension of our graphics processing unit (GPU) electronic structure program TeraChem to include atom-centered Gaussian basis sets with d angular momentum functions. This was made possible by a “meta-programming” strategy that leverages computer algebra systems for the derivation of equations and their transformation to correct code. We generate a multitude of code fragments that are formally mathematically equivalent, but differ in their memory and floating-point operation footprints. We then select between different code fragments using empirical testing to find the highest performing code variant. This leads to an optimal balance of floating-point operations and memory bandwidth for a given target architecture without laborious manual tuning. We show that this approach is capable of similar performance compared to our hand-tuned GPU kernels for basis sets with s and p angular momenta. We also demonstrate that mixed precision schemes (using both single and double precision) remain stable and accurate for molecules with d functions. We provide benchmarks of the execution time of entire self-consistent field (SCF) calculations using our GPU code and compare to mature CPU based codes, showing the benefits of the GPU architecture for electronic structure theory with appropriately redesigned algorithms. We suggest that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.
(Alexey V Titov , Ivan S. Ufimtsev , Nathan Luehr and Todd J. Martínez: “Generating Efficient Quantum Chemistry Codes for Novel Architectures”, accepted for publication in the Journal of Chemical Theory and Computation, 2012. [DOI])