The 21st High Performance Computing Symposium (HPC 2013), devoted to the impact of high performance computing and communications on computer simulations. Advances in multicore and many-core architectures, networking, high end computers, large data stores, and middleware capabilities are ushering in a new era of high performance parallel and distributed simulations. Along with these new capabilities come new challenges in computing and system modeling. The goal of HPC 2013 is to encourage innovation in high performance computing

and communication technologies and to promote synergistic advances in modeling methodologies and simulation. It will promote the exchange of ideas and information between universities, industry, and national laboratories about new developments in system modeling, high performance computing and communication, and scientific computing and simulation. Read the rest of this entry »

## CfP: High Performance Computing Symposium

November 8th, 2012## Call For Papers: Sixth Workshop on General Purpose Processing Using GPUs

November 6th, 2012The Sixth Workshop on General Purpose Processing Using GPUs (GPGPU6) is held in conjunction with ASPLOS XVIII, Houston, TX, March 17, 2013.

Overview: The goal of this workshop is to provide a forum to discuss new and emerging general-purpose purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s work is particularly interested on new heterogeneous GPU platforms. Papers are being sought on many aspects of GPUs, including (but not limited to):

- GPU applications + GPU compilation
- GPU programming environments + GPU power/efficiency
- GPU architectures + GPU benchmarking/measurements
- Multi-GPU systems + Heterogeneous GPU platforms

Submission Information: Authors should submit their papers using the ACM SIG Proceedings format in double-column style using the directions on the conference website at http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU6. Submitted papers will be evaluated based on originality, significance to topics, technical soundness, and presentation quality. At least one author must register and attend GPGPU to present the work. Accepted papers will be included in preliminary proceedings and distributed at the event. All papers will be made available at the workshop and will also be published in the ACM Conference Proceedings Series.

## GPU Technology Conference 2013 Call for Posters is Open

November 6th, 2012We’re looking for novel or interesting research topics in GPU computing, computer graphics, cloud graphics, game development, and applications of GPUs. We strongly encourage international attendees to submit early in order to receive notifications in time for US visa deadlines. Learn more at http://www.gputechconf.com/page/home.html.

## OpenCL CodeBench Eclipse Code Creation Tools

November 3rd, 2012OpenCL CodeBench is a code creation and productivity tools suite designed to accelerate and simplify OpenCL software development. OpenCL CodeBench provides developers with automation tools for host code and unit test bench generation. Kernel code development on OpenCL is accelerated and enhanced through a language aware editor delivering advanced incremental code analysis features. Software Programmers new to OpenCL can choose to be guided through an Eclipse wizard, while the power users can leverage the command line interface with XML-based configuration files. OpenCL CodeBench Beta is now available for Linux and Windows operating systems.

## Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU

October 30th, 2012Abstract:

We present new format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on set of 1 600 matrices and we show for what types of matrices our format is profitable.

(Heller M., Oberhuber T.: *“Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU”*, Proceedings of Algoritmy 2012, 2012, Handlovičová A., Minarechová Z. and Ševčovič D. (ed.), pages 282-290, ISBN 978-80-227-3742-5) [ARXIV preprint]

## Jacket v2.3 Now Available for GPU computing in MATLAB

October 26th, 2012Jacket enables GPU computing for MATLAB® codes. The new version v2.3 includes performance improvements and new support for CUDA 5.0. This newer version of CUDA enables computation on the latest Kepler K20 GPUs of the NVIDIA Tesla product line.

More information: http://blog.accelereyes.com/blog/2012/10/23/jacket-v2-3/

## Webinar: Learn How GPU-Accelerated Applications Benefit Academic Research

October 26th, 2012GPUs have become a corner stone of computational research in high performance computing with over 200 commonly used applications already GPU-enabled. Researchers across many domains, such as Computational Chemistry, Biology, Weather & Climate, and Engineering, are using GPU-accelerated applications to greatly reduce time to discovery by achieving results that were simply not possible before.

Join Devang Sachdev, Sr. Product Manager, NVIDIA for an overview of the most popular applications used in academic research and an account of success stories enabled by GPUs. Learn also about a complimentary program which allows researchers to easily try GPU-accelerated applications on a remotely hosted cluster or Amazon AWS cloud.

Register at http://www.gputechconf.com/page/gtc-express-webinar.html.

## Portable LDPC Decoding on Multicores Using OpenCL

October 24th, 2012Abstract:

This article proposes to address, in a tutorial style, the benefits of using Open Computing Language (OpenCL) as a quick way to allow programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we will show how multiplatform kernels can be developed straightforwardly using OpenCL to perform computationally intensive low-density parity-check (LDPC) decoding, targeting them to run on a large set of worldwide disseminated multicore architectures, such as x86 general- purpose multicore central processing units (CPUs) and graphics processing units (GPUs). Moreover, devices with different architectures can be orchestrated to cooperatively execute these signal processing applications programmed in OpenCL. Experimental evaluation of the parallel kernels programmed with the OpenCL framework shows that high-performance can be achieved for distinct parallel computing architectures with low programming effort.

The complete source code developed and instructions for compiling and executing the program are available at http://www.co.it.pt/ldpcopencl for signal processing programmers who wish to engage with more advanced features supported by OpenCL.

(G. Falcao, V. Silva, L. Sousa and J. Andrade: *“Portable LDPC Decoding on Multicores Using OpenCL [Applications Corner]”*, IEEE Signal Processing Magazine 29:4(81-109), July 2012. [DOI])

## Parallel Sparse Approximate Inverse Preconditioning on Graphic Processing Units

October 22nd, 2012Abstract:

Accelerating numerical algorithms for solving sparse linear systems on parallel architectures has attracted the attention of many researchers due to their applicability to many engineering and scientific problems. The solution of sparse systems often dominates the overall execution time of such problems and is mainly solved by iterative methods. Preconditioners are used to accelerate the convergence rate of these solvers and reduce the total execution time. Sparse Approximate Inverse (SAI) preconditioners are a popular class of preconditioners designed to improve the condition number of large sparse matrices and accelerate the convergence rate of iterative solvers for sparse linear systems. We propose a GPU accelerated SAI preconditioning technique called GSAI, which parallelizes the computation of this preconditioner on NVIDIA graphic cards. The preconditioner is then used to enhance the convergence rate of the BiConjugate Gradient Stabilized (BiCGStab) iterative solver on the GPU. The SAI preconditioner is generated on average 28 and 23 times faster on the NVIDIA GTX480 and TESLA M2070 graphic cards respectively compared to ParaSails (a popular implementation of SAI preconditioners on CPU) single processor/core results. The proposed GSAI technique computes the SAI preconditioner in approximately the same time as ParaSails generates the same preconditioner on 16 AMD Opteron 252 processors.

(Maryam Mehri Dehnavi, David Fernandez, Jean-Luc Gaudiot and Dennis Giannacopoulos: *“Parallel Sparse Approximate Inverse Preconditioning on Graphic Processing Units”,* IEEE Transactions on Parallel and Distributed Systems (to appear). [DOI])

## CUDA 5 Production Release Now Available

October 15th, 2012The CUDA 5 Production Release is now available as a free download at www.nvidia.com/getcuda.

This powerful new version of the pervasive CUDA parallel computing platform and programming model can be used to accelerate more of applications using the following four (and many more) new features.

• CUDA Dynamic Parallelism brings GPU acceleration to new algorithms by enabling GPU threads to directly launch CUDA kernels and call GPU libraries.

• A new device code linker enables developers to link external GPU code and build libraries of GPU functions.

• NVIDIA Nsight Eclipse Edition enables you to develop, debug and optimize CUDA code all in one IDE for Linux and Mac OS.

• GPUDirect Support for RDMA provides direct communication between GPUs in different cluster nodes

As a demonstration of the power of Dynamic Parallelism and device code linking, CUDA 5 includes a device-callable version of the CUBLAS linear algebra library, so threads already running on the GPU can invoke CUBLAS functions on the GPU. Read the rest of this entry »