General-purpose GPU computing has recently drawn attention from the high-performance computing community because GPUs offer higher core density and lower energy per instruction (EPI) than CPUs. The newest Top500 report lists thirty-nine supercomputing systems that use GPUs to accelerate computation, including two Chinese systems, Tianhe-1A and Nebulae, at No. 2 and No. 4, and one Japanese system, Tsubame 2.0, at No. 5. Amazon has announced the availability of Cluster GPU Instances for Amazon EC2 to deliver the computational power of GPUs in the cloud. More and more researchers use GPU clusters instead of CPU clusters for compute-intensive problems in areas such as high-energy physics, scientific simulation, data mining, climate forecasting, and earthquake prediction. As the impact of GPUs on both academia and engineering grows rapidly, many issues in GPU cluster computing must still be addressed to improve and enrich its user experience and applications. For example, GPU programming interfaces such as CUDA and OpenCL are too complex for many users to move their applications to this new platform, because these interfaces and models differ considerably from MPI and OpenMP, which are widely used in CPU cluster computing. In addition, users lack friendly and efficient tools, such as debuggers and performance analyzers, during program development. Meanwhile, computing systems built on GPU clusters urgently need tools that monitor and manage GPU resources effectively, both to improve system throughput and to maintain the QoS and reliability of user applications. This special issue therefore aims to provide a forum for researchers to present innovative designs, implementations, and experience in software for GPU cluster computing. We encourage authors to submit high-quality, original, unpublished papers.
Potential topics include, but are not limited to:
SnuCL is a freely available, open-source OpenCL framework developed at Seoul National University. It naturally extends the original OpenCL semantics to the heterogeneous cluster environment. The target cluster consists of a single host node and multiple compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand switches. The host node contains multiple CPU cores, and each compute node consists of multiple CPU cores and multiple GPUs. For such clusters, SnuCL gives the programmer the illusion of a single heterogeneous system. A GPU or a set of CPU cores becomes an OpenCL compute device, and SnuCL allows the application to use compute devices in a compute node as if they were in the host node. Thus, with SnuCL, OpenCL applications written for a single heterogeneous system with multiple OpenCL compute devices can run on the cluster without any modifications. SnuCL achieves both high performance and ease of programming in a heterogeneous cluster environment.
SnuCL consists of the SnuCL runtime and the SnuCL compiler. The compiler is based on the OpenCL C compiler in the SNU-SAMSUNG OpenCL framework. Currently, it supports x86, ARM, and PowerPC CPUs, AMD GPUs, and NVIDIA GPUs.
Acceleware has announced two training courses:
Developed in partnership with AMD, this four-day course (August 21-24, 2012) is designed for GPU programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU. Register before July 31 and receive $200 off your course fee! Enter promotional code AXTEB2012.
Offered in partnership with NVIDIA, this four-day course (July 17-20, 2012) is designed for programmers who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the multi-core processing capabilities of the GPU.
C++ Accelerated Massive Parallelism (C++ AMP) is a new open-specification heterogeneous programming model that builds on the established C++ language. Developed for heterogeneous computing platforms, C++ AMP is designed to accelerate the execution of your C++ code by taking advantage of the data-parallel hardware commonly present in GPUs and multi-core CPUs. This four-day course is aimed at programmers who are looking to develop comprehensive skills in writing and optimizing applications using C++ AMP.
Delivered by Acceleware’s Developers (as opposed to trained trainers!), the course is designed for programmers looking to acquire comprehensive skills in accelerating applications through parallel programming.
Beyond3D’s first C++ AMP-focused contest accepts submissions until August 31, 2012. The contest’s goal is to use parallel programming to speed up solving the Traveling Salesman Problem. All relevant details are provided on the contest’s dedicated page.
A wide range of applications in engineering and scientific computing depend on accelerating the sparse matrix vector product (SpMV). Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors, and SpMV implementations for GPUs have already appeared. This work focuses on the ELLR-T algorithm for computing SpMV on GPU architectures; its performance depends strongly on the optimal selection of two parameters. Therefore, taking into account that memory operations dominate the performance of ELLR-T, an analytical model is proposed to auto-tune ELLR-T for particular combinations of sparse matrix and GPU architecture. The evaluation results with a representative set of test matrices show that the average performance achieved by auto-tuned ELLR-T using the proposed model is close to the optimum. A comparative analysis of ELLR-T against a variety of previous proposals shows that ELLR-T with the estimated configuration reaches the best performance on GPU architectures for the representative set of test matrices.
(Francisco Vázquez and José Jesús Fernández and Ester M. Garzón: “Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach”, Parallel Computing 38(8), 408-420, Aug. 2012. [DOI])
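To make the storage idea behind the abstract above more concrete, here is a minimal CPU-side sketch in Python of an ELLPACK-R style layout (padded value and column arrays plus a per-row length array) and an SpMV over it. The abstract does not name the two tuning parameters; the `t` argument below, which emulates several threads cooperating on one row, is an assumption made for illustration, and the function names are hypothetical. This is not the authors' GPU kernel.

```python
def to_ellr(dense):
    """Convert a dense matrix (list of rows) to ELLPACK-R style arrays.

    Returns (vals, cols, row_len): padded values, padded column indices,
    and the true number of nonzeros in each row.
    """
    row_len = [sum(1 for v in row if v != 0) for row in dense]
    width = max(row_len) if row_len else 0
    vals, cols = [], []
    for row in dense:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        pad = width - len(nz)
        cols.append([j for j, _ in nz] + [0] * pad)      # padding index is never read
        vals.append([float(v) for _, v in nz] + [0.0] * pad)
    return vals, cols, row_len


def spmv_ellr_t(vals, cols, row_len, x, t=2):
    """Compute y = A x, with each row processed by t emulated threads.

    Assumption: lane `lane` handles entries lane, lane + t, lane + 2t, ...
    of the row, and the partial sums are reduced at the end.
    """
    y = []
    for i in range(len(row_len)):
        partial = [0.0] * t
        for lane in range(t):
            # row_len[i] bounds the loop, so padded entries are skipped
            for k in range(lane, row_len[i], t):
                partial[lane] += vals[i][k] * x[cols[i][k]]
        y.append(sum(partial))
    return y


if __name__ == "__main__":
    A = [[4, 0, 1],
         [0, 0, 2],
         [3, 5, 6]]
    print(spmv_ellr_t(*to_ellr(A), [1, 2, 3], t=2))
```

Because the row-length array bounds each inner loop, padded entries cost storage but no arithmetic, which is why memory traffic rather than computation is the natural thing to model when tuning.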
In this paper, we show how to employ Graphics Processing Units (GPUs) to provide an efficient, high-performance solution for finding frequent items in data streams. We discuss several design alternatives and present an implementation that exploits the great capability of graphics processors in parallel sorting. We provide an exhaustive evaluation of performance, result quality, and several design trade-offs. On an off-the-shelf GPU, the fastest of our implementations can process over 200 million items per second, which is better than the best known solutions based on Field Programmable Gate Arrays (FPGAs) and CPUs. Moreover, in previous approaches, performance is directly related to the skewness of the input data distribution, while in our approach the high throughput is independent of this factor.
(Ugo Erra, Bernardino Frola: “Frequent Items Mining Acceleration Exploiting Fast Parallel Sorting on the GPU”, Procedia Computer Science 9, pp 86-95 (Proceedings of the International Conference on Computational Science), 2012. [DOI])
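The role sorting plays in frequent-item counting can be sketched on the CPU in a few lines of Python. The sketch below is an exact count over a single batch, not the streaming algorithm of the paper: it assumes the whole batch fits in memory, and the function name and the `phi` threshold parameter are hypothetical. Once a batch is sorted, equal items are adjacent, so counting reduces to measuring run lengths.

```python
def frequent_items(batch, phi):
    """Return {item: count} for items occurring more than phi * len(batch) times.

    Assumption: exact counting over one in-memory batch. On a GPU, the
    sort would be the parallel step; here it is Python's built-in sort.
    """
    threshold = phi * len(batch)
    ordered = sorted(batch)
    result = {}
    start = 0
    # Walk the sorted batch and close a run whenever the value changes
    # (or the end of the batch is reached).
    for i in range(1, len(ordered) + 1):
        if i == len(ordered) or ordered[i] != ordered[start]:
            count = i - start
            if count > threshold:
                result[ordered[start]] = count
            start = i
    return result


if __name__ == "__main__":
    print(frequent_items([3, 1, 3, 2, 3, 1, 3, 4], phi=0.25))
```

The cost of the sort does not depend on how skewed the value distribution is, which is consistent with the abstract's remark that throughput is independent of input skew.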
In this work, we describe a simple and powerful method to implement real-time multi-agent path-finding on Graphics Processor Units (GPUs). The technique aims to find potential paths for many thousands of agents, using the A* algorithm and an input grid map partitioned into blocks. We propose an implementation for the GPU that uses a search space decomposition approach to break down the forward search A* algorithm into parallel, independent forward sub-searches. We show that this approach fits well with the programming model of GPUs, enabling planning for many thousands of agents in parallel in real-time applications such as computer games and robotics. The paper describes this implementation using the Compute Unified Device Architecture programming environment, and demonstrates its advantages in GPU performance compared to a GPU implementation of Real-Time Adaptive A*.
(Giuseppe Caggianese, Ugo Erra: “GPU Accelerated Multi-agent Path Planning Based on Grid Space Decomposition”, Procedia Computer Science 9, pp 1847-1856 (Proceedings of the International Conference on Computational Science), 2012. [DOI])
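For readers unfamiliar with the building block named in the abstract above, here is a minimal Python sketch of ordinary single-agent A* on a 4-connected grid with a Manhattan-distance heuristic. It deliberately does not reproduce the block decomposition or the GPU mapping described in the paper; the grid encoding (0 for free, 1 for wall) and the function name are assumptions for illustration.

```python
import heapq


def astar(grid, start, goal):
    """Return the shortest 4-connected path length from start to goal, or None.

    grid: list of rows, 0 = free cell, 1 = wall (assumed encoding).
    start, goal: (row, col) tuples on free cells.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}                    # cheapest known cost to each cell
    frontier = [(h(start), 0, start)]    # entries are (f = g + h, g, cell)
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best[cell]:
            continue                     # stale queue entry, a cheaper one was found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None


if __name__ == "__main__":
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    print(astar(grid, (0, 0), (2, 0)))
```

Each call is independent of every other call, which is the property that makes running many such searches side by side, one per agent or per sub-search, a natural fit for data-parallel hardware.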
A novel algorithm for computing the incomplete-LU and Cholesky factorization with 0 fill-in on a graphics processing unit (GPU) is proposed. It implements the incomplete factorization of the given matrix in two phases. First, the symbolic analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the numerical factorization phase obtains the resulting lower and upper sparse triangular factors by iterating sequentially across the constructed levels. The Gaussian elimination of the elements below the main diagonal in the rows corresponding to each single level is performed in parallel. Numerical experiments are also presented, showing that the numerical factorization phase can achieve on average more than 2.8x speedup over MKL, while the incomplete-LU and Cholesky preconditioned iterative methods can achieve an average of 2x speedup on GPU over their CPU implementation.
(Maxim Naumov. Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU, NVIDIA Technical Report NVR-2012-003. May 2012.)
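The symbolic analysis phase described in the abstract above, grouping independent rows into levels, can be sketched as follows in Python. The sketch assumes that row i depends on every earlier row j < i for which entry (i, j) is nonzero, which is one reading of the dependency graph; the input format (a list of column-index lists) and the function name are hypothetical, and the numerical factorization itself is not shown.

```python
def build_levels(pattern):
    """Group rows of a sparse matrix into dependency levels.

    pattern[i] is the list of column indices of the nonzeros in row i.
    Assumption: row i depends on each row j < i with a nonzero at (i, j).
    Returns a list of levels; rows inside one level are mutually
    independent and could be processed in parallel.
    """
    n = len(pattern)
    level = [0] * n
    for i in range(n):
        deps = [j for j in pattern[i] if j < i]
        # A row sits one level after the deepest row it depends on.
        level[i] = 1 + max(level[j] for j in deps) if deps else 0
    levels = [[] for _ in range(max(level) + 1)] if n else []
    for i, lvl in enumerate(level):
        levels[lvl].append(i)
    return levels


if __name__ == "__main__":
    pattern = [[0], [1], [0, 2], [1, 2, 3], [4]]
    print(build_levels(pattern))
```

The number of levels bounds the number of sequential steps, so matrices whose sparsity pattern yields few, wide levels expose the most parallelism.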
Register now for a webinar series on how to use the Intel® SDK for OpenCL Applications to best utilize the CPU and Intel® HD Graphics of 3rd Gen Intel® Core™ processors for developing OpenCL applications:
- July 11 – Getting Started with Intel® SDK for OpenCL Applications
- July 18 – Writing Efficient Code for OpenCL Applications
- July 25 – Creating and Optimizing OpenCL Applications