We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.
(R. Nath and S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])
Paper submission is now open for GPGPU5 which will be held March 3, 2012 in London, UK, and co-located with ACM ASPLOS XVII. The goal of this workshop is to provide a forum to discuss new and emerging general-purpose purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s work is particularly interested on new heterogeneous GPU platforms. For more information, visit: www.ece.neu.edu/GPGPU/GPGPU5
Submissions are invited for the fifth special session on Computational Intelligence on Consumer Games and Graphics Hardware (CIGPU-2012) to be held in Brisbane, Australia as part of the IEEE World Congress on Computational Intelligence, 10-15 June 2012. More information can be found at http://www.cs.ucl.ac.uk/staff/W.Langdon/cigpu/.
Network intrusion detection systems are faced with the challenge of identifying diverse attacks, in extremely high speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly-complex per-packet and per-ﬂow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high speed networks. To cope with the increased processing throughput requirements, our system parallelizes network trafﬁc processing and analysis at three levels, using multi-queue NICs, multiple CPUs, and multiple GPUs. The proposed design avoids locking, optimizes data transfers between the different processing units, and speeds up data processing by mapping different operations to the processing units where they are best suited. Our experimental evaluation shows that our prototype implementation based on commodity off-the-shelf equipment can reach processing speeds of up to 5.2 Gbit/s with zero packet loss when analyzing trafﬁc in a real network, whereas the pattern matching engine alone reaches speeds of up to 70 Gbit/s, which is an almost four times improvement over prior solutions that use specialized hardware.
(Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis: “MIDeA: A Multi-Parallel Intrusion Detection Architecture”, Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), Oct. 2011. [PDF])
Pattern matching is a highly computationally intensive operation used in a plethora of applications. Unfortunately, due to the ever increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. In this paper, we explore how the highly parallel computational capabilities of commodity graphics processing units (GPUs) can be exploited for high-speed pattern matching. We present the design, implementation, and evaluation of a pattern matching library running on the GPU, which can be used transparently by a wide range of applications to increase their overall performance. The library supports both string searching and regular expression matching on the NVIDIA CUDA architecture. We have also explored the performance impact of different types of memory hierarchies, and present solutions
to alleviate memory congestion problems. The results of our performance evaluation using off-the-self graphics processors demonstrate that GPU-based pattern matching can reach tens of gigabits per second on different workloads.
(Giorgos Vasiliadis, Michalis Polychronakis and Sotiris Ioannidis: “Parallelization and Characterization of Pattern Matching using GPUs”, Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). November 2011. [PDF])
Exposure Render is a Direct Volume Rendering Application that applies progressive Monte Carlo raytracing, coupled with physically based light transport to heterogeneous volumetric data. Exposure Render enables the configuration of any number of arbitrarily shaped area lights, models a real-world camera, including its lens and aperture, and incorporates complex materials, whilst still maintaining interactive display updates. It features both surface and volumetric scattering, and applies noise reduction to remove the unwanted startup noise associated with progressive Monte Carlo rendering. The complete implementation is available in source and binary forms under a permissive free software license.
Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program’s control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a “Thread Frontier”, which is a statically bounded region of the program
containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5-633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.
(Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu and Sudhakar Yalamanchili: “SIMD Re-convergence at Thread Frontiers”. 44th International Symposium on Microarchitecture (MICRO 44), 2011. [WWW])
In this paper, we revisit the design of synchronization primitives—specifically barriers, mutexes, and semaphores—and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention.
(Jeff A. Stuart and John D. Owens: “Efficient Synchronization Primitives for GPUs”, submitted October 2011. [ARXIV]).
A paper detailing several possible avenues to expand MPI to accelerators has just been presented at “Architectures and System for Big Data (ASBD) 2011″, a workshop at PACT 2011. The abstract and a link to the paper are both below. We (the authors) are looking for feedback as to which options seem attractive to GPU programmers and developers. We welcome any comments/thoughts/critiques you might have.
Current trends in computing and system architecture point towards a need for accelerators such as GPUs to have inherent communication capabilities. We review previous and current software libraries that provide pseudo-communication abilities through direct message passing. We show how these libraries are beneficial to the HPC community, but are not forward-thinking enough. We give motivation as to why MPI should be extended to support these accelerators, and provide a road map of achievable milestones to complete such an extension, some of which require advances in hardware and device drivers.
(Jeff A. Stuart, Pavan Balaji and John D. Owens, “Extending MPI to Accelerators”, PACT 2011 Workshop Series: Architectures and Systems for Big Data, October 2011. [WWW])