We present a hybrid algorithm to compute the convex hull of points in three and higher dimensional spaces. Our formulation uses a GPU-based interior point filter to cull away many of the points that do not belong to the boundary. The convex hull of the remaining points is computed on the CPU. The GPU-based filter proceeds in an incremental manner and computes a pseudo-hull that is contained inside the convex hull of the original points. The pseudo-hull computation involves only localized operations and therefore maps well to GPU architectures. Furthermore, the underlying approach extends to high-dimensional point sets and deforming points. In practice, our culling filter can reduce the number of candidate points by two orders of magnitude. We have implemented the hybrid algorithm on commodity GPUs and evaluated its performance on several large point sets. In practice, the GPU-based filtering algorithm can cull up to 85M interior points per second on an NVIDIA GeForce GTX 580, and the hybrid algorithm improves the overall performance of convex hull computation by 10-27 times (for static point sets) and 22-46 times (for deforming point sets).
(Min Tang, Jie-yi Zhao, Ruofeng Tong, and Dinesh Manocha: “GPU accelerated Convex Hull Computation”, accepted by SMI’2012. [WWW] [PREPRINT])
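The abstract ships no code, so as a rough sketch of the kind of GPU interior-point filter described above, the kernel below culls every point that lies strictly inside a tetrahedron spanned by four extreme points. All names are ours, and the paper's actual pseudo-hull filter is incremental and considerably more elaborate; this only illustrates why the test is a localized, per-point operation that maps well to the GPU.

```cuda
#include <cuda_runtime.h>

// Signed volume of the tetrahedron (a, b, c, d): its sign tells which side
// of the plane through a, b, c the point d lies on.
__device__ float orient3d(float3 a, float3 b, float3 c, float3 d)
{
    float bx = b.x - a.x, by = b.y - a.y, bz = b.z - a.z;
    float cx = c.x - a.x, cy = c.y - a.y, cz = c.z - a.z;
    float dx = d.x - a.x, dy = d.y - a.y, dz = d.z - a.z;
    return bx * (cy * dz - cz * dy)
         - by * (cx * dz - cz * dx)
         + bz * (cx * dy - cy * dx);
}

// True if p lies on the same side of plane (a, b, c) as the vertex ref.
__device__ bool sameSide(float3 a, float3 b, float3 c, float3 ref, float3 p)
{
    return orient3d(a, b, c, ref) * orient3d(a, b, c, p) > 0.0f;
}

// One thread per point: a point strictly inside the tetrahedron (t0..t3) of
// extreme points cannot lie on the convex hull, so it is marked for culling.
// The surviving candidates are handed to the exact hull code on the CPU.
__global__ void cullInterior(const float3 *pts, int n,
                             float3 t0, float3 t1, float3 t2, float3 t3,
                             int *candidate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 p = pts[i];
    bool inside = sameSide(t1, t2, t3, t0, p) && sameSide(t0, t2, t3, t1, p)
               && sameSide(t0, t1, t3, t2, p) && sameSide(t0, t1, t2, t3, p);
    candidate[i] = inside ? 0 : 1;   // 0 = cull, 1 = possible hull vertex
}
```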
We study the use of a GPU for the numerical approximation of curvature dependent flows of graphs – the mean-curvature flow and the Willmore flow. Both problems are often applied in image processing, where fast solvers are required. We approximate these problems using the complementary finite volume method combined with the method of lines. We obtain a system of ordinary differential equations, which we solve with the Runge–Kutta–Merson solver, a robust solver with an automatic choice of the integration time step. We implement this solver on the CPU as well as on the GPU using the CUDA toolkit. We demonstrate that the mean-curvature flow can be successfully approximated in single precision arithmetic with a speed-up of almost 17 on the Nvidia GeForce GTX 280 card compared to an Intel Core 2 Quad CPU. On the same card, we obtain a speed-up of 7 in double precision arithmetic, which is necessary for the fourth-order problem – the Willmore flow of graphs. Both speed-ups were achieved without affecting the accuracy of the approximation. The article is structured in such a way that a reader interested only in the implementation of the Runge–Kutta–Merson solver on the GPU can skip the sections containing the mathematical formulation of the problems.
(Oberhuber T., Suzuki A., Žabka V.: “The CUDA implementation of the method of lines for the curvature dependent flows”, Kybernetika 47(2):251–272, 2011. [PDF])
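For readers who want the solver itself, here is a minimal single-equation sketch of one adaptive Runge–Kutta–Merson step. The coefficients are the standard Merson tableau; in the paper's GPU solver each k_i is a grid-sized vector produced by a CUDA kernel evaluating the discretized curvature operator, and the step-size heuristics may differ from the simple halve/double rule used here.

```cuda
#include <cmath>
#include <cstdio>

// One adaptive step of the Runge-Kutta-Merson scheme for y' = f(t, y),
// shown for a scalar ODE. On acceptance, t and y are advanced; on
// rejection, the step h is halved and the stages are recomputed.
template <typename F>
void mersonStep(F f, double &t, double &y, double &h, double tol)
{
    for (;;) {
        double k1 = f(t,           y);
        double k2 = f(t + h / 3.0, y + h / 3.0 * k1);
        double k3 = f(t + h / 3.0, y + h / 6.0 * (k1 + k2));
        double k4 = f(t + h / 2.0, y + h * (0.125 * k1 + 0.375 * k3));
        double k5 = f(t + h,       y + h * (0.5 * k1 - 1.5 * k3 + 2.0 * k4));
        double yNew = y + h / 6.0 * (k1 + 4.0 * k4 + k5);
        // Merson's embedded error estimate.
        double err = std::fabs(h / 30.0 * (2.0 * k1 - 9.0 * k3 + 8.0 * k4 - k5));
        if (err <= tol) {
            t += h;
            y = yNew;
            if (err < tol / 64.0) h *= 2.0;  // error well below tolerance: grow the step
            return;
        }
        h *= 0.5;  // reject and retry with half the step
    }
}

int main()
{
    // Example: y' = -y, y(0) = 1, integrated to t = 1 (exact value 1/e).
    double t = 0.0, y = 1.0, h = 0.1;
    while (t < 1.0) {
        if (t + h > 1.0) h = 1.0 - t;  // do not step past the end time
        mersonStep([](double t_, double y_) { return -y_; }, t, y, h, 1e-8);
    }
    std::printf("y(1) = %.10f (exact %.10f)\n", y, std::exp(-1.0));
    return 0;
}
```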
We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique, based on cascades of sparse matrix-vector multiplication, by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces, as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve an average speedup of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers deliver a speedup of 3-5, depending on the element space, compared to simple Jacobi smoothing; this can be enhanced further, to a factor of 7, by combining the Approximate Inverse-based smoothers with clever sorting of the degrees of freedom. In total, the FE-gMG solver can outperform a simple (multicore-)CPU-based multigrid by a factor of over 40.
(Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac and Stefan Turek: “Towards a complete FEM-based simulation toolkit on GPUs: Unstructured Grid Finite Element Geometric Multigrid solvers with strong smoothers based on Sparse Approximate Inverses”, accepted for publication in Computers and Fluids, 2011. [preprint])
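The smoothers in this approach reduce to cascades of sparse matrix-vector products: with a sparse approximate inverse M ≈ A⁻¹, one smoothing step is x ← x + M(b − Ax). Below is a minimal one-thread-per-row CSR SpMV kernel of the kind such a cascade is built from; this is our own simplified sketch, not the FE-gMG production kernel, which uses tuned sparse formats and data layouts.

```cuda
#include <cuda_runtime.h>

// y = A x for a CSR matrix (rowPtr, colIdx, val), one thread per row.
__global__ void spmvCsr(int nRows, const int *rowPtr, const int *colIdx,
                        const double *val, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += val[j] * x[colIdx[j]];
    y[row] = sum;
}

// x <- x + alpha * z. Chaining spmvCsr and axpy yields the smoothing step
// x <- x + M (b - A x) as a short sequence of bulk-parallel kernel launches,
// which is what makes an Approximate Inverse smoother GPU-friendly, in
// contrast to the inherently sequential forward/backward sweeps of
// Gauss-Seidel-type smoothers.
__global__ void axpy(int n, double alpha, const double *z, double *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += alpha * z[i];
}
```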
Savant is an asymptotic ray-tracing CEM tool used to predict the performance of antennas installed on electrically large platforms, including far-field antenna patterns, near-field distributions, and antenna-to-antenna coupling. Savant is based on the shooting and bouncing rays (SBR) formulation. While asymptotic solvers like Savant have significantly smaller computational and memory requirements for electrically large problems than full-wave techniques, the computation costs still increase significantly with frequency and simulation fidelity, and such solvers benefit greatly from parallelization techniques. Graphics processing units (GPUs) are throughput-oriented processing devices that are well suited for the mathematically intensive workloads found in CEM solvers. Current GPUs contain hundreds of processing units, leverage thousands of threads, and can execute over one trillion floating-point operations per second. A hybrid CPU and GPU parallelization approach has been developed for Savant, providing significant speedups compared to CPU-only implementations. Results from the execution of GPU-accelerated Savant on multiple case studies will be presented.
(T. Courtney, J. E. Stone and R. Kipp, “Using GPUs to Accelerate installed antenna performance simulations,” Proc. Allerton Antenna Symposium, Sept. 2011, Monticello, IL. [PDF])
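Savant itself is proprietary, but the geometric core of any shooting-and-bouncing-rays formulation is easy to picture: launch a dense grid of rays at the platform, reflect each ray specularly at every hit, and keep the exit rays for the subsequent field-integration stage. A hedged sketch, with a trivial ground plane standing in for the platform geometry (a real solver traverses an acceleration structure over the mesh, and all names here are ours):

```cuda
#include <cuda_runtime.h>

struct Hit { float3 point; float3 normal; bool valid; };

// Stand-in geometry: a single ground plane z = 0.
__device__ Hit intersect(float3 o, float3 d)
{
    Hit h; h.valid = false;
    if (d.z < -1e-6f && o.z > 0.0f) {
        float t = -o.z / d.z;
        h.point  = make_float3(o.x + t * d.x, o.y + t * d.y, 0.0f);
        h.normal = make_float3(0.0f, 0.0f, 1.0f);
        h.valid  = true;
    }
    return h;
}

// Specular reflection r = d - 2 (d . n) n, with n of unit length.
__device__ float3 reflect(float3 d, float3 n)
{
    float dn = d.x * n.x + d.y * n.y + d.z * n.z;
    return make_float3(d.x - 2.0f * dn * n.x,
                       d.y - 2.0f * dn * n.y,
                       d.z - 2.0f * dn * n.z);
}

// One thread per ray: bounce until the ray leaves the scene, then record
// the exit direction for the field integration stage. Rays are mutually
// independent, which is why SBR maps so naturally onto GPU threads.
__global__ void sbrTrace(const float3 *origins, const float3 *dirs, int nRays,
                         int maxBounces, float3 *exitDirs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nRays) return;
    float3 o = origins[i], d = dirs[i];
    for (int b = 0; b < maxBounces; ++b) {
        Hit h = intersect(o, d);
        if (!h.valid) break;          // ray left the geometry
        d = reflect(d, h.normal);     // bounce
        o = h.point;
    }
    exitDirs[i] = d;
}
```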
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used to solve a number of scientific and engineering problems. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P).
Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.
(Pennycook, S.J., Hammond, S.D., Mudalige, G.R., Wright, S.A. and Jarvis, S.A.: “On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures”, The Computer Journal (in press) [DOI] [PREPRINT])
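The abstract does not reproduce the wavefront mapping itself, so here is the standard hyperplane approach such ports use, shown in 2D form: all cells on an anti-diagonal are mutually independent, and each diagonal becomes one kernel launch. The names and the toy stencil are ours; the NAS LU benchmark sweeps a 3D grid, and, as we read the abstract, the proposed k-blocking batches several k-planes per step to improve the ratio of computation to launch and PCIe overheads.

```cuda
#include <cuda_runtime.h>

// Wavefront dependency: cell (i, j) needs (i-1, j) and (i, j-1), so all
// cells with the same i + j are independent. A toy averaging stencil
// stands in for the LU solver's triangular-solve body.
__global__ void sweepDiagonal(double *grid, int nx, int ny, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = d - i;
    if (i < 1 || i >= nx || j < 1 || j >= ny) return;  // row/col 0 hold boundary data
    grid[i * ny + j] = 0.5 * (grid[(i - 1) * ny + j] + grid[i * ny + (j - 1)]);
}

// One launch per anti-diagonal; in-order execution on the default stream
// enforces the dependency between successive diagonals.
void sweep(double *dGrid, int nx, int ny)
{
    for (int d = 2; d <= nx + ny - 2; ++d) {
        int threads = 128;
        int blocks = (d + threads) / threads;  // covers i = 0..d
        sweepDiagonal<<<blocks, threads>>>(dGrid, nx, ny, d);
    }
}
```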
A GPU-based parallel star retrieval method is proposed to improve the efficiency of searching a star catalogue in computer simulation, especially when the FOV (Field of View) is large. In the proposed algorithm, the stars in the catalogue are first classified and stored in different zones using a latitude-and-longitude zoning method. Based on this easily accessible catalogue layout, the star zones that the FOV covers can be computed exactly by constructing a spherical triangle around the FOV, which effectively reduces the search scope. Finally, we use the CUDA architecture to retrieve stars from those zones in parallel on the GPU. Experimental results show that, in comparison with a CPU-oriented implementation, the proposed algorithm achieves up to tens of times speedup, and the processing time stays at the millisecond level even for a large FOV and a wide star magnitude span, meeting the requirement of real-time simulation.
(Chao Li, Liqiang Zhang, Jiaze Wu, and Changwen Zheng, “Parallel Accelerating for Star Catalogue Retrieval Algorithm using GPUs”, Journal of Astronautics, 2012)
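As a hedged illustration of the two ideas in this abstract (zoning plus parallel retrieval), here is a minimal sketch. The zone layout and all names are our assumptions, and the paper's exact zone selection via the spherical triangle around the FOV is more refined than the simple cone test used here.

```cuda
#include <cuda_runtime.h>

// Latitude/longitude zoning: the catalogue is pre-sorted by zone on the
// host, which also selects the candidate zones the FOV covers; the GPU
// then tests every star of those zones against the FOV in parallel.
__host__ __device__ inline int zoneOf(float raDeg, float decDeg, int nRa, int nDec)
{
    int zRa  = (int)(raDeg / (360.0f / nRa));
    int zDec = (int)((decDeg + 90.0f) / (180.0f / nDec));
    if (zRa  >= nRa)  zRa  = nRa  - 1;   // clamp the RA wrap / polar edges
    if (zDec >= nDec) zDec = nDec - 1;
    return zDec * nRa + zRa;
}

// One thread per candidate star: the star is inside the FOV if the angle
// between its unit direction vector and the boresight is at most half the
// field of view, i.e. their dot product is at least cos(FOV / 2).
__global__ void filterFov(const float3 *starDir, const int *candidates, int nCand,
                          float3 boresight, float cosHalfFov, int *inFov)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nCand) return;
    float3 s = starDir[candidates[t]];
    float c = s.x * boresight.x + s.y * boresight.y + s.z * boresight.z;
    inFov[t] = (c >= cosHalfFov) ? 1 : 0;
}
```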
In order to test the function and performance of a star sensor on the ground, a fast method for simulating star maps is presented. The algorithm adopts the instantaneous coordinates of stars and improves star-searching efficiency by optimizing the zone partitioning of the star catalogue. We overcome the low accuracy of the latitude and longitude span that the FOV overlays by proposing a new spherical right-angled triangle method, which greatly reduces the search scope; meanwhile, a simulation model for star brightness is built based on the adopted star catalogue. A simulation study demonstrates the algorithm. The proposed approach meets the requirements of a wide magnitude range and a short simulation period.
(Chao Li, Changwen Zheng, Jiaze Wu, and Liqiang Zhang, “A fast algorithm of simulating star map for star sensor”, Proceedings of the 3rd IEEE International Conference on Computer and Network Technology (IEEE ICCNT), 2011)
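For the brightness model, a common choice (ours here, not necessarily the paper's exact model) is Pogson's relation, under which each magnitude step changes the received flux by a factor of 10^0.4 ≈ 2.512. Mapping flux to pixel intensity then looks like this, with m0 an assumed calibration magnitude at which the simulated sensor saturates:

```cuda
#include <math.h>

// Pixel gray level for a star of catalogue magnitude `mag`, relative to a
// reference magnitude m0 (assumed calibration constant): flux scales as
// 10^(-0.4 * (mag - m0)), so brighter stars (smaller mag) give larger flux.
__host__ __device__ inline unsigned char grayLevel(float mag, float m0)
{
    float flux = powf(10.0f, -0.4f * (mag - m0));
    float g = 255.0f * flux;
    return (unsigned char)(g > 255.0f ? 255.0f : g);
}
```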
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU-specific optimization techniques – allows us to easily remove the performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can make the algorithms up to two times faster. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.
(R. Nath, S. Tomov and J. Dongarra: “Accelerating GPU Kernels for Dense Linear Algebra”, VECPAR 2010. [PDF])
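To make the pointer-redirecting idea concrete: when m or n is not a multiple of the blocking size, out-of-range threads clamp (redirect) their load indices to the matrix edge instead of taking a padded or branchy path, so every block runs the identical full-tile inner loop, and the redundant results are simply never written back. A simplified column-major sketch follows; the MAGMA kernels apply this within their own, much more aggressive blocking, and for brevity we assume k is a multiple of the tile size (the paper also handles the k fringe).

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for column-major A (m x k), B (k x n), C (m x n).
// Launch with block = (TILE, TILE) and grid = (ceil(n/TILE), ceil(m/TILE)).
__global__ void gemmClamped(int m, int n, int k,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float *C, int ldc)
{
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int rowC = row < m ? row : m - 1;   // redirected (clamped) indices
    int colC = col < n ? col : n - 1;
    float acc = 0.0f;
    for (int t = 0; t < k; t += TILE) { // k % TILE == 0 assumed in this sketch
        As[threadIdx.y][threadIdx.x] = A[rowC + (size_t)(t + threadIdx.x) * lda];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) + (size_t)colC * ldb];
        __syncthreads();
        for (int kk = 0; kk < TILE; ++kk)   // full tile, no bounds checks
            acc += As[threadIdx.y][kk] * Bs[kk][threadIdx.x];
        __syncthreads();
    }
    if (row < m && col < n)             // redundant work is never stored
        C[row + (size_t)col * ldc] = acc;
}
```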
We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi’s new architectural features, most notably the extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the version currently available in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with corresponding, currently available routines running on homogeneous multicore systems.
(R. Nath and S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])
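As a sanity check on the quoted efficiency fractions (our arithmetic, from the C2050's published specifications: 448 CUDA cores at 1.15 GHz, one fused multiply-add, i.e. two flops, per core per cycle in single precision, and half that rate in double precision):

```latex
P_{\mathrm{single}} = 448 \times 1.15\,\mathrm{GHz} \times 2 \approx 1030\ \mathrm{GFlop/s},
\qquad 645 / 1030 \approx 63\%,
\\
P_{\mathrm{double}} = P_{\mathrm{single}} / 2 \approx 515\ \mathrm{GFlop/s},
\qquad 300 / 515 \approx 58\%.
```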
Network intrusion detection systems are faced with the challenge of identifying diverse attacks in extremely high-speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly complex per-packet and per-flow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high-speed networks. To cope with the increased processing throughput requirements, our system parallelizes network traffic processing and analysis at three levels, using multi-queue NICs, multiple CPUs, and multiple GPUs. The proposed design avoids locking, optimizes data transfers between the different processing units, and speeds up data processing by mapping different operations to the processing units where they are best suited. Our experimental evaluation shows that our prototype implementation based on commodity off-the-shelf equipment can reach processing speeds of up to 5.2 Gbit/s with zero packet loss when analyzing traffic in a real network, whereas the pattern matching engine alone reaches speeds of up to 70 Gbit/s, which is an almost four times improvement over prior solutions that use specialized hardware.
(Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis: “MIDeA: A Multi-Parallel Intrusion Detection Architecture”, Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), Oct. 2011. [PDF])
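The pattern matching engine in architectures of this kind typically runs Aho–Corasick on the GPU, and its core is a DFA walk over each packet's payload. A minimal sketch with one thread per packet follows; the transition-table layout and all names are our assumptions, not the paper's data structures, which additionally batch packets into large buffers so PCIe transfers overlap with computation.

```cuda
#include <cuda_runtime.h>

// One thread per packet. Payloads of a whole batch are packed back to back
// in `payloads`, with packet p occupying [pktOffset[p], pktOffset[p+1]).
// dfa[state * 256 + byte] gives the next automaton state; states at or
// above firstFinal indicate that some signature has matched.
__global__ void acMatch(const unsigned char *payloads, const int *pktOffset,
                        int nPackets, const int *dfa, int firstFinal,
                        int *matched)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPackets) return;
    int state = 0, hit = 0;
    for (int i = pktOffset[p]; i < pktOffset[p + 1]; ++i) {
        state = dfa[state * 256 + payloads[i]];
        if (state >= firstFinal) { hit = 1; break; }  // signature matched
    }
    matched[p] = hit;
}
```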