This paper considers the l1-compressive sensing problem. It proposes an algorithm specifically designed to take advantage of shared memory, vectorized, parallel and many-core microprocessors such as the Cell processor, new generation Graphics Processing Units (GPUs) and standard vectorized multi-core processors (e.g. quad core CPUs). The paper also gives evidence of the efficiency of its approach and compares the algorithm on the three platforms, exhibiting pros and cons for each of them. (A Simple Compressive Sensing Algorithm for Parallel Many-Core Architectures. Alexandre Borghi, Jerome Darbon, Sylvain Peyronnet, Tony F. Chan and Stanley Osher. UCLA Computational and Applied Mathematics Technical Report. September 2008.)
This paper presents an application-oriented approach to block cipher processing on GPUs. A new block-based conventional implementation of AES on an Nvidia G80 is shown with 4-10x speed improvements over CPU implementations and 2-4x speed increase over the previous fastest AES GPU implementation. Presented also is a general purpose data structure for representing cryptographic client requests which is suitable for execution on a GPU. The issues related to the mapping of this general structure to the GPU are explored. Finally presented is the first analysis of the main encryption modes of operation on a GPU, showing the performance and behavioural implications of executing these modes under the outlined general-purpose data model. (Practical Symmetric Key Cryptography on Modern Graphics Hardware. Owen Harrison and John Waldron, 17th USENIX Security Symposium. 2008.)
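Why the choice of mode matters on a GPU can be seen in a minimal sketch. It is not the paper's implementation: `toy_encrypt_block` is a hypothetical, insecure stand-in for AES, and the counter block is formed by XORing a nonce with the block index purely to keep the example short. In counter (CTR) mode every block depends only on its own index, so blocks can be processed in parallel; in CBC mode each ciphertext block feeds the next, so encryption is serial.

```python
BLOCK = 16  # AES block size in bytes


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(block, key):
    """Hypothetical stand-in for AES (NOT secure): XOR with key, rotate one byte."""
    mixed = xor_bytes(block, key)
    return mixed[1:] + mixed[:1]


def ctr_encrypt(blocks, key, nonce):
    """CTR mode: block i uses only (nonce, i), so iterations are independent."""
    out = []
    for i, block in enumerate(blocks):
        counter = xor_bytes(nonce, i.to_bytes(BLOCK, "big"))  # simplified counter block
        out.append(xor_bytes(block, toy_encrypt_block(counter, key)))
    return out


def cbc_encrypt(blocks, key, iv):
    """CBC mode: block i needs ciphertext i-1, so encryption is a serial chain."""
    out, prev = [], iv
    for block in blocks:
        prev = toy_encrypt_block(xor_bytes(block, prev), key)
        out.append(prev)
    return out
```

Changing the first plaintext block leaves later CTR ciphertext blocks untouched but alters every later CBC block, which is the data dependency a GPU scheduler has to respect.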
The following is excerpted from an NVIDIA press release.
Installation has begun on a new computational resource at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. Lincoln will deliver peak performance of 62.3 teraflops and is designed to push the envelope in the use of heterogeneous processors for scientific computing. The system is expected to be online in October, bringing NCSA’s total computational resources to nearly 170 teraflops.
Lincoln will consist of 192 compute nodes (Dell PowerEdge 1950 III dual-socket nodes with quad-core Intel Harpertown 2.33GHz processors and 16GB of memory) and 96 NVIDIA Tesla S1070 accelerator units. Each Tesla unit provides 500 gigaflops of double-precision performance and 16GB of memory. Lincoln’s InfiniBand interconnect fabric will be linked to the interconnect fabric of Abe, the 89-teraflop cluster that is currently NCSA’s largest resource. This will enable certain applications to run across the entire complex, providing a peak “Abe Lincoln” performance of 152 teraflops.
This paper by Wojek et al. from TU Darmstadt presents a fast object class localization framework implemented on a data-parallel architecture available in recent computers. The authors' case study, an implementation of Histograms of Oriented Gradients (HOG) descriptors, shows that this programming model alone speeds up an original CPU-only implementation by a factor of 34 (with disk I/O) or 109 (processing only), making it unnecessary to use early rejection cascades that sacrifice classification performance, even in real-time conditions. Programming the Graphics Processing Unit (GPU) with recent techniques allows the method to scale to the latest hardware as well as to future improvements. (Sliding-Windows for Rapid Object Class Localization: a Parallel Technique. C. Wojek, G. Dorko, A. Schulz, B. Schiele. 30th DAGM Symposium (DAGM 2008), pp. 71-81, Munich, Germany.)
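The core of a HOG descriptor is a per-cell histogram of gradient orientations weighted by gradient magnitude. The sketch below shows only that step for one grayscale cell, assuming central differences, unsigned orientations and nine bins; the function name `cell_histogram` is illustrative, and a full HOG pipeline (bin interpolation, block normalization, the sliding-window classifier, and the paper's GPU mapping) is not shown.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg).

    `cell` is a list of rows of grayscale values; border pixels are skipped
    because central differences need both neighbours.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / bin_width) % bins] += magnitude
    return hist
```

Each pixel's contribution is independent of every other pixel's, which is what makes this step a natural fit for data-parallel hardware.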
This paper describes the design and implementation of Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google to ease the development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but can be harder to program because their architectures are designed as special-purpose co-processors and they have only recently introduced non-graphics programming interfaces. The authors developed Mars on an NVIDIA G80 GPU, which contains 128 processors, and evaluated it in comparison with Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine. Additionally, the authors propose a MapReduce framework with coprocessing between the GPU and the CPU for further performance improvement. Mars is developed by Bingsheng He (HKUST) and Wenbin Fang (HKUST) under the supervision of Naga K. Govindaraju (Microsoft Corp.), Qiong Luo (HKUST), and Tuyong Wang (Sina.com). Source code of Mars can be downloaded from the authors’ website. (A MapReduce Framework on Graphics Processors. Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. To appear in PACT 2008.)
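For readers unfamiliar with the MapReduce interface, a minimal sequential sketch follows. It is not the Mars API: `map_reduce`, `count_words_map` and `count_words_reduce` are hypothetical names, and the loop stands in for what a parallel runtime would distribute across processors.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:  # map phase: records are independent of each other
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce phase: each key is independent of the others
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words_map(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def count_words_reduce(word, counts):
    """Sum the partial counts emitted for one word."""
    return sum(counts)


if __name__ == "__main__":
    print(map_reduce(["GPU map reduce", "map on GPU"],
                     count_words_map, count_words_reduce))
```

The user supplies only the two small functions; grouping and scheduling stay inside the framework, which is the complexity a GPU implementation can hide.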
A wide class of numerical methods needs to solve a linear system in which the pattern of non-zero matrix coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. This implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0, making it attractive for applications that are content with single precision. (Concurrent number cruncher – A GPU implementation of a general sparse linear solver. Luc Buatois, Guillaume Caumon and Bruno Lévy. International Journal of Parallel, Emergent and Distributed Systems. To Appear.)
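The algorithm named above can be sketched in a few lines. The version below is a plain CPU illustration, assuming a symmetric positive-definite matrix with a non-zero diagonal stored in ordinary compressed row storage (CSR); the paper's block CRS layout, register blocking and GPU kernels are not reproduced, and the function names are illustrative.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix in CSR form; each row is an independent dot product."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[row], row_ptr[row + 1]))
        for row in range(len(row_ptr) - 1)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with conjugate gradient, preconditioned by diag(A)^-1."""
    n = len(b)
    inv_diag = [0.0] * n
    for row in range(n):  # extract the diagonal for the Jacobi preconditioner
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if col_idx[k] == row:
                inv_diag[row] = 1.0 / values[k]
    x = [0.0] * n
    r = list(b)                                  # residual b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x
```

The sparse matrix-vector product dominates each iteration, and its rows are independent, which is why the storage format has such a large effect on GPU throughput.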
This paper by Takizawa et al. at Tohoku University describes a programming framework named Stream Programming with Runtime Auto-Tuning (SPRAT) that combines a high-level programming language with runtime processor selection. Today, a commodity PC can be seen as a hybrid computing system equipped with two different kinds of processors, i.e. CPU and GPU. Because the GPU's advantage in performance and power efficiency depends strongly on the system configuration and on the data size determined at run time, a programmer cannot always know which processor should execute a given kernel. The SPRAT framework therefore selects an appropriate processor dynamically so as to improve energy efficiency. The evaluation results indicate that selecting the processor at run time, for each kernel execution and its given data streams, is promising for energy-aware computing on a hybrid computing system. (SPRAT: Runtime Processor Selection for Energy-aware Computing. Hiroyuki Takizawa, Katsuto Sato, and Hiroaki Kobayashi. To appear in Proceedings of IEEE Cluster 2008 (the 3rd international workshop on automatic performance tuning).)
GPU4Vision is a project founded by the Institute for Computer Graphics and Vision, Graz University of Technology dealing with fast computer vision algorithms for tasks like basic image processing, segmentation, motion, stereo etc. On the GPU4Vision website you can take a look at the project’s latest scientific publications, watch demo videos of algorithms and even download and evaluate some of them on your own PC. (GPU4Vision – Website)
This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee’s potential for a broad range of parallel computation.
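Binning, as used by tiled software renderers in general, assigns each primitive to the screen tiles it may touch so that each tile can later be rendered independently from memory that fits in cache. The sketch below illustrates the idea with a conservative bounding-box test; the function name, the tile size and the data layout are assumptions for illustration and do not describe Larrabee's actual pipeline.

```python
import math


def bin_triangles(triangles, width, height, tile_size):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box overlaps it.

    `triangles` is a list of three (x, y) screen-space vertices per triangle.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [vertex[0] for vertex in tri]
        ys = [vertex[1] for vertex in tri]
        # Convert the bounding box to tile coordinates and clamp it to the screen.
        x0 = max(math.floor(min(xs)) // tile_size, 0)
        x1 = min(math.floor(max(xs)) // tile_size, tiles_x - 1)
        y0 = max(math.floor(min(ys)) // tile_size, 0)
        y1 = min(math.floor(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins
```

Once the bins are built, different cores can shade different tiles without contending for the same framebuffer region.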
Big improvements in the performance of graphics processing units (GPUs) have turned them into a compelling platform for high performance computing. In this thesis, we discuss the usage of NVIDIA’s CUDA in two applications: Einstein@Home, a distributed computing software, and OpenSteer, a game-like application. Our work on Einstein@Home demonstrates that CUDA can be integrated into existing applications with minimal changes, even in programs designed without considering GPU usage. However, the existing data structure of Einstein@Home performs poorly when used on the GPU. We demonstrate that using a redesigned data structure improves the performance to about three times as fast as the original CPU version, even though the code executed on the device is not optimized. We further discuss the design of a novel spatial data structure called “dynamic grid” that is optimized for CUDA usage. We measure its performance by integrating it into the Boids scenario of OpenSteer. Our new concept outperforms a uniform grid by a margin of up to 15%, even though the dynamic grid still provides optimization potential.
(Case studies on GPU usage and data structure design. J. Breitbart, Master’s thesis, Universität Kassel, 2008.)
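The uniform grid that serves as the baseline in this comparison can be sketched as follows. This is a generic 2D illustration of a neighbour query such as a Boids simulation needs; the function names and the 2D simplification are assumptions, and the thesis's "dynamic grid" is not reproduced here because the summary does not describe its design.

```python
import math
from collections import defaultdict


def build_grid(positions, cell_size):
    """Hash every agent index into the grid cell that contains its position."""
    grid = defaultdict(list)
    for index, (x, y) in enumerate(positions):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(index)
    return grid


def neighbors(grid, positions, index, radius, cell_size):
    """Return indices of agents within `radius` of agent `index`, checking nearby cells only."""
    x, y = positions[index]
    cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
    reach = math.ceil(radius / cell_size)  # how many cells the radius can span
    found = []
    for ny in range(cy - reach, cy + reach + 1):
        for nx in range(cx - reach, cx + reach + 1):
            for other in grid.get((nx, ny), []):
                ox, oy = positions[other]
                if other != index and math.hypot(ox - x, oy - y) <= radius:
                    found.append(other)
    return sorted(found)
```

With evenly spread agents this replaces an all-pairs search with a lookup over a handful of cells; when agents cluster, a few cells hold most of them and the benefit shrinks.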