Feature detection and extraction are essential in computer vision applications such as image matching and object recognition. The Scale-Invariant Feature Transform (SIFT) algorithm is one of the most robust approaches to detect and extract distinctive invariant features from images. However, high computational complexity makes it difficult to apply the SIFT algorithm to mobile applications. Recent developments in mobile processors have enabled heterogeneous computing on mobile devices, such as smartphones and tablets. In this paper, we present an OpenCL-based implementation of the SIFT algorithm on a smartphone, taking advantage of the mobile GPU. We carefully analyze the SIFT workloads and identify the parallelism. We implemented major steps of the SIFT algorithm using both serial C++ code and OpenCL kernels targeting mobile processors, to compare the performance of different workflows. Based on the profiling results, we partition the SIFT algorithm between the CPU and GPU in a way that best exploits the parallelism and minimizes the buffer transferring time to achieve better performance. The experimental results show that we are able to achieve 8.5 FPS for keypoints detection and 19 FPS for descriptor generation without reducing the number and the quality of the keypoints. Moreover, the heterogeneous implementation can reduce energy consumption by 41% compared to an optimized CPU-only implementation.
(Guohui Wang, Blaine Rister, and Joseph R. Cavallaro: “Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013, [PDF])
In this paper, we present a high throughput and low latency LDPC (low-density parity-check) decoder implementation on GPUs (graphics processing units). The existing GPU-based LDPC decoder implementations suffer from low throughput and long latency, which prevent them from being used in practical SDR (software-defined radio) systems. To overcome this problem, we present optimization techniques for a parallel LDPC decoder including algorithm optimization, fully coalesced memory access, asynchronous data transfer and multi-stream concurrent kernel execution for modern GPU architectures. Experimental results demonstrate that the proposed LDPC decoder achieves 316Mbps (at 10 iterations) peak throughput on a single GPU. The decoding latency, which is much lower than that of the state of the art, varies from 0.207ms to 1.266ms for different throughput requirements from 62.5Mbps to 304.16Mbps. When using four GPUs concurrently, we achieve an aggregate peak throughput of 1.25Gbps (at 10 iterations).
(Guohui Wang, Michael Wu, Bei Yin, and Joseph R. Cavallaro: “High Throughput Low Latency LDPC Decoding on GPU for SDR Systems”, 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013. [PDF])
Fastvideo have released their JPEG codec for NVIDIA GPUs. Peak performance of the codec reaches 6 GBytes per second and higher for images loadedfrom host RAM. For instance, a full-color 4K image with resolution 3840 x 2160 can be compressed by 10 times in merely 6 milliseconds on NVIDIA GeForce GTX Titan. More information: http://www.fastcompression.com
All course material from the full-day CUDA tutorial at PPAM 2013 are now available at http://gpgpu.org/ppam2013. The tutorial was held on Sunday, Sep. 8 2013 in Warsaw, Poland.
The 22nd High Performance Computing Symposium (April 13-16, 2014 in Tampa, FL, USA) is devoted to the impact of high performance computing and communications on computer simulations. Topics of interest include:
- GPU for general purpose computations (GPGPU)
- Hybrid system modeling and simulation
- Tools and environments for coupling parallel codes
- Parallel algorithms and architectures
- High performance software tools
Submission deadline for full papers: October 25, 2013. More information can be found at http://www.iue.tuwien.ac.at/hpc2014.html.
This blog takes a closer look at constant cache and read-only cache. It highlights the differences between the two memory types and what circumstances they perform best in. Read the whole story here.
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK, and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (“catch-all”) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
(M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: “A unified sparse matrix data format for modern processors with wide SIMD units.” Submitted, July 2013 [preprint])
Acceleware, partnering with NVIDIA, if offering a four-day CUDA training course designed for programmers in the oil and gas industry who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms in oil and gas such as filtering and FFTs will be used and profiled in the examples. The case study on day 4 will focus on efficient implementation of 3D convolution, which is highly applicable to reverse time migration. A background in oil and gas is not necessary. Register here.
Acceleware training program manager Kelly Goss is hosting a 4 day C++ AMP training course in Boston, MA September 10 -13. The course will cover an overview of GPU computing, C++ 11 Lambda syntax, data-parallel architectures and the C++ AMP programming model. Every day consists of lectures and hands-on exercises (laptops for student’s use provided for the duration of the course). Click here for full details and registration.
The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a uniﬁed programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-speciﬁc solutions, even if only simple operations are to be performed. Also, the uniﬁed programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-speciﬁc compute kernels are still required for obtaining good performance across different hardware architectures.
We address both, the issue of programmability and portable performance, in this work: On the one hand, a high-level programming interface for linear algebra routines allows for the convenient speciﬁcation of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator for device-speciﬁc OpenCL kernels at runtime, which is supplemented by an auto-tuning framework for portable performance as well as with work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the respective vendor tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.
(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar’) 2013 [PDF].)
This webinar, scheduled for Wednesday, August 7 at 10 a.m. PDT, will cover the latest schedule for GPUDirect RDMA, scaling and optimization techniques for maximizing application performance using MVAPICH2, and the latest advancements of CUDA. Join speakers from Ohio State University, NVIDIA and Mellanox Technologies. Register by visiting www.gputechconf.com/gtcexpress.