This publication describes efficient low level algorithms for performing relational queries on parallel processors, such as NVIDIA Fermi or Kepler. Relations are stored in GPU memory as sorted arrays of tuples, and manipulated by relational operators that preserve the sorted property. Most significantly, this work introduces algorithms for JOIN and SET INTERSECTION/UNION/DIFFERENCE that can process data at over 50 GB/s.
Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations. Examples include general purpose graphics processing units (GPUs) such as NVIDIA’s Fermi, Intels Sandy Bridge, and AMD’s Fusion processors. This paper explores the mapping of primitive relational algebra operations onto GPUs. In particular, we focus on algorithms and data structure design identifying a fundamental conflict between the structure of algorithms with good computational complexity and that of algorithms with memory access patterns and instruction schedules that achieve peak machine utilization. To reconcile this conflict, our design space exploration converges on a hybrid multi-stage algorithm that devotes a small amount of the total runtime to prune input data sets using an irregular algorithm with good computational complexity. The partial results are then fed into a regular algorithm that achieves near peak machine utilization. The design process leading to the most efficient algorithm for each stage is described, detailing alternative implementations, their performance characteristics, and an explanation of why they were ultimately abandoned. The least efficient algorithm (JOIN) achieves 57% − 72% of peak machine performance depending on the density of the input. The most efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve 86% − 92% of peak machine performance across all input data sets. To the best of our knowledge, these represent the best known published results to date for any implementations. This work lays the foundation for the development of a relational database system that achieves good scalability on a Multi-Bulk-Synchronous-Parallel (M-BSP) processor architecture. Additionally, the algorithm design may offer insights into the design of parallel and distributed relational database systems. It leaves the problems of query planning, operator→query synthesis, corner case optimization, and system/OS interaction as future work that would be necessary for commercial deployment.
(Gregory Diamos, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili: “Relational Algorithms for Multi-Bulk-Synchronous Processors “, NVIDIA Tech Report, June 2012. [WWW])
Together with EuroPar-12, the 5th Workshop on UnConventional High Performance Computing 2012 (UCHPC 2012) will take place on August 27/28 at Rhodes Island, Greece. The workshop tries to capture solutions for HPC which are unconventional today but could become conventional and significant tomorrow. While GPGPU is already used a lot in HPC, there still are all kind of issues around best exploitation and productivity for the programmer. Submission deadline: June 6, 2012. For more details, see
http://www.lrr.in.tum.de/~weidendo/uchpc12. UPDATE: Submission deadline extended to June 11.
VexCL is vector expression template library for OpenCL developed by the Supercomputer Center of Russian academy of sciences. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under MIT license.
- Selection and initialization of compute devices according to extensible set of device filters.
- Transparent allocation of device vectors spanning multiple devices.
- Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices.
- Appropriate kernels for vector expressions are generated automatically first time an expression is used.
Doxygen-generated documentation is available at http://ddemidov.github.com/vexcl/index.html.
C++ Accelerated Massive Parallelism (C++ AMP) is a new open specification heterogeneous programming model, which builds on the established C++ language. Developed for heterogeneous platforms C++ AMP is designed to accelerate the execution of C++ code by taking advantage of the data-parallel hardware that is commonly present as a GPU. These courses are aimed at programmers who are looking to develop comprehensive skills in writing and optimizing applications using C++ AMP. Read the rest of this entry »
SpeedIT provides a set of accelerated solvers for sparse linear systems of equations. The library supports C/C++ and Fortran, and it can be used with OpenFOAM to accelerate CFD simulations. SpeedIT 2.1 contains two new preconditioners:
• Algebraic Multigrid with Smoothed Aggregation (AMG)
• Approximate Inverse (AINV)
OpenFOAM simulations on the GPU can be up to 3.5x faster compared to CG and DIC/DILU preconditioners on the CPU and up to 1.6x faster if you run GAMG.
See the SpeedIT website and blog for more details.
Traditional CPU-based computing environments offer a variety of binary instrumentation frameworks. Instrumentation and analysis tools for GPU environments to date have been more limited. Panoptes is a binary instrumentation framework for CUDA that targets the GPU. By exploiting the GPU to run modified kernels, computationally-intensive programs can be run at the native parallelism of the device during analysis. To demonstrate its instrumentation capabilities, we currently implement a memory addressability and validity checker that targets CUDA programs.
Panoptes traces targeted programs by library interposition at runtime. Read the rest of this entry »
NVIDIA Kepler GK110 Die Shot
This white paper describes the new Kepler GK110 Architecture from NVIDIA.
Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla® and the HPC market.
Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60‐65% on the prior Fermi architecture.
In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.
The paper describes features of the Kepler GK110 architecture, including
- Dynamic Parallelism;
- Grid Management Unit;
- NVIDIA GPUDirect™;
- New SHFL instruction and atomic instruction enhancements;
- New read-only data cache previously only accessible to texture;
- Bindless Textures;
- and much more.
This report describes advantages of using GPUs for analytical queries. It compares performance of the Alenka database engine using a GPU with the performance of Oracle on a SPARC server. More information on Alenka including source code: https://github.com/antonmks/Alenka
TunaCode has released CUVILib v1.2, a library to accelerate imaging and computer vision applications. CUVILib adds acceleration to Imaging applications from Medical, Industrial and Defense domains. It delivers very high performance and supports both CUDA and OpenCL. Modules include color operations (demosaic, conversions, correction etc), linear/non-linear filtering, feature extraction & tracking, motion estimation, image transforms and image statistics.
More information, including a free trial version: http://www.cuvilib.com/
Motivation: New high-throughput sequencing technologies have promoted the production of short reads with dramatically low unit cost. The explosive growth of short read datasets poses a challenge to the mapping of short reads to reference genomes, such as the human genome, in terms of alignment quality and execution speed.
Results: We present CUSHAW, a parallelized short read aligner based on the compute unified device architecture (CUDA) parallel programming model. We exploit CUDA-compatible graphics hardware as accelerators to achieve fast speed. Our algorithm employs a quality-aware bounded search approach based on the Burrows- Wheeler transform (BWT) and the Ferragina Manzini (FM)-index to reduce the search space and achieve high alignment quality. Performance evaluation, using simulated as well as real short read datasets, reveals that our algorithm running on one or two graphics processing units (GPUs) achieves significant speedups in terms of execution time, while yielding comparable or even better alignment quality for paired-end alignments compared to three popular BWT-based aligners: Bowtie, BWA and SOAP2. CUSHAW also delivers competitive performance in terms of SNP calling for an E.coli test dataset.
(Y. Liu, B. Schmidt, D. Maskell: “CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform”, Bioinformatics, 2012. [DOI])