Query co-processing on graphics processors (GPUs) has become an effective means to improve the performance of main memory databases. However, this co-processing requires data transfer between the main memory and the GPU memory via a low-bandwidth PCI-E bus. The overhead of such data transfer becomes an important factor, even a bottleneck, for query co-processing performance on the GPU. In this paper, we propose to use compression to alleviate this performance problem. Specifically, we implement nine lightweight compression schemes on the GPU and further study the combinations of these schemes for a better compression ratio. We design a compression planner to find the optimal combination. Our experiments demonstrate that the GPU-based compression and decompression achieved processing speeds of up to 45 GB/s and 56 GB/s, respectively. Using partial decompression, we were able to significantly improve GPU-based query co-processing performance. As a side product, we have integrated our GPU-based compression into MonetDB, an open source column-oriented DBMS, and demonstrated the feasibility of offloading compression and decompression to the GPU.
(Wenbin Fang, Bingsheng He, Qiong Luo: “Database Compression on Graphics Processors”, PVLDB/VLDB 2010. Link to PDF.)
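The nine schemes in the paper are lightweight, column-oriented encodings. As a rough illustration of the family, here is a plain-Python sketch of frame-of-reference (FOR) encoding, one classic lightweight scheme; the function names and the unpacked offset list are illustrative, not the paper's GPU implementation, which would bit-pack the offsets at a fixed width:

```python
def for_encode(values):
    """Frame-of-reference encoding: store a base value plus small offsets.

    Works well on sorted or clustered integer columns, which is why
    lightweight schemes like this suit column stores. The `bits` field is
    the per-value width a bit-packed representation would use; here the
    offsets stay as a plain list for readability.
    """
    base = min(values)
    offsets = [v - base for v in values]
    bits = max(offsets).bit_length() if max(offsets) else 1
    return base, bits, offsets

def for_decode(base, bits, offsets):
    """Invert the encoding by adding the base back to every offset."""
    return [base + o for o in offsets]
```

Because decoding each value needs only the base and its own offset, a GPU thread can decompress any element independently, which is also what makes the partial decompression mentioned in the abstract practical.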
We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively-parallel processing power of GPU to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.
(Sangjin Han, Keon Jang, KyoungSoo Park and Sue Moon: “PacketShader: A GPU-accelerated Software Router”, Proceedings of ACM SIGCOMM 2010, Delhi, India, September 2010. Project webpage. DOI)
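The IPv4 forwarding path boils down to a longest-prefix-match lookup per packet. Below is a deliberately naive sketch of that lookup over a simple (prefix, next-hop) table; PacketShader instead batches thousands of packets and uses flat lookup tables so that each GPU thread resolves one packet with constant-time memory probes. The table layout here is an assumption for illustration only:

```python
import ipaddress

def lpm_lookup(table, dst):
    """Longest-prefix match over a list of (network, next_hop) entries.

    A linear scan, purely illustrative: for each packet destination we
    keep the matching prefix with the greatest length. A real router's
    lookup structure replaces this scan with direct table indexing.
    """
    addr = ipaddress.ip_address(dst)
    best, best_len = None, -1
    for net, hop in table:
        n = ipaddress.ip_network(net)
        if addr in n and n.prefixlen > best_len:
            best, best_len = hop, n.prefixlen
    return best
```

The GPU's contribution in PacketShader is not a smarter algorithm but massive batching: many independent lookups like this run in parallel, hiding memory latency.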
With an increasing amount of data and user demands for fast query processing, the optimization of database operations continues to be a challenging task. A common optimization method is to leverage parallel hardware architectures. With the introduction of general-purpose GPU computing, massively parallel hardware has become available within commodity hardware. To efficiently exploit this technology, we introduce the method of speculative query processing. This speculative query processing works on, but is not limited to, a prefix tree structure to efficiently support heavily used database index operations. Fundamentally, our approach traverses a prefix tree structure in a speculative, parallel way instead of step by step. To show the benefits and opportunities of our novel approach, we present an exhaustive evaluation on a graphics processing unit.
(Volk, P. B.; Habich, D.; Lehner, W.: “GPU-Based Speculative Query Processing for Database Operations”. First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS’10), held in conjunction with VLDB 2010, September 2010. Link to project webpage.)
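The core idea, speculative traversal, can be caricatured in a few lines: expand every child of the current frontier first, then discard the paths that turn out to be inconsistent with the key. On a GPU the expansion step maps onto many threads; here it is a plain loop. The dict-based node layout and helper below are assumptions for illustration, not the paper's data structure:

```python
def trie_insert(root, key, value):
    """Insert a key/value pair into a dict-based trie (illustrative layout)."""
    node = root
    for c in key:
        node = node.setdefault("children", {}).setdefault(c, {})
    node["value"] = value

def speculative_search(root, key):
    """Speculatively expand all children of the frontier, then validate.

    A step-by-step traversal would follow exactly one child per level;
    here every child is visited first (the speculative, parallelizable
    step) and mismatching paths are pruned afterwards.
    """
    frontier = [(root, "")]
    for depth, ch in enumerate(key):
        # speculative step: visit all children of all frontier nodes
        expanded = [(child, path + c)
                    for node, path in frontier
                    for c, child in node.get("children", {}).items()]
        # validation step: discard paths that diverge from the key
        frontier = [(n, p) for n, p in expanded if p[depth] == ch]
    for node, path in frontier:
        if "value" in node:
            return node["value"]
    return None
```

The speculation wastes some work on paths that are later discarded, but turns a sequential pointer chase into independent, data-parallel node visits.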
A modern Graphics Processing Unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two-dimensional Ising model in order to overcome the memory limitations of a single GPU, which enables us to simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a single GPU by factors of up to 35 compared to an optimized single Central Processing Unit (CPU) core implementation which employs multi-spin coding. By combining the Compute Unified Device Architecture (CUDA) with the Message Passing Interface (MPI) on the CPU level, a single Ising lattice can be updated by a cluster of GPUs in parallel. For large systems, the computation time scales nearly linearly with the number of GPUs used. As a proof of concept we reproduce the critical temperature of the 2D Ising model using finite-size scaling techniques.
(Benjamin Block, Peter Virnau and Tobias Preis: “Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model”, Computer Physics Communications 181:9, 1549-1556, Sep. 2010. DOI Link. arXiv link)
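The checkerboard decomposition works because sites of one color have no same-color neighbours, so an entire half-lattice can be updated concurrently without race conditions. A sequential Python sketch of one Metropolis sweep follows; it omits multi-spin coding, which in the paper packs the spins of many lattices into the bits of each machine word:

```python
import math
import random

def checkerboard_sweep(spins, beta, rng=random):
    """One Metropolis sweep of the 2D Ising model (periodic boundaries).

    The lattice is split into 'black' and 'white' sublattices like a
    checkerboard; all sites of one color are mutually independent, so
    each of the two half-sweeps below could run fully in parallel on a
    GPU. spins is an n-by-n list of +1/-1 values, modified in place.
    """
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                nb = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                      + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                dE = 2 * spins[i][j] * nb   # energy cost of flipping
                if dE <= 0 or rng.random() < math.exp(-beta * dE):
                    spins[i][j] = -spins[i][j]
    return spins
```

At very low temperature (large beta) an all-up lattice is stable, since every flip raises the energy and is rejected with near certainty.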
This paper presents a complete modular approach to computing bivariate polynomial resultants on Graphics Processing Units (GPUs). Given two polynomials, the algorithm first maps them to a prime field for sufficiently many primes, and then processes each modular image individually. We evaluate each polynomial at several points and compute a set of univariate resultants for each prime in parallel on the GPU. The remaining “combine” stage of the algorithm comprising polynomial interpolation and Chinese remaindering is also executed on the graphics processor. The GPU algorithm returns coefficients of the resultant as a set of Mixed Radix (MR) digits. Finally, the large integer coefficients are recovered from the MR representation on the host machine. With the approach of displacement structure and efficient modular arithmetic we have been able to achieve more than 100x speed-up over a CPU-based resultant algorithm from Maple 13.
(Pavel Emeliyanenko: “A complete modular resultant algorithm targeted for realization on graphics hardware”, Proceedings of the 4th International Workshop on Parallel and Symbolic Computation (PASCO2010), pages 35-43, Grenoble, France, July 2010. DOI link. Direct PDF link.)
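At the heart of each modular image is a resultant computation over a prime field. One textbook way to compute it, shown here as a single-prime CPU sketch, is Gaussian elimination on the Sylvester matrix mod p; the paper's GPU algorithm instead exploits displacement structure and evaluation/interpolation, and runs many primes in parallel before recombining by Chinese remaindering:

```python
def resultant_mod_p(f, g, p):
    """Resultant of two univariate polynomials over Z_p.

    Computed as the determinant of the Sylvester matrix by Gaussian
    elimination mod p. Coefficient lists are highest-degree first, and
    p must be prime (modular inverses use Fermat's little theorem).
    """
    m, n = len(f) - 1, len(g) - 1
    size = m + n
    # Sylvester matrix: n shifted copies of f, then m shifted copies of g
    rows = [[0] * i + [c % p for c in f] + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + [c % p for c in g] + [0] * (m - 1 - i) for i in range(m)]
    det = 1
    for col in range(size):
        piv = next((r for r in range(col, size) if rows[r][col]), None)
        if piv is None:
            return 0            # singular matrix: resultant vanishes
        if piv != col:
            rows[col], rows[piv] = rows[piv], rows[col]
            det = -det          # row swap flips the determinant sign
        inv = pow(rows[col][col], p - 2, p)
        det = det * rows[col][col] % p
        for r in range(col + 1, size):
            factor = rows[r][col] * inv % p
            for c in range(col, size):
                rows[r][c] = (rows[r][c] - factor * rows[col][c]) % p
    return det % p
```

For example, x^2 - 1 and x - 1 share a root, so their resultant is 0 mod any prime, while res(x - 2, x - 3) = -1.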
The Swarm-NG package helps scientists and engineers harness the power of GPUs. In the early releases, Swarm-NG will focus on the integration of an ensemble of N-body systems evolving under Newtonian gravity. Swarm-NG does not replicate existing libraries that calculate forces for large-N systems on GPUs, but rather focuses on integrating an ensemble of many systems where N is small. This is of particular interest for astronomers who study the chaotic evolution of planetary systems. In the long term, we hope Swarm-NG will allow for the efficient parallel integration of user-defined systems of ordinary differential equations.
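The "ensemble of small systems" layout can be sketched in a few lines: each system is tiny, so the parallel axis is the ensemble, not the particle count, and a GPU would assign one thread block per system. The dictionary layout, 1D coordinates, and kick-drift-kick leapfrog below are illustrative assumptions, not Swarm-NG's API:

```python
def step_ensemble(systems, dt, accel):
    """Advance every system in an ensemble by one leapfrog step in lockstep.

    Kick-drift-kick form: half velocity kick, full position drift, half
    velocity kick. `accel(pos, mass)` returns the accelerations for a
    single system; the outer loop over systems is the axis a GPU would
    parallelize over.
    """
    for sys in systems:
        a = accel(sys["pos"], sys["mass"])
        sys["vel"] = [v + 0.5 * dt * ai for v, ai in zip(sys["vel"], a)]  # half kick
        sys["pos"] = [x + dt * v for x, v in zip(sys["pos"], sys["vel"])]  # drift
        a = accel(sys["pos"], sys["mass"])
        sys["vel"] = [v + 0.5 * dt * ai for v, ai in zip(sys["vel"], a)]  # half kick
    return systems
```

Because every system executes the identical instruction sequence, thousands of such integrations stay in lockstep with no divergence, which is exactly the regime where small-N ensembles beat large-N force libraries.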
We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU accelerated.
(Alexander Moore and Alice C. Quillen: “QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters”. preprint on arXiv, available code)
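The switching logic hinges on detecting close approaches. Below is a brute-force sketch of that detection step with a hypothetical critical radius `r_crit`; the integrator's actual switching criterion and data layout may differ, but an O(N^2) pairwise scan like this is itself cheap to parallelize on a GPU for small N:

```python
def close_pairs(positions, r_crit):
    """Flag particle pairs closer than a critical radius.

    A hybrid symplectic scheme hands exactly these pairs to the
    high-order (e.g. Hermite) integrator, while all remaining particles
    stay on the cheap symplectic map. positions is a list of coordinate
    tuples of any dimension.
    """
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            if d2 < r_crit * r_crit:
                pairs.append((i, j))
    return pairs
```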
From the paper’s abstract:
A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
From the paper’s conclusion:
We have introduced several efficient techniques to accelerate the execution of the sparse matrix vector multiplication (SMVM) on NVIDIA graphic processing units. The proposed methods increased the performance of the SMVM kernel on GT 8800 up to 18.8 times compared to the quad core CPU and 3 times compared to previous work by Bell and Garland on accelerating SMVM for GPUs.
(M. Mehri Dehnavi, D. Fernandez and D. Giannacopoulos: “Finite element sparse matrix vector multiplication on GPUs”. IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, August 2010. DOI 10.1109/TMAG.2010.2043511)
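For reference, the kernel being accelerated is a sparse matrix-vector product, typically over a compressed-row (CSR) layout. A plain-Python reference version is below; the paper's contribution is reorganizing this loop and its memory layout so GPU threads access memory coalescedly, not the arithmetic itself:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR (compressed sparse row) form.

    values:  nonzero entries, row by row
    col_idx: column index of each nonzero
    row_ptr: row_ptr[r]..row_ptr[r+1] delimits row r's nonzeros
    Each outer iteration is independent, so a GPU assigns a thread
    (or a warp) per row.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```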
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications.
This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
This paper identifies several key areas of research and open problems for optimizing the performance of data parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86. The complete implementation of Ocelot is available open-source under the new BSD license at http://code.google.com/p/gpuocelot. Ongoing work involves translating PTX to AMD’s IL allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.
(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: “Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems”. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT 2010), September 2010.)
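Conceptually, running a CUDA binary on a CPU means replaying every (block, thread) index pair of a kernel launch. The toy launcher below shows only that semantic skeleton, with a hypothetical saxpy kernel; Ocelot itself does far more (barrier-aware fusion of threads into loops, PTX-to-LLVM JIT compilation), and nothing here is its actual API:

```python
def launch_serial(kernel, grid_dim, block_dim, *args):
    """Run a CUDA-style kernel on the CPU by serializing the launch:
    iterate over every (block, thread) index pair and call the kernel
    once per logical GPU thread. Valid only for kernels without
    barriers; handling __syncthreads correctly is the hard part a real
    translator must solve."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

def saxpy_kernel(block, thread, block_dim, a, x, y, out):
    """Hypothetical data-parallel kernel: out[i] = a * x[i] + y[i]."""
    i = block * block_dim + thread      # global thread index
    if i < len(x):                      # bounds check, as in CUDA kernels
        out[i] = a * x[i] + y[i]
```

Because each logical thread touches a distinct output element, the serialized order produces the same result the parallel launch would.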
Simbios, the NIH Center for Biomedical Computation at Stanford University, is excited to announce the release of OPENMM 2.0.
OPENMM was designed to enhance the performance of almost any molecular dynamics simulation package (MD package) by allowing the code to be executed on high performance computer architectures, in particular Graphics Processing Units (GPUs). Most molecular dynamics packages can be modified to call OPENMM, resulting in significant acceleration on such high performance architectures, without changing the way users interact with the MD package.