CUDPP release 2.2 is a feature release that adds a new parallel primitive and improves some existing primitives. We have added cudppSuffixArray, a parallel skew algorithm (SA) implementation that computes the suffix array of a string. This suffix array primitive is now used in burrowsWheelerTransform, delivering better performance than CUDPP 2.1’s use of cudppStringSort. The new BWT is further used in cudppCompress, which is now faster than the original parallel compression and supports compression of text containing all possible unsigned char values. Some bugs in cudppMoveToFrontTransform and cudppStringSort have also been fixed. OS X users might also be interested in how we supported the use of OS X’s clang compiler in OS X Mavericks (10.9).
This hands-on four day course teaches how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. More details and registration: http://acceleware.com/training/986
Hybrid Fortran is an Open Source directive based extension for the Fortran language. It is a way for HPC programmers to keep writing Fortran code like they are used to – only now with GPGPU support. It achieves performance portability by allowing different storage orders and loop structures for the CPU and GPU version. All computational code stays the same as in the respective CPU version, e.g. it can be kept in a low dimensionality even when the GPU version needs to be privatised in more dimensions in order to achieve a speedup. Hybrid Fortran takes care of the necessary transformations at compile-time (so there is no runtime overhead). A (python based) preprocessor parses these annotations together with the Fortran user code structure, declarations, accessors and procedure calls, and then writes separate versions of the code – once for CPU with OpenMP parallelization and once for GPU with CUDA Fortran. More details: http://typhooncomputing.com/?p=416
The course on Antenna Synthesis (with elements of GPU computing) is organized in the framework of the European School of Antennas. The course will take place at the Partenope Conference Center of the Università di Napoli Federico II, Napoli, Italy, on October 13-17, 2014. It faces three topics corresponding to the two main aspects of Antenna Synthesis, namely external and internal synthesis, and to numerical and implementation issues on High Performance Computing (HPC) platforms of synthesis algorithms. For details about the course please see this brochure and http://www.antennasvce.org/Community/Education/Courses?id_folder=533.
This blog entry provides an introduction to GPU virtualization, reviewing the five major technology vendors and their virtualization support for CUDA.
A new book titled “Numerical Computations with GPUs” has been published:
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to reusable, adaptable and scalable code fragments. This book also serves as a GPU implementation manual for many numerical algorithms, sharing tips on GPUs that can increase application efficiency. The valuable insights into parallelization strategies for GPUs are supplemented by ready-to-use code fragments. Numerical Computations with GPUs targets professionals and researchers working in high performance computing and GPU programming. Advanced-level students focused on computer science and mathematics will also find this book useful as secondary text book or reference.
From the table of contents: Read the rest of this entry »
Partnering with NVIDIA, this four day CUDA training course, held in Houston is designed for programmers in the oil and gas industry who are looking to develop comprehensive skills in writing and optimizing applications that fully leverage the many-core processing capabilities of the GPU. Commonly used algorithms such as filtering and FFTs will be used and profiled in the examples. The case study on day 4 focuses on efficient implementation of a finite difference algorithm which is highly applicable to reverse time migration. However a background in oil and gas is not necessary. For more information and to view a copy of the course outline please visit: http://acceleware.com/training/987
A new version of the rCUDA middleware has been released (version 4.2). In addition to fix some minor bugs, the new release provides support for:
- CUDA 6.0 Runtime API
- New stream management
- cuSPARSE libraries
The rCUDA middleware allows to seamlessly use, within your cluster, GPUs that are installed in computing nodes different from the one that is executing the CUDA application, without requiring to modify your program. Please visit www.rcuda.net for more details about the rCUDA technology.
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, emote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, “On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])
We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.
(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: “Improving Cache Locality for GPU-based Volume Rendering”. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])