High-level, directive-based programming models have been rapidly gaining traction as a portable, productive means to develop application code for multicore platforms and accelerators. Because of their usability and portability, programming APIs such as OpenMP and OpenACC are increasingly being adopted as an alternative to lower-level APIs such as CUDA and OpenCL. This workshop focuses on the use of directives to program accelerators such as NVIDIA/AMD GPUs and coprocessors such as Intel’s Xeon Phi. It will also provide a forum for HPC application developers and technical managers to learn more about these high-level, directive-based programming strategies and their usage. In addition, it offers users an opportunity to discuss their experiences and express their needs with respect to such programming interfaces. More information: http://www.cacds.uh.edu/?q=ONGWorkshop
From a recent product announcement:
DeepCloud Whirlwind is an analytics-only SQL database that uses modern GPUs for accelerated SQL processing. We see a performance increase of more than 700x over a “well known” database running on the same machine. Features include:
- column based storage
- vector processing
- SSD optimized
- smart compression: ultra-fast compression and decompression on the GPU
- MySQL-like API: works with many MySQL client tools
- Oracle subset dialect
- data skipping
- zone maps
- fast schema-light data loading
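The announcement does not say how Whirlwind implements zone maps and data skipping. As a generic illustration only (not Whirlwind's code), a zone map records the minimum and maximum value of each block of a column, so a range query can skip any block whose [min, max] interval cannot contain a match. The minimal Python sketch below assumes an in-memory integer column; the block size and the names `build_zone_map` and `scan_range` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Zone:
    lo: int           # minimum value stored in this block
    hi: int           # maximum value stored in this block
    rows: List[int]   # the block's values


def build_zone_map(column: List[int], block_size: int) -> List[Zone]:
    """Split the column into fixed-size blocks and record each block's min/max."""
    zones = []
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        zones.append(Zone(min(block), max(block), block))
    return zones


def scan_range(zones: List[Zone], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values v with lo <= v <= hi, plus how many blocks were actually scanned."""
    matches: List[int] = []
    scanned = 0
    for z in zones:
        if z.hi < lo or z.lo > hi:
            continue  # data skipping: no value in this block can satisfy the predicate
        scanned += 1
        matches.extend(v for v in z.rows if lo <= v <= hi)
    return matches, scanned


zones = build_zone_map(list(range(1, 13)), block_size=4)
print(scan_range(zones, 6, 7))  # only the middle block [5..8] is read
```

The benefit depends on data layout: skipping works well when values are clustered (for example, sorted or time-ordered data) and poorly when every block spans nearly the full value range.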
Use Whirlwind database technology to get maximum database performance from significantly cheaper hardware, or go all out with a state-of-the-art system built from modern components. A beta is available now under the GPL at: http://deepcloud.co
This tutorial will begin with a brief overview of OpenCL and data parallelism before focusing on the GPU programming model. We will explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax and the work-item hierarchy. For more information and to register, visit: http://acceleware.com/event/introduction-opencl-using-amd-gpus
The 23rd High Performance Computing Symposium (April 12-15, 2015 in Alexandria, VA, USA) is devoted to the impact of high performance computing and communications on computer simulations. Topics of interest include:
- GPU for general purpose computations (GPGPU)
- Hybrid system modeling and simulation
- Tools and environments for coupling parallel codes
- Parallel algorithms and architectures
- High performance software tools
Submission deadline for full papers: November 22, 2014. More information can be found at http://hosting.cs.vt.edu/hpc2015.
CUDPP release 2.2 is a feature release that adds a new parallel primitive and improves some existing primitives. We have added cudppSuffixArray, a parallel implementation of the skew algorithm that computes the suffix array (SA) of a string. This suffix array primitive is now used in burrowsWheelerTransform, delivering better performance than CUDPP 2.1’s use of cudppStringSort. The new BWT is in turn used in cudppCompress, which is now faster than the original parallel compression and supports compression of text containing all possible unsigned char values. Some bugs in cudppMoveToFrontTransform and cudppStringSort have also been fixed. OS X users might also be interested in how we added support for OS X’s clang compiler in OS X Mavericks (10.9).
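The reason a suffix array speeds up the Burrows-Wheeler transform is that, once the suffixes of the sentinel-terminated string are sorted, the BWT is read off directly: the output character for each sorted suffix is the character just before it. The sequential Python sketch below illustrates that relationship only. It is not CUDPP's API, and it builds the suffix array with a simple comparison sort rather than the skew algorithm; the function names and the `$` sentinel are assumptions (the sentinel must sort before every character in the input).

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Naive suffix array: suffix start positions sorted by the suffix they begin.
    (The skew algorithm computes the same array in linear time.)"""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sentinel: str = "$") -> str:
    """BWT of text + sentinel, using BWT[k] = t[SA[k] - 1] (cyclically)."""
    t = text + sentinel
    sa = suffix_array(t)
    # For SA[k] == 0, t[-1] wraps around to the sentinel, matching the cyclic definition.
    return "".join(t[i - 1] for i in sa)


print(suffix_array("banana$"))          # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_suffix_array("banana"))  # annb$aa
```

Because the BWT groups characters that precede similar contexts, its output tends to contain long runs, which is what makes it a useful first stage for a compressor.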
SpeedIT FLOW is a RANS single-phase fluid flow solver that runs fully on the GPU. Benchmark results for external aerodynamic flow and other industry-relevant OpenFOAM cases on one GPU card indicate an approximately 3x faster time to solution than an Intel Xeon E5649 running 12 cores. This is about two times faster than competing solutions that offer only partial GPU acceleration. More details are available on this blog.
In the course of less than a decade, Graphics Processing Units (GPUs) have evolved from narrowly scoped, application-specific accelerators into general-purpose parallel machines capable of accommodating an ever-growing set of algorithms. At the same time, GPU programming appears to have become trapped around an attractor characterised by ad-hoc practices, non-portable implementations and inexact, uninformative performance reporting. The purpose of this paper is two-fold: on the one hand, it takes an in-depth look at GPU hardware and its characteristics; on the other, it demonstrates that portable, generic, mathematically grounded programming of these machines is possible and desirable. An agent-based meta-heuristic, the Max-Min Ant System (MMAS), provides the context. The major contributions of this article are the following: (1) an optimal, portable, generic-algorithm-based MMAS implementation is derived; (2) an in-depth analysis of AMD’s Graphics Core Next (GCN) GPU and the C++ AMP programming model is supplied; (3) a more robust approach to performance reporting is presented; (4) novel techniques for raising the abstraction level without sacrificing performance are employed. This represents the first implementation of an algorithm from the Ant Colony Optimisation (ACO) family using C++ AMP, and at the same time one of the first uses of that programming environment.
(A. Voicu: “Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP”. International Journal of Computer Applications 100(6):21-30, August 2014. [DOI])
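For readers unfamiliar with MMAS, its defining step is the pheromone update: all trails evaporate, only the best ant deposits pheromone along its tour, and every trail is then clamped to the interval [tau_min, tau_max] to limit premature convergence. The sequential Python sketch below shows that generic textbook update for a symmetric TSP. It is not the paper's C++ AMP implementation; the deposit rule 1/L_best and all parameter values are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def mmas_update(tau: Matrix, best_tour: List[int], best_length: float,
                rho: float, tau_min: float, tau_max: float) -> Matrix:
    """Generic Max-Min Ant System pheromone update (symmetric TSP), in place."""
    n = len(tau)
    # 1. Evaporation on every edge.
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1.0 - rho)
    # 2. Only the best tour deposits pheromone (closed tour, both directions).
    deposit = 1.0 / best_length
    for k in range(len(best_tour)):
        a, b = best_tour[k], best_tour[(k + 1) % len(best_tour)]
        tau[a][b] += deposit
        tau[b][a] += deposit
    # 3. Clamp every trail to [tau_min, tau_max], the "max-min" part of MMAS.
    for i in range(n):
        for j in range(n):
            tau[i][j] = min(tau_max, max(tau_min, tau[i][j]))
    return tau


tau = [[1.0] * 4 for _ in range(4)]
mmas_update(tau, best_tour=[0, 1, 2, 3], best_length=10.0,
            rho=0.5, tau_min=0.1, tau_max=0.55)
print(tau[0][1], tau[0][2])  # tour edge hits tau_max; non-tour edge only evaporates
```

The nested loops over every edge are independent of one another, which is why this update maps naturally onto data-parallel hardware such as GPUs.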
This hands-on, four-day course teaches you how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. More details and registration: http://acceleware.com/training/986
Hybrid Fortran is an open source, directive-based extension for the Fortran language. It lets HPC programmers keep writing Fortran code as they are used to, but now with GPGPU support. It achieves performance portability by allowing different storage orders and loop structures for the CPU and GPU versions. All computational code stays the same as in the respective CPU version; for example, it can be kept at a low dimensionality even when the GPU version needs to be privatised in more dimensions to achieve a speedup. Hybrid Fortran performs the necessary transformations at compile time, so there is no runtime overhead. A Python-based preprocessor parses these annotations together with the Fortran user code structure, declarations, accessors and procedure calls, and then writes two separate versions of the code: one for the CPU with OpenMP parallelization and one for the GPU with CUDA Fortran. More details: http://typhooncomputing.com/?p=416
The course on Antenna Synthesis (with elements of GPU computing) is organized in the framework of the European School of Antennas. The course will take place at the Partenope Conference Center of the Università di Napoli Federico II, Napoli, Italy, on October 13-17, 2014. It covers three topics: the two main aspects of antenna synthesis, namely external and internal synthesis, and the numerical and implementation issues of synthesis algorithms on High Performance Computing (HPC) platforms. For details about the course, please see this brochure and http://www.antennasvce.org/Community/Education/Courses?id_folder=533.