SpeedIT FLOW is a RANS single-phase fluid flow solver that runs fully on GPU. Benchmark results on external aero flow and other industry-relevant OpenFOAM cases on a GPU card indicate approximately 3x faster time to solution vs. Intel Xeon E5649 running 12 cores. This is about two times faster than competing solutions that offer only partial acceleration on GPU. More details are available on this blog.
A new book titled “Numerical Computations with GPUs” has been published:
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to reusable, adaptable and scalable code fragments. This book also serves as a GPU implementation manual for many numerical algorithms, sharing tips on GPUs that can increase application efficiency. The valuable insights into parallelization strategies for GPUs are supplemented by ready-to-use code fragments. Numerical Computations with GPUs targets professionals and researchers working in high performance computing and GPU programming. Advanced-level students focused on computer science and mathematics will also find this book useful as secondary text book or reference.
From the table of contents: Read the rest of this entry »
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:
- Windows support – full windows support for all backends (CUDA, OpenCL, OpenMP)
- Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
- Direct (dense) solvers (for very small problems)
- (Restricted) Additive Schwarz preconditioners
- MATLAB/Octave plug-in
To avoid OpenMP overhead for small sized problems, the library will compute in serial if the size of the matrix/vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified for simplified cross platform compilation.
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.6.0 version provides the following new features:
- Windows support (OpenMP backend)
- FGMRES (Flexible GMRES)
- (R)CMK (Cuthill–McKee) ordering
- Thread-core affiliation (for Host OpenMP)
- Asynchronous transfers (CUDA backend)
- Pinned memory allocation on the host when using CUDA backend
- Verbose output for debugging
- Easy to handle timing function in the examples
PARALUTION 0.6.0 is available at http://www.paralution.com.
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU and GPU. In the new 0.4.0 version, the library provides also a backend for Xeon Phi (MIC). With this new version, various performance benchmarks based on vector-vector routines, sparse matrix-vector multiplication and CG method on different backends have been released: OpenMP/CUDA/OpenCL- NVIDIA GPU, AMD GPU, CPU and Xeon Phi. More information: http://www.paralution.com/benchmarks/
Numerical Computations with GPUs, to be published by Springer, will contain a collection of articles on core numerical methods adapted for Graphics Processing Units (GPUs). Classical numerical methods (solution of linear equations, FFT, etc.) are at the core of many scientific and engineering computations. In recent years substantial efforts were undertaken to adapt these methods for recently emerged GPU-based systems. The book is envisioned as a consolidation of such work into a single volume covering widely used methods and techniques. Each chapter will provide mathematical background, parallel algorithm, and implementation details leading to reusable, adaptable, and scalable code fragments. Each chapter will be accompanied with a basic CUDA or OpenCL source code that can be used by the readers as a starting point for adaptation in their applications. The book will serve as a GPU implementation manual for many numerical algorithms providing valuable insights into parallelization strategies for GPUs as well as ready-to-use code fragments with a broad appeal to both developers and researchers interested in GPU computing.
Authors interested in contributing to this volume are asked to submit a short proposal via EasyChair (https://www.easychair.org/conferences/?conf=ncgpu14) by October 15, 2013. Authors of the accepted/invited chapters are expected to write and submit to the editor completed chapters by January 31, 2014. For more details see full solicitation (http://www.ncsa.illinois.edu/~kindr/editorial/ncgpu/solicitation.pdf) or contact the Editor at firstname.lastname@example.org.
The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a uniﬁed programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-speciﬁc solutions, even if only simple operations are to be performed. Also, the uniﬁed programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-speciﬁc compute kernels are still required for obtaining good performance across different hardware architectures.
We address both, the issue of programmability and portable performance, in this work: On the one hand, a high-level programming interface for linear algebra routines allows for the convenient speciﬁcation of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator for device-speciﬁc OpenCL kernels at runtime, which is supplemented by an auto-tuning framework for portable performance as well as with work partitioning and task scheduling for multiple devices. Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the respective vendor tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.
(Philippe Tillet, Karl Rupp, Siegfried Selberherr, Chin-Teng Lin: “Towards Performance-Portable, Scalable, and Convenient Linear Algebra”. 5th USENIX Workshop on Hot Topics in Parallelism (HotPar’) 2013 [PDF].)
Communicating data within the graphic processing unit (GPU) memory system and between the CPU and GPU are major bottlenecks in accelerating Krylov solvers on GPUs. Communication-avoiding techniques reduce the communication cost of Krylov subspace methods by computing several vectors of a Krylov subspace “at once,” using a kernel called “matrix powers.” The matrix powers kernel is implemented on a recent generation of NVIDIA GPUs and speedups of up to 5.7 times are reported for the communication-avoiding matrix powers kernel compared to the standards prase matrix vector multiplication (SpMV) implementation.
(M. Mehri Dehnavi, Y. El-Kurdi, J. Demmel and D. Giannacopoulos: “Communication-Avoiding Krylov Techniques on Graphic Processing Units”, IEEE Transactions on Magnetics 49(5):1749-1752, May 2013. [DOI])
PARALUTION is a library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to expolit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under GPL. Key features are:
- OpenMP, CUDA and OpenCL support
- No special hardware/library requirement
- Portable code and results across all hardware
- Many sparse matrix formats
- Various iterative solvers/preconditioners
- Generic and robust design
- Plug-in for the finite element package Deal.II
- Documentation: user manual (pdf), reports, doxygen
More information, including documentation and case studies, is available at http://www.paralution.com.
The paper presents techniques for generating very large finite-element matrices on a multicore workstation equipped with several graphics processing units (GPUs). To overcome the low memory size limitation of the GPUs, and at the same time to accelerate the generation process, we propose to generate the large sparse linear systems arising in finite-element analysis in an iterative manner on several GPUs and to use the graphics accelerators concurrently with CPUs performing collection and addition of the matrix fragments using a fast multithreaded procedure. The scheduling of the threads is organized in such a way that the CPU operations do not affect the performance of the process, and the GPUs are idle only when data are being transferred from GPU to CPU. This approach is verified on two workstations: the first consists of two 6-core Intel Xeon X5690 processors with two Fermi GPUs: each GPU is a GeForce GTX 590 with two graphics processors and 1.5 GB of fast RAM; the second workstation is equipped with two Tesla C2075 boards carrying 6 GB of RAM each and two 12-core Opteron 6174s. For the latter setup, we demonstrate the fast generation of sparse finite-element matrices as large as 10 million unknowns, with over 1 billion nonzero entries. Comparing with the single-threaded and multithreaded CPU implementations, the GPU-based version of the algorithm based on the ideas presented in this paper reduces the finite-element matrix-generation time in double precision by factors of 100 and 30, respectively.
(Dziekonski, A., Sypek, P., Lamecki, A. and Mrozowski, M.: “Generation of large finite-element matrices on multiple graphics processors”. International Journal on Numerical Methoths in Engineering, 2012, in press. [DOI])