This paper in the Proceedings of the Institution of Civil Engineers describes an application of GPGPU for flood risk modelling by a team based at JBA Consulting in the UK. The model described here has since been used to produce flood risk maps for several countries in Europe.
“Two-dimensional (2D) flood inundation modelling is now an important part of flood risk management practice. Research in the fields of computational hydraulics and numerical methods, allied with advances in computer technology and software design, has brought 2D models into mainstream use. Even so, the models are computationally demanding and can take a long time to run, especially for large areas and at high spatial resolutions (for instance 2 × 2 m or smaller grid cells). There is thus strong motivation to accelerate 2D model codes. This paper demonstrates the use of technology from the computer graphics industry to accelerate a 2D diffusion wave (non-inertial) floodplain model. Over the past decade the market for computer games has driven the development of very fast, relatively low-cost ‘graphical processing units’. In recent years there has been a growing interest in this high-performance graphics hardware for scientific and engineering applications. This work adapted a flood model algorithm to run on a commodity personal computer graphics card. The results of a benchmark urban flood simulation were reproduced and the model run time reduced from 18 h to 9·5 min.”
(Lamb, R., Crossley, A. and Waller, S. 2009. A fast two-dimensional floodplain inundation model. Proceedings of the Institution of Civil Engineers – Water Management, Volume 162, Issue 6, pages 363–370. DOI: 10.1680/wama.2009.162.6.363)
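The core of a diffusion-type floodplain solver is an explicit update in which water moves between neighbouring grid cells according to local gradients. The sketch below is a deliberately simplified pure-diffusion step on a 2D depth grid; it is illustrative only and omits the Manning-friction and water-surface-slope terms a real diffusion-wave model uses. The function name and parameters are hypothetical, not taken from the paper.

```python
# Minimal sketch of one explicit time step of a 2D diffusion-type
# floodplain update on a regular grid. Illustrative only: a real
# diffusion-wave model drives flow with water-surface slope and
# Manning friction rather than a constant coefficient.

def diffuse_step(depth, alpha=0.1):
    """Return a new depth grid after one explicit diffusion step.

    depth : list of lists of float (water depth per cell)
    alpha : diffusion coefficient times dt/dx^2 (keep < 0.25 for stability)
    """
    rows, cols = len(depth), len(depth[0])
    new = [row[:] for row in depth]
    for i in range(rows):
        for j in range(cols):
            # Sum fluxes from the four von Neumann neighbours; closed
            # (no-flow) boundaries are modelled by skipping neighbours
            # that fall outside the grid.
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    new[i][j] += alpha * (depth[ni][nj] - depth[i][j])
    return new
```

Because each cell update reads only its immediate neighbours from the previous time step, every cell can be computed independently, which is exactly the structure that maps well onto one GPU thread per cell.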
Cellular-level agent-based modelling is reliant on either sequential processing environments or expensive and largely unavailable PC grids. The GPU offers an alternative architecture for such systems; however, the steep learning curve associated with the GPU’s data-parallel architecture has previously limited the uptake of this emerging technology. In this paper we demonstrate a template-driven agent architecture which maps XML model specifications and C language scripting to optimised Compute Unified Device Architecture (CUDA) code for the GPU. Our work is validated through the implementation of a keratinocyte model using limited-range message communication with non-linear time simulation steps to resolve intercellular forces. The performance gain achieved over existing modelling techniques reduces simulation times from hours to seconds. This improvement in simulation performance allows us to present a real-time visualisation technique which was previously unobtainable.
(Paul Richmond, Simon Coakley, Daniela Romano (2009), Cellular Level Agent Based Modelling on the Graphics Processing Unit, (Best Student Paper) Proc. of HiBi09 – High Performance Computational Systems Biology, 14–16 October 2009, Trento, Italy)
In this paper, Takizawa et al. present a tool named CheCUDA that is designed to checkpoint CUDA applications. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks basic CUDA driver API calls in order to record GPU status changes in main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data from video memory to main memory and then disabling the CUDA runtime. At restart, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on the GPU so as to restart from the stored status. The paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for enhancing the dependability of CUDA applications but also for enabling the dynamic task scheduling of CUDA applications that is especially needed on heterogeneous GPU cluster systems. The paper also reports the timing overhead of checkpointing.
(Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi, CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, to appear in Proceedings of the Tenth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) 2009, Workshop on Ultra Performance and Dependable Acceleration Systems.)
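The hook-and-record idea behind CheCUDA can be illustrated with a small host-side sketch: wrap each state-changing "driver API" call, journal it, mirror device buffers back to the host at checkpoint time, and replay the journal on restart. All names here (`FakeDevice`, `HookedAPI`, and their methods) are hypothetical stand-ins, not the CheCUDA implementation or the real CUDA driver API.

```python
# Illustrative sketch of checkpointing by API interception: every
# state-changing call is recorded in a journal so that the device
# state can be reconstructed on a fresh device at restart.

class FakeDevice:
    """Stand-in for GPU memory: a dict of buffer-id -> bytes."""
    def __init__(self):
        self.buffers = {}

class HookedAPI:
    def __init__(self, device):
        self.device = device
        self.journal = []          # recorded state-changing calls

    def mem_alloc(self, buf_id, size):
        self.device.buffers[buf_id] = bytes(size)
        self.journal.append(("alloc", buf_id, size))

    def memcpy_htod(self, buf_id, data):
        self.device.buffers[buf_id] = data
        self.journal.append(("copy", buf_id, data))

    def checkpoint(self):
        # Copy all device buffers back to the host and snapshot the journal.
        return {"journal": list(self.journal),
                "buffers": dict(self.device.buffers)}

    @staticmethod
    def restart(snapshot, fresh_device):
        # Re-initialize against a new device and replay the recorded calls.
        api = HookedAPI(fresh_device)
        for entry in snapshot["journal"]:
            if entry[0] == "alloc":
                api.mem_alloc(entry[1], entry[2])
            else:
                api.memcpy_htod(entry[1], entry[2])
        return api
```

Because the snapshot contains everything needed to rebuild the device state, the same replay step also supports migrating a process to a different machine, as the paper notes.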
High-performance scientific computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), and PyCUDA, an open-source toolkit that supports this technique.
In introducing PyCUDA, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. It is further observed that, compared to competing techniques, the effort required to create codes using run-time code generation with PyCUDA grows more gently in response to growing needs. The concept of RTCG is simple and easily implemented using existing, robust tools. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
(Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih. PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted. http://arxiv.org/abs/0911.3456)
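The essence of run-time code generation is small: build the kernel source as a string after the program has started, once the needed specialization is known. A minimal sketch of the generation step follows; with PyCUDA the resulting string would be handed to `pycuda.compiler.SourceModule` for JIT compilation, but only the string construction is shown here so no GPU is required. The template and function name are illustrative, not PyCUDA's own.

```python
# Minimal sketch of GPU run-time code generation (RTCG): specialize
# an element-wise CUDA kernel with an expression chosen at run time.
# (Doubled braces escape literal { } for str.format.)

KERNEL_TEMPLATE = """
__global__ void elementwise(float *dest, const float *a,
                            const float *b, int n)
{{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dest[i] = {expression};
}}
"""

def make_kernel_source(expression):
    """Return CUDA source for an element-wise kernel computing `expression`."""
    return KERNEL_TEMPLATE.format(expression=expression)
```

The payoff described in the article comes from exactly this pattern: instead of writing one kernel per variant ahead of time, the high-level language generates and compiles only the variants actually requested at run time.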
There will be a special session on Computational Intelligence on Consumer Games and Graphics Hardware (CIGPU 2010) as part of the IEEE World Congress on Computational Intelligence 2010 (WCCI-2010).
Building on the success of previous CIGPU sessions and workshops, CIGPU 2010 will further explore the role that GPU technologies can play in computational intelligence (CI) research. Submissions of original research are invited on the use of parallel graphics hardware for computational intelligence. Work might involve exploring new techniques for exploiting the hardware, new algorithms to implement on the hardware, new applications for accelerated CI, new ways of making the technology available to CI researchers or the utilisation of the next generation of technologies.
“Anyone who has implemented computational intelligence techniques using any parallel graphics hardware will want to submit to this special session.”
Many graph layouts include very dense areas, making the layout difficult to understand. In this paper, we propose a technique for modifying an existing layout in order to reduce the clutter in dense areas. A physically inspired evolution process, based on a modified heat equation, is used to create an improved layout density image, making better use of available screen space. Using results from optimal mass transport problems, a warp to the improved density image is computed. The graph nodes are displaced according to the warp. The warp maintains the overall structure of the graph, thus limiting disturbances to the mental map, while reducing the clutter in dense areas of the layout. The complexity of the algorithm depends mainly on the resolution of the image visualizing the graph and is linear in the size of the graph. This allows scaling the computation according to required running times. It is demonstrated how the algorithm can be significantly accelerated using a graphics processing unit (GPU), resulting in the ability to handle large graphs in a matter of seconds. Results on several layout algorithms and applications are demonstrated.
(Yaniv Frishman, Ayellet Tal, “Uncluttering Graph Layouts Using Anisotropic Diffusion and Mass Transport”, IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 5, pp. 777-788, Sep./Oct. 2009)
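The mass-transport warp has a simple one-dimensional analogue that conveys the idea: match the cumulative distribution of the current density against a uniform target, so that nodes in dense regions are spread apart. The sketch below is this 1D CDF-matching version only; the paper solves a genuinely 2D optimal-mass-transport problem, and the function name is hypothetical.

```python
# Sketch of the 1D analogue of a density-equalizing mass-transport
# warp: cells holding more mass are stretched over proportionally
# more of the output interval, so the warped density is uniform.

def density_warp_1d(density, width=1.0):
    """Warp a 1D density histogram toward uniform density on [0, width].

    Returns the warped right-edge position of each cell: a cell's new
    right edge sits at the fraction of total mass accumulated so far.
    """
    total = sum(density)
    positions = []
    cum = 0.0
    for d in density:
        cum += d
        positions.append(width * cum / total)
    return positions
```

For example, a histogram `[3, 1]` on a unit interval warps so the dense first cell occupies three quarters of the width, while an already-uniform histogram is left unchanged, mirroring the paper's property that sparse regions are disturbed as little as possible.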
The 1.0 Beta version of OpenMM has just been released. OpenMM is a freely downloadable, high performance, extensible library that allows molecular dynamics (MD) simulations to run on high performance computer architectures, such as graphics processing units (GPUs). It currently supports NVIDIA GPUs and provides preliminary support for the new cross-platform, parallel programming standard OpenCL, which will enable it to be used on ATI GPUs.
The new release includes support for Particle Mesh Ewald and custom non-bonded interactions. In conjunction with this release, a new version of the code needed for accelerating the GROMACS molecular dynamics software using OpenMM is also available.
OpenMM is a collaborative project between Vijay Pande’s lab at Stanford University and Simbios, the National Center for Physics-based Simulation of Biological Structures at Stanford, which is supported by the National Institutes of Health. For more information on OpenMM, visit http://simtk.org/home/openmm.
Monte Carlo Simulation of Photon Migration in 3D Turbid Media Accelerated by Graphics Processing Units
November 23rd, 2009
We report a parallel Monte Carlo algorithm accelerated by graphics processing units (GPU) for modeling time-resolved photon migration in arbitrary 3D turbid media. By taking advantage of the massively parallel threads and low memory latency, this algorithm allows many photons to be simulated simultaneously on a GPU. To further improve the computational efficiency, we explored two parallel random number generators (RNG), including a floating-point-only RNG based on a chaotic lattice. An efficient scheme for boundary reflection was implemented, along with functions for time-resolved imaging. For a homogeneous semi-infinite medium, good agreement was observed between the simulation output and the analytical solution from diffusion theory. The code was implemented in the CUDA programming language and benchmarked under various parameters, such as thread number, selection of RNG, and memory access pattern. With a low-cost graphics card, this algorithm has demonstrated an acceleration ratio above 300 when using 1792 parallel threads over conventional CPU computation; the acceleration ratio drops to 75 when using atomic operations. These results render GPU-based Monte Carlo simulation a practical solution for data analysis in a wide range of diffuse optical imaging applications, such as human brain or small-animal imaging.
(Qianqian Fang and David A. Boas, “Monte Carlo Simulation of Photon Migration in 3D Turbid Media Accelerated by Graphics Processing Units,” Opt. Express, vol. 17, issue 22, pp. 20178–20190 (2009), doi:10.1364/OE.17.020178. A free software package, Monte Carlo eXtreme (MCX), is also available at http://mcx.sourceforge.net.)
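The inner loop of a photon-migration Monte Carlo code is compact, which is what makes it such a good fit for one-photon-per-thread GPU execution. Below is a deliberately tiny CPU-only sketch of that loop: exponential free paths, isotropic scattering, and absorption handled by weight attenuation. Parameter names (`mua`, `mus`) follow the usual optics convention for absorption and scattering coefficients, but the functions themselves are illustrative, not the MCX implementation.

```python
import math
import random

# Minimal photon random-walk sketch: each photon takes exponentially
# distributed steps, scatters isotropically, and loses weight to
# absorption until the weight falls below a cutoff.

def run_photon(rng, mua=0.1, mus=10.0, weight_cutoff=1e-4):
    """Propagate one photon; return the total path length travelled."""
    weight, path = 1.0, 0.0
    x = y = z = 0.0
    while weight > weight_cutoff:
        # Exponential free path with mean 1/mus (1 - random() avoids log(0)).
        step = -math.log(1.0 - rng.random()) / mus
        cos_t = 2.0 * rng.random() - 1.0          # isotropic scattering
        phi = 2.0 * math.pi * rng.random()
        sin_t = math.sqrt(1.0 - cos_t * cos_t)
        x += step * sin_t * math.cos(phi)
        y += step * sin_t * math.sin(phi)
        z += step * cos_t
        path += step
        weight *= math.exp(-mua * step)           # absorption
    return path

def mean_path(n_photons, seed=1):
    """Average total path length over n_photons independent photons."""
    rng = random.Random(seed)
    return sum(run_photon(rng) for _ in range(n_photons)) / n_photons
```

On a GPU each thread runs this loop for its own photon with an independent RNG stream, which is why the choice of parallel random number generator features so prominently in the paper's benchmarks.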
In this work we describe a GPU implementation of an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors’ positions and orientations. This carries a very high computational cost in the so-called nearest-neighbor search. By leveraging the GPU’s processing power and the CUDA programming model, we implement an efficient framework that permits simulating the collective motion of high-density individual groups. In particular, we present as a case study a simulation of the motion of millions of fish. We describe our implementation and present extensive experiments which demonstrate the effectiveness of our GPU implementation.
(Ugo Erra, Bernardino Frola, Vittorio Scarano, Iain Couzin, An efficient GPU implementation for large scale individual-based simulation of collective behavior. Proceedings of High Performance Computational Systems Biology (HiBi09). October 14–16, 2009, Trento, Italy.)
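The nearest-neighbor search that dominates the cost of such models is typically attacked with a uniform grid (spatial hash): bin agents into cells whose side equals the interaction radius, then scan only the surrounding cells instead of all agents. The 2D sketch below shows the idea in pure Python (the schooling model itself is 3D, where 27 cells are scanned instead of 9); function names are illustrative, not from the paper.

```python
from collections import defaultdict

# Uniform-grid neighbour search: O(1) candidate cells per query
# instead of an O(n) scan over all agents.

def build_grid(positions, radius):
    """Bin 2D positions into square cells with side `radius`."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(int(x // radius), int(y // radius))].append(idx)
    return grid

def neighbours(positions, grid, radius, i):
    """Indices of agents within `radius` of agent i (excluding i)."""
    x, y = positions[i]
    cx, cy = int(x // radius), int(y // radius)
    result = []
    # Only the 3x3 block of cells around agent i can contain neighbours.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), []):
                if j != i:
                    px, py = positions[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                        result.append(j)
    return result
```

On the GPU the same structure is usually built by sorting agents by cell index each frame, so that every thread can find its agent's candidate neighbours with a handful of coalesced memory reads.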
General-purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA’s CUDA, OpenCL, and Intel’s Ct. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design.
This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications, including the full NVIDIA CUDA SDK and UIUC’s Parboil Benchmark Suite, covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full-function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture) – a machine model and low-level virtual ISA that is representative of ISAs for data-parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification, and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch re-convergence and the prevalence of sharing between threads, and highlight opportunities for additional parallelism.
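One representative metric of the kind such a characterization study computes is the SIMD activity factor of a warp: the average fraction of threads active per dynamic instruction, which drops below 1.0 under branch divergence. The sketch below computes it from a hypothetical per-instruction active-mask trace, such as an emulator might record; the function and input format are assumptions for illustration, not the paper's tool.

```python
# SIMD activity factor: average fraction of warp threads active per
# dynamic instruction. 1.0 means no divergence; lower values indicate
# wasted SIMD lanes due to divergent branches.

def activity_factor(active_masks, warp_size=32):
    """active_masks: list of ints, each a bitmask of active threads
    for one dynamic instruction of one warp."""
    if not active_masks:
        return 0.0
    active = sum(bin(m).count("1") for m in active_masks)
    return active / (warp_size * len(active_masks))
```

Aggregating a metric like this across all warps and kernels is what lets a study quantify, as above, how much performance an optimization such as branch re-convergence can recover.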