Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton’s equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes, such as anisotropic nanoparticles, grains, molecules, and rigid proteins, to be modeled. Rigid body constraints are added to the GPU-accelerated MD package HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two case studies, patchy spheres and tethered nanorods, demonstrate the performance attained. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously possible on large clusters.
(Trung Dac Nguyen, Carolyn L. Phillips, Joshua A. Anderson, and Sharon C. Glotzer: “Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units”, Computer Physics Communications 182(11):2307–2313, November 2011. [DOI])
Brownian Dynamics (BD), also known as Langevin Dynamics, and Dissipative Particle Dynamics (DPD) are implicit solvent methods commonly used in models of soft matter and biomolecular systems. The interaction of the numerous solvent particles with larger particles is coarse-grained by applying a Langevin thermostat to individual particles or to particle pairs. The Langevin thermostat requires a pseudo-random number generator (PRNG) to generate the stochastic force applied to each particle or pair of neighboring particles during each time step in the integration of Newton’s equations of motion. In a Single-Instruction-Multiple-Thread (SIMT) GPU parallel computing environment, small batches of random numbers must be generated over thousands of threads and millions of kernel calls. In this communication we introduce a one-PRNG-per-kernel-call-per-thread scheme, in which a micro-stream of pseudorandom numbers is generated in each thread and kernel call. These high-quality, statistically robust micro-streams require no global memory for state storage, are more computationally efficient than other PRNG schemes in memory-bound kernels, and uniquely enable the DPD simulation method without requiring communication between threads.
(Carolyn L. Phillips, Joshua A. Anderson and Sharon C. Glotzer: “Pseudo-random number generation for Brownian Dynamics and Dissipative Particle Dynamics simulations on GPU devices”, Journal of Computational Physics 230(19):7191–7201, August 2011. [DOI])
GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision delivers up to a 7× speedup over the latest NVIDIA CUBLAS 4.0 library on the GTX 280 GPU. Our SYMV kernel tuned for the Fermi C2050 is 4.5× faster than CUBLAS 4.0 in single precision and 2× faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.
(R. Nath, S. Tomov, T. Dong, and J. Dongarra, “Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs”, accepted for SC’11. [WWW] [PDF])
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single-node and distributed multiple-node versions. Our 8-GPU performance is comparable with the performance of a 256-GPU version of the FMM that won the 2009 Gordon Bell prize.
(Qi Hu, Nail A. Gumerov and Ramani Duraiswami: “Scalable fast multipole methods on distributed heterogeneous architectures”, accepted for SC’11. [PDF])
It is increasingly easy to develop software that exploits Graphics Processing Units (GPUs). The molecular dynamics simulation community has embraced this recent opportunity. Herein, we outline the current approaches that exploit this technology. In the context of biomolecular simulations, we discuss some of the algorithms that have been implemented and some of the aspects that distinguish the GPU from previous parallel environments. The ubiquity of GPUs and the ingenuity of the simulation community augur well for the scale and scope of future computational studies of biomolecules.
(Baker, J. A. and Hirst, J. D.: “Molecular Dynamics Simulations Using Graphics Processing Units”, Molecular Informatics 30:498–504, 2011. [DOI])
CUDPP release 2.0 is a major new release of the CUDA Data-Parallel Primitives Library, with exciting new features. The public interface has undergone a minor redesign to provide thread safety. Parallel reductions (cudppReduce) and a tridiagonal system solver (cudppTridiagonal) have been added, and a new component library, cudpp_hash, provides fast data-parallel hash table functionality. In addition, support for 64-bit data types (double as well as long long and unsigned long long) has been added to all CUDPP algorithms, and a variety of bugs have been fixed. For a complete list of changes, see the change log. CUDPP 2.0 is available for download now.
A special session on the use of heterogeneous computing for water resources will be held as part of The XIX International Conference on Computational Methods in Water Resources, July 17-21 2012 at the University of Illinois at Urbana-Champaign. Submissions are due October 1st. Topics include, but are not limited to:
- novel applications of heterogeneous computing resources,
- computational efficiency and performance assessment, and
- accuracy, verification and validation.
This session is focused on the use of heterogeneous computing resources (i.e. the combination of multi-core CPUs and many-core GPUs) for water resources. Over the last ten years, the use of GPUs for computation has gone from academic proof-of-concepts to industrially viable applications, showing speed-ups of 5-50 times over traditional approaches. Speed is of the utmost importance for many applications in water resources, making the use of heterogeneous computing attractive. In this session, we seek presentations of the state of the art in heterogeneous computing for applications in water resources.
Odeint is a high-level C++ library for solving ordinary differential equations. It is released under an open-source license and supports a variety of different methods for solving ODEs. As a special feature, it supports different algebras that perform the basic mathematical operations, which allows the user to solve ordinary differential equations on modern graphics cards. A Thrust interface is implemented, so that the power of CUDA can easily be employed. Furthermore, arbitrary precision types can easily be supported.
EvoPAR 2012 (Malaga, Spain, 11-13 April 2012) will gather scientists, engineers and practitioners to share and exchange their experiences, discuss challenges, and report state-of-the-art and in-progress research on all aspects of the application of evolutionary algorithms to improving parallel architectures and distributed computing infrastructures, as well as the implementation of parallel and distributed evolutionary algorithms.
Submissions are invited (by Nov. 30) on (but not limited to) the following topics:
- Optimization of parallel architectures by means of Evolutionary Algorithms.
- Hardware implementation of EAs, including Field Programmable Gate Arrays (FPGA), GPU, games consoles, mobile devices.
- GPGPU optimisation (CUDA, AMD, ARM, OpenCL, etc.).
Implementing flexible software solutions such as rendering and ray tracing remains challenging for GPU programs: the amount of memory available on modern GPUs is relatively small, while scenes for feature-film rendering and visualization have large geometric complexity and can easily contain millions of polygons and a large number of texture maps and other data attributes. CentiLeo is an interactive out-of-core ray tracing engine running on a single desktop GPU. The system is built around a virtual memory manager: a novel ray intersection algorithm, built around an acceleration structure cached on the GPU, loads the data it needs on demand using page swapping. The ray tracing engine is used to implement a variety of rendering and light transport algorithms. The system is implemented in CUDA and runs on a single NVIDIA GTX 480.