This new report covers all the performance improvements in the latest CUDA Toolkit 3.2 release, and compares CUDA parallel math library performance vs. commonly used CPU libraries.
Learn about the performance advantages of using the CUDA parallel math libraries for FFT, BLAS, sparse matrix operations, and random number generation.
We implemented a GPU-based parallel code to perform Monte Carlo simulations of the two-dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns independent random number generators to each thread (one thread per spin). The implementation can simulate systems of up to ~10^9 spins with an average time per spin flip of 0.147 ns on the fastest GPU card tested, a speedup of up to 155x compared with an optimized serial code running on a standard CPU. The possibility of performing high-speed simulations at large enough system sizes allowed us to provide positive numerical evidence for the existence of metastability in very large systems based on Binder’s criterion, namely, on whether specific heat singularities exist at spinodal temperatures different from the transition temperature.
(Ezequiel E. Ferrero, Juan Pablo De Francesco, Nicolás Wolovick and Sergio A. Cannas: “q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms”. [arXiv:1101.0876] [code and additional information])
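As a rough illustration of the checkerboard update scheme described in the abstract, here is a minimal serial sketch in Python. It is not the authors' GPU code: the function names, the Metropolis acceptance rule, and the single shared random number generator are assumptions made for this example (the paper assigns an independent generator to each thread). The lattice side must be even so that the two colours stay consistent under periodic boundaries.

```python
import math
import random


def potts_energy(spins):
    """Energy E = -(number of equal nearest-neighbour pairs), periodic boundaries."""
    n = len(spins)
    e = 0
    for i in range(n):
        for j in range(n):
            # Count each bond once: the one below and the one to the right.
            e -= (spins[i][j] == spins[(i + 1) % n][j])
            e -= (spins[i][j] == spins[i][(j + 1) % n])
    return e


def checkerboard_sweep(spins, q, beta, rng):
    """One Metropolis sweep over an n x n lattice (n even), updated in place.

    Sites are visited in two passes by parity of i + j. Sites of the same
    colour are never neighbours, so each pass could be done in parallel
    (one thread per spin on a GPU); here the passes simply run serially.
    """
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                old = spins[i][j]
                new = rng.randrange(q)  # proposed state, hypothetical choice
                neighbors = (
                    spins[(i - 1) % n][j],
                    spins[(i + 1) % n][j],
                    spins[i][(j - 1) % n],
                    spins[i][(j + 1) % n],
                )
                # Energy change if the site switches from `old` to `new`.
                delta_e = sum(s == old for s in neighbors) - sum(
                    s == new for s in neighbors
                )
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = new
    return spins
```

For example, `checkerboard_sweep([[0] * 8 for _ in range(8)], q=3, beta=0.5, rng=random.Random(0))` performs one sweep of an 8 x 8 three-state lattice starting from the ordered state.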
This meeting is organized by Toby Breckon & Stuart Barnes (Cranfield University) and the British Machine Vision Association and Society for Pattern Recognition. It will be held in London, UK, on 18 May 2011. The CfP poster is available at http://www.cranfield.ac.uk/~toby.breckon/events/bmva_symp_gpu11.pdf.
Tina’s Random Number Generator Library (TRNG) version 4.11 has been released. TRNG is a state-of-the-art open-source C++ pseudo-random number generator library for sequential and parallel Monte Carlo simulations. Its design principles are based on a proposal for an extensible random number generator facility that will be part of the forthcoming revision of the ISO C++ standard. The TRNG library features an object-oriented design, is easy to use and has been optimized for speed. Its implementation does not depend on any communication library or hardware architecture. TRNG is suited for shared memory as well as distributed memory computers and may be used in various parallel programming environments, e.g. the Message Passing Interface (MPI) standard or OpenMP. A notable new feature of release 4.11 is support for CUDA. All generators implemented by TRNG have been subjected to thorough statistical tests in sequential and parallel setups. Download and further information: http://trng.berlios.de/
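To make the idea of random number streams for parallel Monte Carlo concrete, here is a minimal Python sketch of leapfrogging, one common generic way to split a single sequence among workers. It does not use or reproduce TRNG's C++ API; the function names are hypothetical and the linear congruential parameters in the usage line are purely illustrative.

```python
from itertools import islice


def lcg_stream(seed, a, c, m):
    """Base sequence x_{k+1} = (a * x_k + c) mod m, starting after `seed`."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def leapfrog_stream(seed, a, c, m, rank, size):
    """Yield elements rank, rank + size, rank + 2*size, ... of lcg_stream.

    Each of `size` workers calls this with its own `rank`, so the workers
    draw disjoint, interleaved subsequences of the same base sequence.
    """
    # Advance to the first element owned by this worker.
    x = seed
    for _ in range(rank + 1):
        x = (a * x + c) % m
    # Compose the update `size` times into one step x -> a_k * x + c_k.
    a_k, c_k = 1, 0
    for _ in range(size):
        a_k, c_k = (a * a_k) % m, (a * c_k + c) % m
    while True:
        yield x
        x = (a_k * x + c_k) % m
```

A worker with rank 1 out of 3 would, for instance, draw its first four numbers with `list(islice(leapfrog_stream(42, 1664525, 1013904223, 2**32, 1, 3), 4))`. Interleaving the outputs of all three workers reproduces the base sequence, which is what makes results independent of how the work is distributed.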
The fourth international workshop and tutorial on Computational Intelligence on Consumer Games and Graphics Hardware (CIGPU 2011) will be held as part of the GECCO-2011 conference in Dublin, 12-16 July 2011. Submissions are invited in (but not limited to) the following areas:
- Parallel genetic programming (GP) on GPU
- Parallel genetic algorithms (GA) on GPU
- Parallel evolutionary programming (EP) on GPU
- Associated or hybrid computational intelligence techniques on GPU
- Particle Swarm Optimisation (PSO)
- Ant colony
- Parallel search algorithms
- Data mining
- Differential Evolution on GPU
- Computational Biology or Bioinformatics on GPU
- Evolutionary computation on video game platforms
- Evolutionary computation on mobile devices
See: http://www.sigevo.org/gecco-2011/workshops.html#cigpu and http://www.cs.ucl.ac.uk/staff/W.Langdon/cigpu/ for more information.
Although trivial background subtraction (BGS) algorithms (e.g. frame differencing, running average…) can perform quite fast, they are not robust enough to be used in various computer vision problems. Some complex algorithms usually give better results, but are too slow to be applied to real-time systems. We propose an improved version of the Extended Gaussian mixture model that utilizes the computational power of Graphics Processing Units (GPUs) to achieve real-time performance. Experiments show that our implementation running on a low-end GeForce 9600GT GPU provides at least 10x speedup. The frame rate is greater than 50 frames per second (fps) for most of the tests, even on HD video formats.
(Vu Pham, Phong Vo, Vu Thanh Hung and Le Hoai Bac: “GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction”. IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2010. [DOI] [code and additional information])
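For context, the "running average" baseline that the abstract calls trivial can be sketched in a few lines of Python. This is not the Extended Gaussian mixture model from the paper; the function name, the blending factor `alpha` and the `threshold` value are assumptions chosen for illustration. Each pixel is processed independently of the others, which is the property that makes this family of methods map well onto GPU threads.

```python
def running_average_masks(frames, alpha=0.05, threshold=30):
    """Return one foreground mask per frame after the first.

    `frames` is a list of grayscale images, each a list of rows of numbers.
    The first frame initialises the background model.
    """
    background = [[float(p) for p in row] for row in frames[0]]
    masks = []
    for frame in frames[1:]:
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, pixel in enumerate(row):
                # A pixel is foreground if it differs enough from the model.
                mask_row.append(abs(pixel - background[y][x]) > threshold)
                # Blend the new observation into the background model.
                background[y][x] = (1 - alpha) * background[y][x] + alpha * pixel
            mask.append(mask_row)
        masks.append(mask)
    return masks
```

Calling `running_average_masks([[[10, 10], [10, 10]], [[10, 200], [10, 10]]])` flags only the pixel that jumped from 10 to 200 as foreground.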
Almost all the presentations from the recent UK GPU Computing Conference, held on December 13-14, 2010 in Cambridge, are now available at http://www.many-core.group.cam.ac.uk/ukgpucc2/programme.shtml. Over 100 delegates saw a varied mix of talks from both industry and academia over the two-day meeting.
We present a fast GPU-based streaming algorithm to perform collision queries between deformable models. Our approach is based on hierarchical culling and reduces the computation to generating different streams. We present a novel stream registration method to compact the streams and efficiently compute the potentially colliding pairs of primitives. We also use a deferred front tracking method to lower the memory overhead. The overall algorithm has been implemented on different GPUs and we have evaluated its performance on non-rigid and deformable simulations. We highlight our speedups over prior GPU-based and CPU-based algorithms. In practice, our algorithm can perform inter-object and intra-object computations on models composed of hundreds of thousands of triangles in tens of milliseconds.
(Min Tang, Dinesh Manocha, Jiang Lin and Ruofeng Tong: “Collision-Streams: Fast GPU-based Collision Detection for Deformable Models”. Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (i3D 2011), San Francisco, CA, Feb. 18-20, 2011. http://gamma.cs.unc.edu/CSTREAMS)
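The abstract mentions culling and compacting streams of candidate pairs. As a generic illustration only, the Python sketch below filters candidate box pairs with an overlap test and compacts the survivors using an exclusive prefix sum, a standard data-parallel pattern. It is not the paper's stream registration method; the function names and the 2D box layout `(min_x, min_y, max_x, max_y)` are assumptions for this example.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def exclusive_scan(flags):
    """Return (offsets, total) where offsets[i] is the sum of flags[:i]."""
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    return offsets, total


def compact_stream(items, keep):
    """Keep the items for which keep(item) is true, preserving order.

    Flagging, scanning and scattering are three separate passes, mirroring
    how compaction is usually organised on a GPU; here they run serially.
    """
    flags = [1 if keep(item) else 0 for item in items]
    offsets, count = exclusive_scan(flags)
    out = [None] * count
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = item  # each survivor writes to its own slot
    return out
```

Given a list `pairs` of `(box_a, box_b)` tuples, `compact_stream(pairs, lambda p: boxes_overlap(*p))` returns only the potentially colliding pairs.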
The MOSIX group announces the release of the MOSIX Virtual OpenCL (VCL) cluster platform version 1.0, which allows OpenCL applications to transparently utilize many GPU devices in clusters. In the VCL run-time environment, all the cluster devices are seen as if they are located in each hosting-node. Applications need not be aware which nodes and devices are available and where the devices are located. VCL benefits OpenCL applications that can use multiple devices concurrently.
The CPU has traditionally been the computational workhorse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, as more CPUs, and even clusters, are required to match their performance. Even though these accelerators are powerful in terms of floating point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the use of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation, whilst running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.
(André R. Brodtkorb, “Scientific Computing on Heterogeneous Architectures”, Ph.D. thesis, University of Oslo, Faculty of Mathematics and Natural Sciences, 2010, (PDF))