The main contribution of this thesis is to demonstrate that graphics processors (GPUs) as representatives of emerging many-core architectures are very well-suited for the fast and accurate solution of large sparse linear systems of equations, using parallel multigrid methods on heterogeneous compute clusters. Such systems arise for instance in the discretisation of (elliptic) partial differential equations with finite elements. We report on at least one order of magnitude speedup over highly-tuned conventional CPU implementations, without sacrificing either accuracy or functionality. In more detail, this thesis includes the following contributions:
AMD is offering an introductory tutorial to OpenCL™ that will be held alongside the 2010 Symposium on Application Accelerators in High Performance Computing (SAAHPC’10). The tutorial is a “programmer’s introduction” which covers the ideas behind OpenCL™ and their translation to source code.
In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an NVIDIA GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the NVIDIA GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
(Rajesh Bordawekar and Uday Bondhugula and Ravi Rao: “Believe It or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!”. Technical Report RC24982, IBM Thomas J. Watson Research Center, Apr. 2010.)
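To make the O(n^4) cost mentioned in the abstract concrete, here is a minimal sketch of a brute-force 2D cross-correlation in plain Python. It is not the authors' implementation: the exact formulation used in the report is not given here, and the function names `cross_correlate` and `best_shift` are hypothetical. For an n×n image there are on the order of n^2 shifts, and each shift costs up to n^2 multiply-adds, which is where the fourth power comes from.

```python
def cross_correlate(image, reference):
    """Brute-force cross-correlation of two equally sized 2D lists.

    Entry [dy + rows - 1][dx + cols - 1] of the result holds the dot
    product of `reference` with `image` shifted by (dy, dx), taken over
    the overlapping region only (illustrative sketch, not tuned code).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for dy in range(-(rows - 1), rows):          # ~2n vertical shifts
        for dx in range(-(cols - 1), cols):      # ~2n horizontal shifts
            acc = 0.0
            # Restrict y and x so that both image[y + dy][x + dx]
            # and reference[y][x] are inside the arrays.
            for y in range(max(0, -dy), min(rows, rows - dy)):
                for x in range(max(0, -dx), min(cols, cols - dx)):
                    acc += image[y + dy][x + dx] * reference[y][x]
            out[dy + rows - 1][dx + cols - 1] = acc
    return out


def best_shift(image, reference):
    """Return the (dy, dx) shift with the largest correlation value."""
    rows, cols = len(image), len(image[0])
    corr = cross_correlate(image, reference)
    best = max(
        ((corr[i][j], i, j) for i in range(len(corr)) for j in range(len(corr[0]))),
        key=lambda t: t[0],
    )
    return best[1] - (rows - 1), best[2] - (cols - 1)
```

The two inner loops are independent dot products, which is the part that maps onto SSE/VSX lanes or GPU threads; the two outer loops give the task-level parallelism the abstract refers to.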
Advanced Micro Devices (AMD) recently released ATI Stream Profiler version 1.3. ATI Stream Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your OpenCL™ application runs. This information can then be used by developers to discover where the bottlenecks are in their OpenCL™ application and find ways to optimize their application’s performance.
Features of the tool include:
Measure the execution time of an OpenCL kernel
Query the hardware performance counters on ATI Radeon graphics cards
Display the memory traffic to and from the GPU
Compare multiple runs (sessions) of the same or different programs
Store the profile data for each run in a CSV file
Display the IL and ISA (hardware disassembly) code of the OpenCL kernel
Abstracts due: 24 September 2010
Papers due: 1 October 2010
Anchorage, home to moose, bears, birds and whales, is strategically located at almost equal flying distance from Europe, Asia and the Eastern USA. Embraced by six mountain ranges, with views of Mount McKinley in Denali National Park, and warmed by a maritime climate, the area offers year-round adventure, recreation, and sporting events. It is a fitting destination for IPDPS to mark a quarter century of tracking developments in computer science. IPDPS serves as a forum for engineers and scientists from around the world to present their latest research findings in the fields of parallel processing and distributed computing. The five-day program will follow the usual format of contributed papers, invited speakers, and panels midweek, framed by workshops held on the first and last days. To celebrate the 25th year of IPDPS, plan to come early and stay late and also enjoy a modern city surrounded by spectacular wilderness. For updates on IPDPS 2011, visit the Web at www.ipdps.org.
The GPU Technology Conference (GTC 2010) will be held Sept. 20-23, 2010 in San Jose, Calif. Developers, researchers, scientists and entrepreneurs are invited to submit proposals on GPU-related topics. See www.nvidia.com/gtc.
GPU Developers Summit: Session Topics deadline: June 1, 2010
Emerging Companies Summit: “CEO on Stage” Nominations deadline: August 1, 2010
NVIDIA Research Summit: Posters deadline: August 15, 2010
To submit a proposal, you will be asked to set up a GTC 2010 account so you can track the status of your submission.
Random numbers are extensively used on the GPU. As more computation is ported to the GPU, it can no longer be treated as rendering hardware alone. Random number generators (RNGs) are expected to cater to general-purpose and graphics applications alike. Such diversity adds to the expected requirements of an RNG. A good GPU RNG should be able to provide repeatability, random access, multiple independent streams, speed, and random numbers free from detectable statistical bias. A specific application may require some if not all of the above characteristics at one time. In particular, we hypothesize that not all algorithms need the highest-quality random numbers, so a good GPU RNG should provide a speed-quality tradeoff that can be tuned for fast, low-quality or slower, high-quality random numbers.
We propose that the Tiny Encryption Algorithm satisfies all of the requirements of a good GPU Pseudo Random Number Generator. We compare our technique against previous approaches, and present an evaluation using standard randomness test suites as well as Perlin noise and a Monte-Carlo shadow algorithm. We show that the quality of random number generation directly affects the quality of the noise produced; however, good quality noise can still be produced with a lower quality random number generator.
(Fahad Zafar, Aaron Curtis and Marc Olano, “GPU Random Numbers via the Tiny Encryption Algorithm”, HPG 2010: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on High Performance Graphics, Saarbrücken, Germany, June 2010. Link to preprint.)
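As a rough illustration of why a block cipher such as TEA fits the requirements listed above, the following Python sketch uses the standard TEA encryption rounds as a stateless hash. This is only a sketch under stated assumptions, not the authors' GPU code: the key words in `KEY`, the helper `random_float`, and the choice to feed an element index and a stream id as the two input words are all hypothetical. The round count is exposed as a parameter on the assumption that fewer rounds trade quality for speed, in the spirit of the tradeoff the abstract describes.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # standard TEA round constant

# Hypothetical fixed key for this sketch; any four 32-bit words would work.
KEY = (0xA341316C, 0xC8013EA4, 0xAD90777D, 0x7E95761E)


def tea_encrypt(v0, v1, key=KEY, rounds=32):
    """Run `rounds` TEA cycles on the 64-bit block (v0, v1).

    Python integers are unbounded, so results are masked to 32 bits to
    mimic the unsigned wrap-around arithmetic of the C reference code.
    """
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(rounds):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def random_float(index, stream=0, rounds=32):
    """Map (index, stream) to a float in [0, 1) without any stored state.

    The same inputs always give the same value (repeatability), any
    index can be evaluated directly (random access), and different
    stream ids select different sequences.
    """
    v0, _ = tea_encrypt(index & MASK32, stream & MASK32, rounds=rounds)
    return v0 / 2**32
```

Because no generator state is carried between calls, each GPU thread could in principle evaluate such a function on its own thread or pixel index; for example, `random_float(i, stream=7, rounds=8)` would be a cheaper, lower-quality variant of `random_float(i, stream=7)`.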
HOOMD-blue stands for Highly Optimized Object-oriented Many-particle Dynamics — Blue Edition. It performs general-purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to dozens of processor cores on a fast cluster.
HOOMD-blue 0.9.0 is a major new release. Highlights include:
Support for Fermi generation GPUs
Performance enhancements
New pair potentials
Particle data is now accessible from hoomd scripts
Binary format dump files for simulation restarts
Numerous small enhancements to enable easily restartable jobs
2D simulations are now possible
Integration methods can now be applied to specified groups of particles
We consider three high-resolution schemes for computing shallow-water waves as described by the Saint-Venant system and discuss how to develop highly efficient implementations using graphics processing units (GPUs). The schemes are well-balanced for lake-at-rest problems, handle dry states, and support linear friction models. The first two schemes handle dry states by switching variables in the reconstruction step, so that bilinear reconstructions are computed using physical variables for small water depths and conserved variables elsewhere. In the third scheme, reconstructed slopes are modified in cells containing dry zones to ensure non-negative values at integration points. We discuss how single and double-precision arithmetic affect accuracy and efficiency, scalability and resource utilization for our implementations, and demonstrate that all three schemes map very well to current GPU hardware. We have also implemented direct and close-to-photo-realistic visualization of simulation results on the GPU, giving visual simulations with interactive speeds for reasonably-sized grids.
(A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig: “Simulation and Visualization of the Saint-Venant System using GPUs”. In review, February 2010. Link to PDF preprint, YouTube video)
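For readers unfamiliar with the model, the Saint-Venant (shallow-water) system with variable bottom topography is commonly written in the conservative form below. The notation is the usual textbook one and is an assumption here, not taken from the abstract: h is the water depth, (u, v) the depth-averaged velocity, B the bottom elevation and g the gravitational acceleration. The friction term that the schemes support is left out.

```latex
\begin{aligned}
  \partial_t h + \partial_x (hu) + \partial_y (hv) &= 0, \\
  \partial_t (hu) + \partial_x \!\left(hu^2 + \tfrac{1}{2} g h^2\right) + \partial_y (huv) &= -g h \, \partial_x B, \\
  \partial_t (hv) + \partial_x (huv) + \partial_y \!\left(hv^2 + \tfrac{1}{2} g h^2\right) &= -g h \, \partial_y B.
\end{aligned}
```

In this notation the lake-at-rest state is u = v = 0 with h + B constant; a well-balanced scheme is one whose discrete flux and source terms cancel exactly for that state, so a still lake stays still.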
Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
(A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli: “State-of-the-Art in Heterogeneous Computing”, IOS Press, 18(1) (2010), pp. 1-33. Link to PDF)
GPGPU stands for General-Purpose computation on Graphics Processing Units. Graphics Processing Units (GPUs) are high-performance many-core processors that can be used to accelerate a wide range of applications. GPGPU.org is a central resource for GPGPU news and information.