Direct N-body Kernels for Multicore Platforms

January 24th, 2010

From the abstract:

We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.

Nitin Arora, Aashay Shringarpure, and Richard Vuduc. “Direct n-body kernels for multicore platforms.” In Proc. Int’l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009 (direct link to PDF).
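
For readers new to the term, a "direct" n-body kernel evaluates every pairwise interaction, so the work grows as O(n^2). The sketch below is a minimal pure-Python illustration of that all-pairs loop for gravity. It is not the authors' implementation; the function name, the softening parameter and the choice G = 1 are assumptions made for the example.

```python
import math


def direct_accelerations(positions, masses, softening=1e-3, G=1.0):
    """All-pairs (direct) gravitational accelerations, O(n^2) work.

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening and G are illustrative assumptions, not values from the paper.
    """
    n = len(positions)
    eps2 = softening * softening
    acc = []
    for i in range(n):
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            # G * m_j / r^3, with the softening term folded into r^2
            s = G * masses[j] / (r2 * math.sqrt(r2))
            ax += s * dx
            ay += s * dy
            az += s * dz
        acc.append((ax, ay, az))
    return acc
```

The inner loop over j is the part that the platforms compared in the paper parallelize and vectorize; the arithmetic per pair stays the same.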

Some older publications worth reading

January 17th, 2010

Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted in a timely manner, please remember to use the news submission form.

  • Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/
  • Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware/compiler designer, and analyze CUDA as one current representative (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
  • Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
  • William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008, DOI 10.1145/1365490.1365501).
  • BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM Siggraph 2008, August 2008, DOI 10.1145/1399504.1360618).
  • Finally, Che et al. and Garland et al. survey the field of GPU computing and discuss many different application domains. These articles are, in addition to the ones we have collected on the developer pages, recommended to GPGPU newcomers.

CUDAEASY – a GPU Accelerated Cosmological Lattice Program

December 8th, 2009


This paper presents, to the author’s knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA’s Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available under the GNU General Public License.

The program is freely available at

(Jani Sainio. “CUDAEASY – a GPU Accelerated Cosmological Lattice Program”. submitted to Computer Physics Communications (under review). November 2009.)

Supercomputing 2009 birds-of-a-feather session on “The Art of Performance Tuning for CUDA and Manycore Architectures”

December 2nd, 2009

High throughput architectures for HPC seem likely to emphasize many cores with deep multithreading, wide SIMD, and sophisticated memory hierarchies. GPUs present one example, and their high throughput has led a number of researchers to port computationally intensive applications to NVIDIA’s CUDA architecture.

This session explored the art of performance tuning for CUDA using several case studies. Topics included profiling to identify bottlenecks, effective use of the GPU’s memory hierarchy and DRAM interface to maximize bandwidth, data versus task parallelism, and avoiding SIMD divergence. Many of the lessons learned in the context of CUDA are likely to apply to other many-core architectures used in HPC applications.

Supercomputing 2009 Tutorial: High-Performance Computing with CUDA

November 30th, 2009

The presentation slides from the Supercomputing 2009 full-day tutorial “High-Performance Computing with CUDA” are now available at


NVIDIA’s CUDA is a general-purpose architecture for writing highly parallel applications. CUDA provides several key abstractions—a hierarchy of thread blocks, shared memory, and barrier synchronization—for scalable high-performance parallel computing. Scientists throughout industry and academia use CUDA to achieve dramatic speedups on production and research codes. The CUDA architecture supports many languages, programming environments, and libraries including C, Fortran, OpenCL, DirectX Compute, Python, Matlab, FFT, LAPACK, etc.

In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming, motivate its use with many brief examples from different HPC domains, and discuss tools and programming environments. The afternoon will discuss advanced issues such as optimization and sophisticated algorithms/data structures, closing with real-world case studies from domain scientists using CUDA for computational biophysics, fluid dynamics, seismic imaging, and theoretical physics.
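
The three abstractions named in the tutorial description (thread blocks, shared memory, barrier synchronization) can be illustrated without a GPU. The sketch below is a CPU-only emulation in plain Python, not CUDA code: each Python thread plays the role of one CUDA thread, a list stands in for per-block shared memory, and threading.Barrier plays the role that __syncthreads() has in a CUDA kernel. The names block_sum and grid_sum are hypothetical.

```python
import threading


def block_sum(values, block_size=8):
    """Emulate one thread block summing up to block_size values.

    Uses a tree reduction over a list that stands in for shared memory.
    block_size must be a power of two.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    if len(values) > block_size:
        raise ValueError("too many values for one block")
    # "Shared memory": visible to every thread of the block, zero-padded.
    shared = list(values) + [0] * (block_size - len(values))
    barrier = threading.Barrier(block_size)

    def thread_body(tid):
        stride = block_size // 2
        while stride > 0:
            if tid < stride:
                shared[tid] += shared[tid + stride]
            # Every thread waits here so that the next step only reads
            # values that have been completely written.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=thread_body, args=(t,))
               for t in range(block_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


def grid_sum(data, block_size=8):
    """Emulate a grid: one block per chunk, then combine the partial sums."""
    partials = [block_sum(data[start:start + block_size], block_size)
                for start in range(0, len(data), block_size)]
    return sum(partials)
```

For example, grid_sum(list(range(100))) splits the input into 13 blocks of 8 and adds their partial sums. On a real GPU the blocks would run concurrently and independently, which is what makes the hierarchy scale.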

CheCUDA: A Checkpoint/restart Tool for CUDA Applications

November 25th, 2009

In this paper, Takizawa et al. present CheCUDA, a tool designed to checkpoint CUDA applications. Because existing checkpoint/restart implementations do not support checkpointing GPU state, CheCUDA hooks basic CUDA driver API calls in order to record GPU state changes in main memory. At checkpoint time, CheCUDA copies all necessary data from video memory to main memory, disables the CUDA runtime, and then stores the recorded state changes in a file. At restart, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on the GPUs so that execution resumes from the stored state. The paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for improving the dependability of CUDA applications but also for enabling the dynamic task scheduling that heterogeneous GPU cluster systems in particular require. The paper also reports the timing overhead of checkpointing.

(Hiroyuki Takizawa, Katuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi, CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, to appear in Proceedings of the Tenth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) 2009, Workshop on Ultra Performance and Dependable Acceleration Systems).
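
The checkpoint and restart sequence described above can be pictured with a toy model. The sketch below is plain Python and does not use CUDA or any CheCUDA code: FakeDevice is a hypothetical stand-in for a GPU, and CheckpointingRuntime is a hypothetical wrapper that intercepts allocation and copy calls in the way the paper describes hooking driver API calls. All class and method names are assumptions made for the example.

```python
import pickle


class FakeDevice:
    """Hypothetical stand-in for a GPU: buffers are lost on reset()."""

    def __init__(self):
        self.buffers = {}
        self.next_handle = 1

    def alloc(self, size):
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = bytearray(size)
        return handle

    def write(self, handle, data):
        if len(data) > len(self.buffers[handle]):
            raise ValueError("data larger than buffer")
        self.buffers[handle][:len(data)] = data

    def read(self, handle):
        return bytes(self.buffers[handle])

    def reset(self):
        self.buffers.clear()


class CheckpointingRuntime:
    """Hooks device calls and records what is needed to rebuild the state."""

    def __init__(self, device):
        self.device = device
        self.mapping = {}  # application-visible id -> current device handle
        self.next_id = 1

    def alloc(self, size):
        app_id = self.next_id
        self.next_id += 1
        self.mapping[app_id] = self.device.alloc(size)
        return app_id

    def write(self, app_id, data):
        self.device.write(self.mapping[app_id], data)

    def read(self, app_id):
        return self.device.read(self.mapping[app_id])

    def checkpoint(self, path):
        # 1. Copy all device data to host memory.
        snapshot = {a: self.device.read(h) for a, h in self.mapping.items()}
        # 2. Tear down the device state (analogous to disabling the runtime).
        self.device.reset()
        self.mapping.clear()
        # 3. Store the recorded state in a file.
        with open(path, "wb") as f:
            pickle.dump({"buffers": snapshot, "next_id": self.next_id}, f)

    def restart(self, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.next_id = state["next_id"]
        # Re-create every buffer and map the old application id onto it.
        for app_id, data in state["buffers"].items():
            handle = self.device.alloc(len(data))
            self.device.write(handle, data)
            self.mapping[app_id] = handle
```

Because the application only ever sees its own ids, the state can be restored onto a different FakeDevice instance, which mirrors the migration scenario mentioned in the abstract. A real tool also has to handle kernels, streams and other resources that this toy model ignores.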

GPULib v1.2.2 released

November 25th, 2009

GPULib provides a library of mathematical functions that facilitate the use of high performance computing resources available on modern graphics processing units (GPUs) by engineers, scientists, analysts, and other technical professionals with minimal modification to their existing programs. This software library executes vectorized mathematical functions on NVIDIA GPUs, bringing high-performance numerical operations to everyday desktop computers. By providing bindings for a number of Very High Level Languages (VHLLs) including MATLAB and IDL from ITT Visual Information Solutions, GPULib can accelerate new applications or be incorporated into existing applications with minimal effort. No knowledge of GPU programming and memory management is required. For more information regarding GPULib, please visit

PyCUDA: GPU Run-Time Code Generation for High-Performance Computing

November 25th, 2009


High-performance scientific computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), and PyCUDA, an open-source toolkit that supports this technique.
In introducing PyCUDA, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. It is further observed that, compared to competing techniques, the effort required to create codes using run-time code generation with PyCUDA grows more gently in response to growing needs. The concept of RTCG is simple and easily implemented using existing, robust tools. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.

Preprint at arXiv

(Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih. PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted.)
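
The idea behind run-time code generation can be shown in a few lines, even without a GPU. The sketch below is a CPU-only analogy in plain Python and does not use the PyCUDA API: instead of generating CUDA C source and compiling it for the GPU, it generates Python source specialized to values known only at run time and compiles that with the built-in compile() and exec(). The function name make_poly_evaluator is hypothetical.

```python
def make_poly_evaluator(coefficients):
    """Generate and compile a function specialized to fixed coefficients.

    coefficients are given highest degree first and are baked into the
    generated source as literals, using Horner's rule with no loop.
    Returns the compiled function and the source text that produced it.
    """
    if not coefficients:
        raise ValueError("need at least one coefficient")
    expr = repr(float(coefficients[0]))
    for c in coefficients[1:]:
        expr = f"({expr} * x + {float(c)!r})"
    source = f"def poly(x):\n    return {expr}\n"
    namespace = {}
    # Compile the generated text at run time, as an RTCG toolkit would do
    # with kernel source.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["poly"], source
```

Calling make_poly_evaluator([2, 0, 1]) yields a function for 2x^2 + 1 whose body is a single unrolled expression. The same pattern, with the generated text being a GPU kernel, is what lets a scripting language tune loop bounds, data types or unrolling factors per problem instance.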

NVIDIA Tesla GPUs to Communicate Faster Over Mellanox InfiniBand Networks

November 25th, 2009

From a press release:

New Software Solution Reduces Dependency on CPUs

PORTLAND, Ore.- SC09-Nov. 18, 2009- NVIDIA Corporation (Nasdaq: NVDA) and Mellanox Technologies Ltd. today introduced new software that will increase cluster application performance by as much as 30% by reducing the latency that occurs when communicating over Mellanox InfiniBand to servers equipped with NVIDIA Tesla™ GPUs.

The system architecture of a GPU-CPU server requires the CPU to initiate and manage memory transfers between the GPU and the InfiniBand network. The new software solution will enable Tesla GPUs to transfer data to pinned system memory that a Mellanox InfiniBand solution is able to read and transmit over the network. The result is increased overall system performance and efficiency.

“NVIDIA Tesla GPUs deliver large increases in performance across each node in a cluster, but in our production runs on TSUBAME 1 we have found that network communication becomes a bottleneck when using multiple GPUs,” said Prof. Satoshi Matsuoka from Tokyo Institute of Technology. “Reducing the dependency on the CPU by using InfiniBand will deliver a major boost in performance in high performance GPU clusters, thanks to the work of NVIDIA and Mellanox, and will further enhance the architectural advances we will make in TSUBAME2.0.”

PGI CUDA Fortran Now Available from The Portland Group

November 24th, 2009

The Portland Group has announced the general availability of its CUDA Fortran compiler for x64 and x86 processor-based systems running Linux, Mac OS X and Windows, including a 15-day trial license. From the press release:

Developed in collaboration with NVIDIA Corporation (Nasdaq: NVDA), the inventor of the graphics processing unit (GPU), PGI Release 2010 includes the first Fortran compiler compatible with the NVIDIA line of CUDA-enabled GPUs. A compiler is a software tool that translates applications from the high-level programming languages in which they are written by software developers into a binary form a computer can execute.

With developers taking advantage of the hundreds of cores and the relatively low cost of NVIDIA GPUs, programming to take advantage of the CUDA C compiler has become a popular means for accelerating the solution of complex computing problems. The PGI CUDA Fortran compiler is expected to accelerate GPU adoption even further in the High-Performance Computing (HPC) industry, where many important applications are written in Fortran. HPC is the field of technical computing engaged in the modeling and simulation of complex processes, such as ocean modeling, weather forecasting, environmental modeling, seismic analysis, bioinformatics and other areas.

The CUDA Fortran compiler is compatible with all NVIDIA GPUs that include Compute Capability 1.3 or higher, which includes most NVIDIA Quadro Professional Graphics solutions and all NVIDIA Tesla GPU Computing solutions. Developers are invited to download the PGI CUDA Fortran compiler from The Portland Group website at

A 15-day trial license is available at no charge. In an effort to simplify adoption, NVIDIA has granted PGI rights to redistribute the relevant components of the CUDA Software Development Kit (SDK) as part of the PGI CUDA Fortran installation package.
