Accelerate Your Science on the Titan Supercomputer

April 1st, 2012

Accelerate your science on the Titan Supercomputer later this year by harnessing up to 20 petaflops of parallel processing using GPUs. Open to researchers from academia, government labs, and industry, the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program is the primary means by which the scientific community gains access to some of the fastest supercomputers.

First, let INCITE know you are interested in GPU acceleration by completing a two-minute survey. Then determine if you want to submit a formal proposal by June 27, 2012.

Need help drafting your proposal? Attend a “how-to” webinar on Tuesday, April 24, for tips and tricks. For further questions about the call for proposals, please contact the INCITE manager at

3 of the 5 fastest supercomputers in the world use GPUs

November 17th, 2010

The latest Top500 list of the world’s fastest supercomputers, released November 15th, shows that GPUs are being adopted on a large scale in HPC. Three of the top five machines (#1 and #3 in China, and #4 in Japan) feature NVIDIA Tesla GPUs. The list also confirms the expected result: the new GPU-based Tianhe-1A machine from China has ousted Jaguar from the top spot.

More details at

NVIDIA Tesla GPUs Power World’s Fastest Supercomputer

October 28th, 2010

From a press release:

SANTA CLARA, CA — (Marketwire) — 10/28/2010 — Tianhe-1A, a new supercomputer revealed today at HPC 2010 China, has set a new performance record of 2.507 petaflops, as measured by the LINPACK benchmark, making it the fastest system in China and in the world today.

Tianhe-1A epitomizes modern heterogeneous computing by coupling massively parallel GPUs with multi-core CPUs, enabling significant achievements in performance, size and power. The system uses 7,168 NVIDIA® Tesla™ M2050 GPUs and 14,336 CPUs; it would require more than 50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone.

GPU Supercomputer #2 in Top500

May 31st, 2010

The June 2010 Top500 list of the world’s fastest supercomputers was released this week at ISC 2010. The US Jaguar supercomputer (located at the Department of Energy’s Oak Ridge Leadership Computing Facility) retained the top spot in Linpack performance. However, a Chinese cluster called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs, is now the fastest in theoretical peak performance at 2.98 PFlop/s and ranks No. 2 with a Linpack performance of 1.271 PFlop/s. This is the highest rank a GPU-accelerated system, or a Chinese system, has ever achieved on the Top500 list.

For more information, visit

Supercomputing 2009 birds-of-a-feather session on “The Art of Performance Tuning for CUDA and Manycore Architectures”

December 2nd, 2009

High-throughput architectures for HPC seem likely to emphasize many cores with deep multithreading, wide SIMD, and sophisticated memory hierarchies. GPUs are one example, and their high throughput has led a number of researchers to port computationally intensive applications to NVIDIA’s CUDA architecture.

This session explored the art of performance tuning for CUDA using several case studies. Topics included profiling to identify bottlenecks, effective use of the GPU’s memory hierarchy and DRAM interface to maximize bandwidth, data versus task parallelism, and avoiding SIMD divergence.  Many of the lessons learned in the context of CUDA are likely to apply to other many-core architectures used in HPC applications.
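Two of the topics above, DRAM bandwidth and SIMD divergence, can be illustrated with a toy cost model. The sketch below is plain Python rather than CUDA, and it is not taken from the session. The group size of 32 threads, the 4-byte words, the 128-byte memory segments, and both function names are assumptions chosen purely for illustration.

```python
def segments_touched(stride, threads=32, word_bytes=4, segment_bytes=128):
    """Count distinct memory segments read when thread i loads word i * stride.

    Fewer segments means fewer memory transactions for the same useful data.
    All sizes are illustrative assumptions, not hardware specifications.
    """
    segments = {(i * stride * word_bytes) // segment_bytes for i in range(threads)}
    return len(segments)


def divergent_passes(branch_ids):
    """Simplified divergence model: a SIMD group executes each distinct
    branch taken by any of its threads, one branch after another."""
    return len(set(branch_ids))


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print("stride", stride, "->", segments_touched(stride), "segments")
    print("uniform branch:", divergent_passes([0] * 32), "pass(es)")
    print("odd/even branch:", divergent_passes([i % 2 for i in range(32)]), "pass(es)")
```

Under these assumptions, contiguous access (stride 1) is served from a single segment, while a stride of 32 words touches a separate segment for every thread. Real hardware behavior depends on the GPU generation and its caches, so treat the numbers as a qualitative guide only.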

Supercomputing 2009 Tutorial: High-Performance Computing with CUDA

November 30th, 2009

The presentation slides from the Supercomputing 2009 full-day tutorial “High-Performance Computing with CUDA” are now available at


NVIDIA’s CUDA is a general-purpose architecture for writing highly parallel applications. CUDA provides several key abstractions (a hierarchy of thread blocks, shared memory, and barrier synchronization) for scalable high-performance parallel computing. Scientists throughout industry and academia use CUDA to achieve dramatic speedups on production and research codes. The CUDA architecture supports many languages, programming environments, and libraries, including C, Fortran, OpenCL, DirectX Compute, Python, Matlab, FFT, and LAPACK.
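To make the three abstractions concrete, here is a minimal sketch that emulates them sequentially in plain Python. It is not CUDA code and not material from the tutorial. The function name `block_sum`, the block size, and the choice of a tree reduction as the example are all assumptions made for illustration. Each outer iteration plays the role of one thread block, the `shared` list stands in for per-block shared memory, and the comments mark where a real kernel would need a barrier.

```python
def block_sum(data, block_dim=4):
    """Emulate a per-block tree reduction; returns one partial sum per block.

    block_dim must be a power of two so the stride can be halved cleanly.
    """
    if block_dim <= 0 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a positive power of two")

    partials = []
    num_blocks = (len(data) + block_dim - 1) // block_dim
    for block_idx in range(num_blocks):  # blocks are independent of each other
        # Each emulated thread copies one element into the block's shared
        # buffer; threads past the end of the input contribute zero.
        shared = [0] * block_dim
        for tid in range(block_dim):
            gid = block_idx * block_dim + tid
            if gid < len(data):
                shared[tid] = data[gid]
        # (barrier: every load must finish before the reduction starts)

        stride = block_dim // 2
        while stride > 0:
            # Threads 0..stride-1 each add in a partner element. Reads come
            # from indices >= stride and writes go to indices < stride, so
            # this sequential loop matches the parallel result.
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            # (barrier: this step must finish before the stride is halved)
            stride //= 2
        partials.append(shared[0])
    return partials


if __name__ == "__main__":
    parts = block_sum(list(range(1, 11)), block_dim=4)
    print(parts, sum(parts))
```

Because blocks never communicate with one another in this sketch, they could run in any order or all at once, which is the property that lets the same program scale across GPUs with different numbers of cores.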

In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming, motivate its use with many brief examples from different HPC domains, and discuss tools and programming environments. The afternoon will discuss advanced issues such as optimization and sophisticated algorithms/data structures, closing with real-world case studies from domain scientists using CUDA for computational biophysics, fluid dynamics, seismic imaging, and theoretical physics.

CfP: International Conference on Supercomputing (ICS’10)

November 30th, 2009

24th International Conference on Supercomputing (ICS’10)
June 1-4, 2010
Epochal Tsukuba (Tsukuba International Congress Center)
Tsukuba, Japan
Sponsored by ACM/SIGARCH

ICS is the premier international forum for the presentation of research results in high-performance computing systems. In 2010 the conference will be held at the Epochal Tsukuba (Tsukuba International Congress Center) in Tsukuba City, the largest high-tech and academic city in Japan.

Papers are solicited on all aspects of research, development, and application of high-performance experimental and commercial systems. Special emphasis will be given to work that leads to better understanding of the implications of the new era of million-scale parallelism and Exa-scale performance; including (but not limited to):

Using Many-Core Hardware to Correlate Radio Astronomy Signals

August 26th, 2009


A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will require on the order of exaflops of computation and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware, to increase flexibility and reduce development effort. Examples include e-VLBI and LOFAR.
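The quadratic growth follows from the fact that every pair of data streams has to be correlated. The sketch below is a minimal Python illustration, not the implementation evaluated in the paper. The function names, the decision to include each stream's correlation with itself, and the convention of conjugating the second stream are assumptions.

```python
def baseline_count(n_streams, include_autocorrelations=True):
    """Number of stream pairs a correlator must process."""
    if include_autocorrelations:
        return n_streams * (n_streams + 1) // 2
    return n_streams * (n_streams - 1) // 2


def correlate(streams):
    """Cross-correlate equal-length lists of complex samples.

    Returns {(i, j): sum over t of streams[i][t] * conj(streams[j][t])}
    for every pair with i <= j.
    """
    n = len(streams)
    result = {}
    for i in range(n):
        for j in range(i, n):
            result[(i, j)] = sum(
                a * b.conjugate() for a, b in zip(streams[i], streams[j])
            )
    return result


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "streams ->", baseline_count(n), "pairs")
    print(correlate([[1 + 1j, 2 + 0j], [1j, 1 + 0j]]))
```

Doubling the number of streams roughly quadruples the number of pairs, while the input data only doubles. A production correlator does considerably more than this (for example, it processes continuous real-time input), so the sketch only shows the core multiply-accumulate pattern.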

In this paper, we evaluate the correlator algorithm on multi-core CPUs and many-core architectures, such as NVIDIA and ATI GPUs, and the Cell/B.E. The correlator is a streaming, real-time application, and is much more I/O intensive than applications that are typically implemented on many-core hardware today. We compare with the LOFAR production correlator on an IBM Blue Gene/P supercomputer. We investigate performance, power efficiency, and programmability. We identify several important architectural problems which cause architectures to perform suboptimally. Our findings are applicable to data-intensive applications in general.

Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

August 23rd, 2009


Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.

(“Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”. Nathan Bell and Michael Garland, in “Proc. Supercomputing ’09”, August 2009.)
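For readers new to SpMV, the following is a minimal sequential sketch in Python using the compressed sparse row (CSR) layout, a widely used sparse format. It is not one of the GPU kernels from the paper, and the abstract above does not say which formats the authors use. The function name `spmv_csr` and the example matrix are assumptions for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r]:row_ptr[r + 1] is the slice of col_idx/values holding the
    nonzeros of row r.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect read of x: this irregular access pattern is what
            # makes sparse kernels sensitive to memory behavior.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

Each row's dot product is independent, which exposes the fine-grained parallelism the abstract mentions. Rows can differ widely in length, though, which is one source of the irregularity a throughput-oriented implementation has to manage.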

Path to Petascale: Adapting GEO/CHEM/ASTRO Applications for Accelerators and Accelerator Clusters

June 4th, 2009

The goal of this workshop, held at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, was to help computational scientists in the geosciences, computational chemistry, and astronomy and astrophysics communities take full advantage of emerging high-performance computing resources based on computational accelerators, such as clusters with GPUs and Cell processors.

Slides are now available online and cover a wide range of topics, including:

  • GPU and Cell programming tutorials
  • GPU and Cell technology
  • Accelerator programming, clusters, frameworks, and building blocks such as sparse matrix-vector products and tree-based algorithms, with particular attention to integrating accelerators into large, established code bases
  • Case studies and posters from geosciences, computational chemistry, and astronomy/astrophysics, covering the simulation of earthquakes, molecular dynamics, solar radiation, tsunamis, weather prediction, climate modeling, and n-body systems, as well as Monte Carlo, Euler, Navier-Stokes, and Lattice-Boltzmann simulations

(National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign: Path to Petascale workshop presentations, organized by Wen-mei Hwu, Volodymyr Kindratenko, Robert Wilhelmson, Todd Martínez and Robert Brunner)