The presentation slides from the Supercomputing 2009 full-day tutorial “High-Performance Computing with CUDA” are now available at http://gpgpu.org/sc2009.
Supercomputing 2009 Tutorial: High-Performance Computing with CUDA
November 30th, 2009
Supercomputing 2009 CUDA Tutorial
November 30th, 2009
High Performance Computing with CUDA
Welcome to the course notes for the full-day SUPERCOMPUTING 2009 CUDA Tutorial!
(Note: the slides below are also available on the NVIDIA website.)
Abstract
NVIDIA’s CUDA is a general-purpose architecture for writing highly parallel applications. CUDA provides several key abstractions—a hierarchy of thread blocks, shared memory, and barrier synchronization—for scalable high-performance parallel computing. Scientists throughout industry and academia use CUDA to achieve dramatic speedups on production and research codes. The CUDA architecture supports many languages, programming environments, and libraries including C, Fortran, OpenCL, DirectX Compute, Python, Matlab, FFT, LAPACK, etc.
In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will introduce CUDA programming, motivate its use with many brief examples from different HPC domains, and discuss tools and programming environments. The afternoon will discuss advanced issues such as optimization and sophisticated algorithms/data structures, closing with real-world case studies from domain scientists using CUDA for computational biophysics, fluid dynamics, seismic imaging, and theoretical physics.
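The three abstractions named in the abstract (a hierarchy of thread blocks, shared memory, and barrier synchronization) can be illustrated without a GPU. The sketch below is a CPU-side emulation in Python, not CUDA code and not material from the tutorial slides: each Python thread stands in for one CUDA thread, a plain list stands in for per-block shared memory, and `threading.Barrier` stands in for `__syncthreads()`. The names `block_sum` and `grid_sum` are hypothetical and chosen only for this illustration.

```python
import threading


def block_sum(values):
    """Sum `values` with one Python thread per element, mimicking one thread block."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "block size must be a power of two"
    shared = list(values)            # stands in for per-block shared memory
    barrier = threading.Barrier(n)   # stands in for __syncthreads()

    def worker(tid):
        stride = n // 2
        while stride > 0:
            # Only the lower half of the active range accumulates at each step.
            if tid < stride:
                shared[tid] += shared[tid + stride]
            # Every thread waits here, so no one reads stale partial sums.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


def grid_sum(data, block_size=8):
    """Split `data` into independent blocks, reduce each, then combine the partial sums."""
    padded = list(data) + [0] * (-len(data) % block_size)
    partial = [block_sum(padded[i:i + block_size])
               for i in range(0, len(padded), block_size)]
    return sum(partial)


if __name__ == "__main__":
    print(grid_sum(range(1, 21)))
```

Blocks never communicate with each other in this sketch, which is what lets the same program scale across however many blocks the hardware can run at once; only threads inside a block share data and synchronize.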
8:30 Introduction-Overview and CUDA Basics
David Luebke, NVIDIA
[Download PDF]
9:00 CUDA Programming Environments
Ian Buck, NVIDIA
[Download PDF]
10:30 CUDA Libraries & Tools
Jonathan Cohen, NVIDIA
[Download PDF]
11:15 Optimizing GPU Performance and CPU-GPU Performance
Paulius Micikevicius, NVIDIA
[Download PDF]
1:45 Irregular Algorithms & Data Structures
John Owens, University of California, Davis
[Download PDF]
2:30 Molecular Modeling
John Stone, University of Illinois at Urbana-Champaign
[Download PDF]
3:30 Seismic Imaging
Scott Morton, Hess
[Download PDF]
4:00 Computational Fluid Dynamics
Jonathan Cohen, NVIDIA
[Download PDF]
5:00 Quantum Chromodynamics
Michael Clark, Harvard University
[Download PDF]
Workshop on GPU Supercomputing 2009, National Taiwan University
February 3rd, 2009
The first NTU workshop on GPU supercomputing was held at NTU on January 16, 2009. Organized by the Center for Quantum Science and Engineering (CQSE) at National Taiwan University, the workshop consisted of seminars on applications of GPU/CUDA in high performance computations in science and engineering, as well as other fields. Slides from the presentations are now online.
Beyond Programmable Shading SIGGRAPH 2009 Course
August 6th, 2009
The course notes and supplementary material for “Beyond Programmable Shading”, a full-day course held at SIGGRAPH 2009 on August 6, are now available online.
This course is presented in two parts, Beyond Programmable Shading I and Beyond Programmable Shading II.
There are strong indications that the future of interactive graphics programming is a more flexible model than today’s OpenGL/Direct3D pipelines. Graphics developers need a basic understanding of how to combine emerging parallel programming techniques and more flexible graphics processors with the traditional interactive rendering pipeline. The first half of the course introduces the trends and directions in this emerging field. Topics include: parallel graphics architectures, parallel programming models for graphics, and game-developer investigations of the use of these new capabilities in future rendering engines.
The second half of the course has leaders from graphics hardware vendors, game development, and academic research present case studies that show how general parallel computation is being combined with the traditional graphics pipeline to boost image quality and spur new graphics algorithm innovation. Each case study discusses the mix of parallel programming constructs used, details of the graphics algorithm, and how the rendering pipeline and computation interact to achieve the technical goals.
ISC 2009 CUDA/OpenCL Tutorial Slides Posted
June 25th, 2009
A tutorial on High Performance Computing with CUDA was held at the International Supercomputing Conference (ISC) in Hamburg on Monday, June 22nd, 2009. The tutorial included an introduction to the CUDA programming model and C for CUDA, along with details on the CUDA Toolkit, Libraries, and optimization. The tutorial also provided an introduction to OpenCL, and finished with a case study on Computational Fluid Dynamics by Dr. Graham Pullan from Cambridge University. Slides from the tutorial are now posted here on GPGPU.org.
(Massimiliano Fatica, Timo Stich, and Graham Pullan. High Performance Computing with CUDA. Tutorial. International Supercomputing Conference 2009. Hamburg, Germany.)
ISC 2009 CUDA Tutorial
June 25th, 2009
High Performance Computing with CUDA
Welcome to the course notes for the full-day CUDA Tutorial from the 2009 International Supercomputing Conference!
The tutorial was held at the International Supercomputing Conference in Hamburg, Germany on Monday, June 22, 2009.
Course Organizers
Dr. Massimiliano Fatica, NVIDIA Corporation
Course Speakers
Dr. Timo Stich, NVIDIA Corporation
Dr. Graham Pullan, University of Cambridge, UK
Tutorial Slides
Triangular Matrix Inversion on Graphics Processing Units
February 6th, 2010
Abstract:
Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.
(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)
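The abstract describes a divide-and-conquer recursive scheme for triangular matrix inversion but does not give its details here. As a rough illustration of the general idea only, the sketch below uses the standard block identity for a lower-triangular matrix split into blocks A (top left), C (bottom left), and B (bottom right): the inverse has A⁻¹ and B⁻¹ on the diagonal and −B⁻¹ C A⁻¹ below it. This is plain CPU Python, not the authors' GPU implementation, and the helper names `matmul` and `invert_lower` are hypothetical.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def invert_lower(l):
    """Invert a nonsingular lower-triangular matrix by recursive 2x2 blocking."""
    n = len(l)
    if n == 1:
        return [[1.0 / l[0][0]]]
    h = n // 2
    a = [row[:h] for row in l[:h]]   # top-left block, lower triangular
    c = [row[:h] for row in l[h:]]   # bottom-left block, dense
    b = [row[h:] for row in l[h:]]   # bottom-right block, lower triangular
    # The two diagonal blocks are independent subproblems.
    a_inv = invert_lower(a)
    b_inv = invert_lower(b)
    # Off-diagonal block of the inverse: -B^{-1} C A^{-1}
    off = [[-x for x in row] for row in matmul(matmul(b_inv, c), a_inv)]
    top = [row + [0.0] * (n - h) for row in a_inv]
    bottom = [off[i] + b_inv[i] for i in range(n - h)]
    return top + bottom


if __name__ == "__main__":
    lower = [[2.0, 0.0, 0.0],
             [1.0, 4.0, 0.0],
             [3.0, 5.0, 8.0]]
    for row in matmul(lower, invert_lower(lower)):
        print(row)
```

In a formulation like this, the two recursive calls do not depend on each other and most of the remaining work is dense matrix multiplication, which is one plausible reason such a scheme maps well to a GPU.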
Supercomputing 2009 birds-of-a-feather session on “The Art of Performance Tuning for CUDA and Manycore Architectures”
December 2nd, 2009
High throughput architectures for HPC seem likely to emphasize many cores with deep multithreading, wide SIMD, and sophisticated memory hierarchies. GPUs present one example, and their high throughput has led a number of researchers to port computationally intensive applications to NVIDIA’s CUDA architecture.
This session explored the art of performance tuning for CUDA using several case studies. Topics included profiling to identify bottlenecks, effective use of the GPU’s memory hierarchy and DRAM interface to maximize bandwidth, data versus task parallelism, and avoiding SIMD divergence. Many of the lessons learned in the context of CUDA are likely to apply to other many-core architectures used in HPC applications.
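To make the SIMD divergence point concrete, here is a deliberately simplified cost model in Python. It is not taken from the session and ignores everything except branching: it assumes threads are grouped into 32-wide warps, and that a warp whose threads disagree on an if/else condition must execute both sides one after the other. The function `branch_passes` and the two example layouts are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def branch_passes(predicates, warp_size=WARP_SIZE):
    """Count serialized passes over an if/else, given one boolean per thread."""
    passes = 0
    for start in range(0, len(predicates), warp_size):
        warp = predicates[start:start + warp_size]
        # A warp whose lanes all agree runs one side; otherwise it runs both in turn.
        passes += len(set(warp))
    return passes


threads = 128
# Neighbouring threads take different sides, so every warp diverges.
interleaved = [tid % 2 == 0 for tid in range(threads)]
# The condition changes only at warp boundaries, so no warp diverges.
warp_aligned = [(tid // WARP_SIZE) % 2 == 0 for tid in range(threads)]

if __name__ == "__main__":
    print(branch_passes(interleaved), branch_passes(warp_aligned))
```

Under this model the same amount of useful work costs twice as many passes when the branch condition alternates between adjacent threads, which is why arranging data so that whole warps take the same path is a common tuning step.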
CfP: International Conference on Supercomputing (ICS’10)
November 30th, 2009
24th International Conference on Supercomputing (ICS’10)
June 1-4, 2010
Epochal Tsukuba (Tsukuba International Congress Center)
Tsukuba, Japan
Sponsored by ACM/SIGARCH
ICS is the premier international forum for the presentation of research results in high-performance computing systems. In 2010 the conference will be held at the Epochal Tsukuba (Tsukuba International Congress Center) in Tsukuba City, the largest high-tech and academic city in Japan.
Papers are solicited on all aspects of research, development, and application of high-performance experimental and commercial systems. Special emphasis will be given to work that leads to better understanding of the implications of the new era of million-scale parallelism and Exa-scale performance; including (but not limited to):
NVIDIA Announces Next-Generation CUDA GPU Architecture – Codenamed “Fermi”
October 1st, 2009
On September 30th NVIDIA unveiled its latest GPU architecture, codenamed “Fermi”. The first Fermi GPUs will contain 512 “CUDA Cores” and deliver more than 8x the double-precision floating-point throughput of their predecessor, the GT200 GPU. The GPU also incorporates error correcting (ECC) memories and caches, a new cache hierarchy, increased shared memory and register file sizes, and the ability to execute C++ programs.
From the press release:
SANTA CLARA, Calif. -Sep. 30, 2009- NVIDIA Corp. today introduced its next generation CUDA™ GPU architecture, codenamed “Fermi”. An entirely new ground-up design, the “Fermi”™ architecture is the foundation for the world’s first computational graphics processing units (GPUs), delivering breakthroughs in both graphics and GPU computing.
“NVIDIA and the Fermi team have taken a giant step towards making GPUs attractive for a broader class of programs,” said Dave Patterson, director Parallel Computing Research Laboratory, U.C. Berkeley and co-author of Computer Architecture: A Quantitative Approach. “I believe history will record Fermi as a significant milestone.”
Presented at the company’s inaugural GPU Technology Conference, in San Jose, California, “Fermi” delivers a feature set that accelerates performance on a wider array of computational applications than ever before. Joining NVIDIA’s press conference was Oak Ridge National Laboratory who announced plans for a new supercomputer that will use NVIDIA® GPUs based on the “Fermi” architecture. “Fermi” also garnered the support of leading organizations including Bloomberg, Cray, Dell, HP, IBM and Microsoft.