Supercomputing 2009 birds-of-a-feather session on “The Art of Performance Tuning for CUDA and Manycore Architectures”

December 2nd, 2009

High throughput architectures for HPC seem likely to emphasize many cores with deep multithreading, wide SIMD, and sophisticated memory hierarchies. GPUs present one example, and their high throughput has led a number of researchers to port computationally intensive applications to NVIDIA’s CUDA architecture.

This session explored the art of performance tuning for CUDA using several case studies. Topics included profiling to identify bottlenecks, effective use of the GPU’s memory hierarchy and DRAM interface to maximize bandwidth, data versus task parallelism, and avoiding SIMD divergence.  Many of the lessons learned in the context of CUDA are likely to apply to other many-core architectures used in HPC applications.