Modern GPUs perform floating-point math and read data from off-chip memory at rates roughly five times those of a fast Pentium 4 CPU. However, the performance of algorithms for computing dense matrix-matrix products on GPUs has lagged behind that of good CPU implementations. In this paper, we show that this result is not an artifact of poorly designed algorithms, and explain why present-day graphics architectures are highly inefficient for computations such as matrix-matrix multiplication that involve significant data reuse. (Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. Kayvon Fatahalian, Jeremy Sugerman, and Pat Hanrahan.)
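To make "significant data reuse" concrete, the following is a minimal illustrative sketch, not the analysis used in the paper. It multiplies two square matrices in b x b tiles and counts how many operand words must be fetched under a simplified, assumed cost model: each tile of A and B is loaded once per tile-level product and then reused from fast storage. The function names `matmul_traffic` and `blocked_matmul` and the cost model itself are hypothetical choices made for illustration.

```python
def matmul_traffic(n, b):
    """Return (flops, words_loaded) for an n x n product using b x b tiles.

    Assumed model: one b x b tile of A and one of B are fetched for each
    tile-level product and then reused entirely from fast storage.
    """
    if n % b != 0:
        raise ValueError("n must be a multiple of b")
    tiles = n // b
    flops = 2 * n ** 3                     # one multiply and one add per term
    words_loaded = tiles ** 3 * 2 * b * b  # two tiles per tile-level product
    return flops, words_loaded


def blocked_matmul(A, B, b):
    """Multiply square matrices (lists of lists) tile by tile."""
    n = len(A)
    if n % b != 0:
        raise ValueError("matrix size must be a multiple of b")
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # Within this tile-level product, each loaded element of A
                # and B takes part in b multiply-adds.
                for i in range(i0, i0 + b):
                    for j in range(j0, j0 + b):
                        acc = C[i][j]
                        for k in range(k0, k0 + b):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


if __name__ == "__main__":
    for b in (1, 4, 32):
        flops, words = matmul_traffic(1024, b)
        print(f"tile size {b:2d}: {flops / words:.1f} flops per word loaded")
```

Under this model the ratio of arithmetic to memory traffic grows linearly with the tile size b, so an implementation that can keep larger tiles in fast storage needs proportionally less memory bandwidth per floating-point operation. With b = 1 (no reuse), every multiply-add fetches both of its operands.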