“Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”

October 27th, 2010

Abstract:

In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using operations involving dot-products and additions. We implement this algorithm on an NVIDIA Fermi GPU (Tesla 2050) using CUDA, and also manually parallelize it for the Intel Xeon X5680 (Westmere) and IBM Power7 multi-core processors. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version on the multi-core CPUs. A number of optimizations were performed for the GPU implementation on the Fermi, including blocking for Fermi’s configurable on-chip memory architecture. Experimental results illustrate that on a single multi-core processor, the manually parallelized versions of the correlation application run only a small factor slower than the CUDA version executing on the Fermi: 1.005s on Power7, 3.49s on Intel X5680, and 465ms on Fermi. On a two-processor Power7 system, performance approaches that of the Fermi (650ms), while the Intel version runs in 1.78s. These results conclusively demonstrate that performance of the GPU memory subsystem is critical to effectively harness its computational capabilities. For the correlation application, a significantly higher amount of effort was put into developing the GPU version when compared to the CPU ones (several days against a few hours). Our experience presents compelling evidence that performance comparable to that of GPUs can be achieved with much greater productivity on modern multi-core CPUs.

(R. Bordawekar and U. Bondhugula and R. Rao: “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”, Technical Report, IBM T. J. Watson Research Center, 2010 [PDF])
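To make the abstract's "dot-products and additions" concrete, here is a minimal sketch of brute-force 2-D cross-correlation. It is not the report's code: the function names `cross_correlate` and `best_offset`, the list-of-lists layout, and the choice to score only fully overlapping positions are assumptions made for illustration.

```python
def cross_correlate(image, reference):
    """Slide `image` over `reference`; each score is one dot product.

    Both arguments are lists of rows of floats. Only offsets where the
    image lies fully inside the reference are scored (an assumption).
    """
    ih, iw = len(image), len(image[0])
    rh, rw = len(reference), len(reference[0])
    scores = []
    for dy in range(rh - ih + 1):
        row = []
        for dx in range(rw - iw + 1):
            acc = 0.0
            # Dot product of the image with the window at (dy, dx).
            for y in range(ih):
                for x in range(iw):
                    acc += image[y][x] * reference[dy + y][dx + x]
            row.append(acc)
        scores.append(row)
    return scores


def best_offset(scores):
    """Return the (dy, dx) offset with the highest correlation score."""
    return max(
        ((dy, dx) for dy in range(len(scores)) for dx in range(len(scores[0]))),
        key=lambda p: scores[p[0]][p[1]],
    )


if __name__ == "__main__":
    reference = [[0.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 3.0, 4.0]]
    image = [[1.0, 2.0], [3.0, 4.0]]
    print(best_offset(cross_correlate(image, reference)))
```

Every output element is an independent dot product, which is why this kind of loop nest parallelizes readily across CPU threads and GPU thread blocks alike.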


7 Responses to “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”

  1. asdfasdf says:

    lol, it’s theoretically impossible for CPUs to match the GPUs for MANY applications BUT not all.

    Yesterday I spent 1 hour writing an application in CUDA. The performance I got was ~700 GFLOPS. My Core i7 950 could never come close to matching this performance. So for HPC, CPUs generally don’t stand a chance.

    • Jallo Bouchebaba says:

      Even CuBLAS DGEMM from nVidia (expected to be highly tuned) reaches only 50% of GPU peak, while DGEMM on CPUs reaches 85 to 95% of machine peak. Ever wonder why? So the fact that they could only get that application to perform at 12% of peak says something about the architecture…

  2. noramlized says:

    Yeah, also note that their application only managed to utilize 12% of the theoretical performance of the GPU. This report is a cherry-picked joke…

    • Jallo Bouchebaba says:

      Even CuBLAS DGEMM from nVidia (expected to be highly tuned) reaches only 50% of GPU peak, while DGEMM on CPUs reaches 85 to 95% of machine peak. Ever wonder why? So the fact that they could only get that application to perform at 12% of peak says something about the architecture…

  3. A.K. says:

    It seems to me that the most interesting part of the paper is the table showing sustained performance. After several weeks of development, the authors managed to reach only 12% utilization of the Fermi GPU. As expected from IBM employees, the corresponding figure for the IBM chip is much higher.
    Also, why didn’t the authors use any of the available matrix libraries for Fermi? It surely would have shortened the development time.

  4. Vadim says:

    Did anyone notice the bad joke in the algorithm itself? If you want to compute the correlation of two images, the way to do it is via FFT. Do the FFT on each image, multiply them in the frequency domain, and inverse-FFT the result. That’s it. This has a complexity of O(n^2*log(n^2)), i.e. it is probably a million times faster than the brute-force approach!

  5. newcomer says:

    This reminds me of the other post that appeared here:
    http://gpgpu.org/2010/07/04/debunking-the-100x-myth

    This time it is from IBM, that time it was from Intel.

    I would suggest that the authors run their efficient correlation algorithm, and find out why Intel and IBM are so eagerly demystifying GPGPU performance.

    (hint: see HPC market share)
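Vadim's FFT suggestion in comment 4 can be sketched as follows. This is a minimal, standard-library-only illustration, not code from the report: the helper names `fft`, `fft2`, `fft_correlate`, and `direct_correlate` are hypothetical, the dimensions are assumed to be powers of two, and the result is a circular correlation (shifts wrap around), so real images would need zero-padding first. One detail the comment glosses over: one spectrum has to be complex-conjugated before the multiplication, otherwise the result is a convolution rather than a correlation.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT, unnormalized; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(m, inverse=False):
    """2-D FFT: transform every row, then every column."""
    rows = [fft(list(r), inverse) for r in m]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def fft_correlate(a, b):
    """Circular correlation: c[dy][dx] = sum a[y][x] * b[(y+dy)%H][(x+dx)%W]."""
    h, w = len(a), len(a[0])
    fa, fb = fft2(a), fft2(b)
    # Conjugating one spectrum turns convolution into correlation.
    prod = [[fa[y][x].conjugate() * fb[y][x] for x in range(w)] for y in range(h)]
    c = fft2(prod, inverse=True)
    # The inverse transform above is unnormalized, so divide by H*W here.
    return [[c[y][x].real / (h * w) for x in range(w)] for y in range(h)]


def direct_correlate(a, b):
    """Brute-force reference with the same circular definition."""
    h, w = len(a), len(a[0])
    return [
        [
            sum(a[y][x] * b[(y + dy) % h][(x + dx) % w]
                for y in range(h) for x in range(w))
            for dx in range(w)
        ]
        for dy in range(h)
    ]
```

For an n-by-n image the direct version does on the order of n^4 multiply-adds, while the FFT version does on the order of n^2 log n; how large the actual speedup is depends on image size and implementation.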
