“Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!”

May 30th, 2010

Abstract:

In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.

(Rajesh Bordawekar and Uday Bondhugula and Ravi Rao: “Believe It or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!”. Technical Report RC24982, IBM Thomas J. Watson Research Center, Apr. 2010.)
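For readers unfamiliar with the kernel, the O(n^4) dense cross-correlation the abstract describes can be sketched roughly as follows. This is a minimal, illustrative pure-Python version, not the paper's actual implementation; the function and variable names are hypothetical, and the real kernel operates on single-precision floats with vectorized inner loops.

```python
# Hypothetical sketch of the O(n^4) dense cross-correlation kernel described
# in the abstract: for every shift (dy, dx), accumulate the dot product of the
# overlapping region of the image and the reference. Names are illustrative.

def cross_correlate(image, reference):
    """Full 2-D linear cross-correlation of two equal-size matrices
    (given as lists of lists); returns a (2n-1) x (2m-1) result."""
    n, m = len(image), len(image[0])
    out = [[0.0] * (2 * m - 1) for _ in range(2 * n - 1)]
    for dy in range(-(n - 1), n):            # ~2n shifts vertically
        for dx in range(-(m - 1), m):        # ~2m shifts horizontally
            acc = 0.0
            for y in range(n):               # dot product over the overlap:
                for x in range(m):           # O(n*m) per shift => O(n^4) total
                    ry, rx = y + dy, x + dx
                    if 0 <= ry < n and 0 <= rx < m:
                        acc += image[y][x] * reference[ry][rx]
            out[dy + n - 1][dx + m - 1] = acc
    return out
```

The two outer loops over shifts are independent, which is what makes the kernel amenable to both task parallelism (one shift range per thread) and short-vector data parallelism (SSE/VSX over the inner dot product).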


  • Usman

    Although the parameters of this test are not fully described, the experiment is not fair enough to support its conclusions. The latest GPU for computing is the Fermi-architecture-based Nvidia Tesla C2050, which performs many times faster than the GeForce GTX 285, especially in double-precision tests.

  • Well, my opinion is that this experiment is biased towards the CPU implementation. Here are my reasons:

    1. A GPU operates most efficiently on data whose dimensions are powers of two. Thus the test image size should be a power of two (512×512 rather than 500×500 as in the experiment). CORR_SIZE should also be a power of two.

    2. The authors did not use the most efficient cross-correlation algorithm, which involves the Fast Fourier Transform (FFT). Nvidia has implemented a fast FFT module specifically for CUDA:
    http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/CUFFT_Library_3.0.pdf
    So it may be that a cross-correlation implemented with the help of the FFT *may* outperform the CPU versions.

    3. Finally, I don’t think that cross-correlation is suitable for testing parallelization, simply because it may be hard to parallelize. For example, one extreme case of a hard-to-parallelize algorithm is computing the Fibonacci sequence up to the Nth term. So it is natural to expect that, on a hard-to-parallelize algorithm, a GPU behaves (more or less) like a single-core CPU.
    This experiment may just show that obvious thing.
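The FFT route mentioned in point 2 rests on the correlation theorem: the circular cross-correlation of a and b equals IFFT(conj(FFT(a)) · FFT(b)). Below is a minimal pure-Python sketch of this identity in 1-D, using a textbook radix-2 FFT (so lengths must be powers of two — which also echoes point 1). Note this computes *circular* correlation; the linear correlation used in the paper would require zero-padding to at least twice the length. This is an illustration of the commenter's idea, not the CUFFT-based implementation.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(a):
    """Inverse FFT via the conjugation trick."""
    n = len(a)
    y = fft([x.conjugate() for x in a])
    return [x.conjugate() / n for x in y]

def circular_cross_correlation(a, b):
    """corr[k] = sum_j a[j] * b[(j + k) mod n], computed in O(n log n)
    as IFFT(conj(FFT(a)) * FFT(b)) instead of the O(n^2) direct sum."""
    fa = fft([complex(x) for x in a])
    fb = fft([complex(x) for x in b])
    prod = [fa[i].conjugate() * fb[i] for i in range(len(a))]
    return [c.real for c in ifft(prod)]
```

Extending this to 2-D (FFT the rows, then the columns) gives the O(n^2 log n) correlation that CUFFT would accelerate, versus the O(n^4) direct kernel the paper optimizes.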

    • Rajesh Bordawekar

      The point was not to devise the best correlation algorithm, but to optimize a given algorithmic implementation on the GPU and CPUs. While the FFT is another way to compute the correlation, there were reasons for not using it (the kernel was extracted from a real-life computational biology application). The GPU numbers in the original TR have now been improved to 1.22 sec, so the GPU outperforms the Intel numbers (so, this paper is NOT CPU-biased).

  • Devil

    Is there any money involved in this dispute? …I could swear my grandpa outperforms Usain Bolt in a 35 km run (he is tied to an electric wheelchair)…
    Come on, guys, let's be serious and scientific…

    • Jallo Bouchebaba

      Not sure if you are referring to the comments above in general or commenting on the paper itself.