General-purpose multiprocessors (as, in our case, Intel IvyBridge and Intel Haswell) increasingly add GPU computing power to the former multicore architectures. When used for embedded applications (for us, Synthetic aperture radar) with intensive signal processing requirements, they must constantly compute convolution algorithms, such as the famous Fast Fourier Transform. Due to its ”fractal” nature (the typical butterfly shape, with larger FFTs defined as combination of smaller ones with auxiliary data array transpose functions), one can hope to compute analytically the size of the largest FFT that can be performed locally on
an elementary GPU compute block. Then, the full application must be organized around this given building block size. Now, due to phenomena involved in the data transfers between various memory levels across CPUs and GPUs, the optimality of such a scheme is only loosely predictable (as communications tend to overcome in time the complexity of computations). Therefore a mix of (theoretical) analytic approach and (practical) runtime validation is here needed. As we shall illustrate, this occurs at both stage, first at the level of deciding on a given elementary FFT block size, then at the full application level.
Mohamed Amine Bergach, Emilien Kofman, Robert de Simone, Serge Tissot, Michel Syska. Efficient FFT mapping on GPU for radar processing application: modeling and implementation. arXiv:1505.08067 [cs.MS]