General-purpose multiprocessors (as, in our case, Intel IvyBridge and Intel Haswell) increasingly add GPU computing power to the former multicore architectures. When used for embedded applications (for us, Synthetic aperture radar) with intensive signal processing requirements, they must constantly compute convolution algorithms, such as the famous Fast Fourier Transform. Due to its ”fractal” nature (the typical butterfly shape, with larger FFTs defined as combination of smaller ones with auxiliary data array transpose functions), one can hope to compute analytically the size of the largest FFT that can be performed locally on
an elementary GPU compute block. Then, the full application must be organized around this given building block size. Now, due to phenomena involved in the data transfers between various memory levels across CPUs and GPUs, the optimality of such a scheme is only loosely predictable (as communications tend to overcome in time the complexity of computations). Therefore a mix of (theoretical) analytic approach and (practical) runtime validation is here needed. As we shall illustrate, this occurs at both stage, first at the level of deciding on a given elementary FFT block size, then at the full application level.
Mohamed Amine Bergach, Emilien Kofman, Robert de Simone, Serge Tissot, Michel Syska. Efficient FFT mapping on GPU for radar processing application: modeling and implementation. arXiv:1505.08067 [cs.MS]
The use of graphic processor units (GPUs) has been recently proposed in computational electromagnetics to accelerate the solution of the electric field integral equation. In these methods, the linear systems obtained by using boundary elements are considered, and then an accelerated solution for a specific excitation is obtained. The existing studies are mostly focused on speeding up the filling time or the LU decomposition of that matrix. This limits the application to simple simulation scenarios if a fast method is not employed. In this paper, we propose a GPU acceleration for FFT-based integral equation solvers. We will investigate the operations involved in the solver, and we will motivate the use of GPUs. Results of numerical tests will be reported firstly on a perfect electric conductor sphere with different radii; then a realistic aircraft will be considered. We found that using GPUs for FFT-based methods allows achieving a reasonable speed-up.
(Elia A. Attardo1, Matteo A. Francavilla, Francesca Vipiana and Giuseppe Vecchi: “Investigation on Accelerating FFT-Based Methods for the EFIE on Graphics Processors”, International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, to appear, Nov. 2012. [DOI])
This new report covers all the performance improvements in the latest CUDA Toolkit 3.2 release, and compares CUDA parallel math library performance vs. commonly used CPU libraries.
Learn about the performance advantages of using the CUDA parallel math libraries for FFT, BLAS, sparse matrix operations, and random number generation.
This work approaches the fundamental problem of accelerating FFT computation by use of GPUs, in order to apply it to Adaptive Optics, the key for obtaining the maximum performance from projected ground-based eXtremely Large Telescopes. A method to efficiently adapt the FFT for the underlying architecture of GPUs is given. The authors derive a novel FFT method that alternates base-2 and base-4 decomposition of the bidimensional domain to take the most from Multiple Render Target extension as they elaborate a very unusual Pease 8-data “butterfly”. (Modal Fourier wavefront reconstruction using GPUs J.G. Marichal-Hernandez, J.M. Rodriguez-Ramos, F. Rosa. La Laguna University. To appear in Journal of Electronic Imaging.)
This white paper from RapidMind and HP compares the performance of BLAS dense linear algebra operations, the FFT, and European option pricing on the GPU against highly tuned CPU implementations on the fastest available CPUs. All of the GPU implementations were made using the RapidMind Development Platform, which allows the use of standard C++ programming to create high-performance parallel applications that run on the GPU. The full source for the samples is available in conjunction with a new beta version of the RapidMind development platform. The results will also be presented as a poster at SC06.
This Supercomputing 2006 paper by Govindaraju et al. presents a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). The memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. It incorporates many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and uses the 3C’s model to analyze the cache misses. Moreover, the paper presents techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of the model, the paper highlights its performance on three memory-intensive scientific applications: sorting, Fast Fourier Transform and dense matrix multiplication. In practice, their cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. The paper also compares its results with prior GPU-based and CPU-based implementations on high-end processors. In practice, they are able to achieve 2x-5x performance improvement. (A Memory Model for Scientic Algorithms on Graphics Processors)
This paper by Govindaraju et al. describes a high-performance FFT algorithm on GPUs. The algorithm is highly tuned for GPUs using memory optimizations. It further improves performance using pipelining strategies. In practice, it is able to achieve 4x higher computational performance on a $500 NVIDIA GPU than optimized single precision FFT algorithms on high-end CPUs costing $1500. (“Efficient memory model for scientific algorithms on graphics processors”, Naga Govindaraju, Scott Larsen, Jim Gray and Dinesh Manocha, UNC Tech. Report 2006)
This paper by Jansen et al. describes how to utilize current commodity graphics hardware to perform Fourier volume rendering directly on the GPU. The paper presents a novel implementation of the Fast Fourier Transform: This Split-Stream-FFT maps the recursive structure of the FFT to the GPU in an efficient way. Additionally, high-quality resampling within the frequency domain is discussed. The implementation enables visualization of large volumetric data sets at interactive frame rates on a mid-range computer system. (Fourier Volume Rendering on the GPU Using a Split-Stream FFT)
This GH 2003 paper by Kenneth Moreland and Edward Angel describes how to use programmable GPUs to perform the fast Fourier transform (FFT). Their system that can synthesize an image by conventional means, perform the FFT, filter the image, and finally apply the inverse FFT in well under 1 second for a 512 by 512 image. (The FFT on a GPU. Kenneth Moreland and Edward Angel. Graphics Hardware 2003.)