General-purpose multiprocessors (as, in our case, Intel IvyBridge and Intel Haswell) increasingly add GPU computing power to the former multicore architectures. When used for embedded applications (for us, Synthetic aperture radar) with intensive signal processing requirements, they must constantly compute convolution algorithms, such as the famous Fast Fourier Transform. Due to its ”fractal” nature (the typical butterfly shape, with larger FFTs defined as combination of smaller ones with auxiliary data array transpose functions), one can hope to compute analytically the size of the largest FFT that can be performed locally on
an elementary GPU compute block. Then, the full application must be organized around this given building block size. Now, due to phenomena involved in the data transfers between various memory levels across CPUs and GPUs, the optimality of such a scheme is only loosely predictable (as communications tend to overcome in time the complexity of computations). Therefore a mix of (theoretical) analytic approach and (practical) runtime validation is here needed. As we shall illustrate, this occurs at both stage, first at the level of deciding on a given elementary FFT block size, then at the full application level.
Mohamed Amine Bergach, Emilien Kofman, Robert de Simone, Serge Tissot, Michel Syska. Efficient FFT mapping on GPU for radar processing application: modeling and implementation. arXiv:1505.08067 [cs.MS]
The use of GPU computing in FEA is today an active research field. This is primary due to current GPU sparse solvers are partially parallelizable and can hardly make use of Data-Level Parallelism (DLP) for which GPU architectures are designed. This paper proposes a fine-grained implementation of matrix-free Conjugate Gradient (CG) solver for Finite Element Analysis (FEA) using Graphics Processing Unit (GPU) architectures. The proposed GPU instance takes advantage of Massively Parallel Processing (MPP) architectures performing well-balanced parallel calculations at the Degree-of-Freedom (DoF) level of finite elements. The numerical experiments evaluate and analyze the performance of diverse GPU instances of the matrix-free CG solver.
Jesús Martínez-Frutos, Pedro J. Martínez-Castejón, David Herrero-Pérez, Fine-grained GPU implementation of assembly-free iterative solver for finite element problems, Computers & Structures, Volume 157, September 2015, Pages 9-18, ISSN 0045-7949, http://dx.doi.org/10.1016/j.compstruc.2015.05.010.
As the number of cores on a chip increase and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amount of data. In face of such challenges, data compression presents as a promising approach to increase effective memory system capacity and also provide performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help the researchers in gaining insight into the potential role of compression approach in memory components of future extreme-scale systems.
Sparsh Mittal and Jeffrey Vetter, “A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems”, IEEE TPDS 2015. WWW
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
Sparsh Mittal and Jeffrey Vetter, “A Survey of CPU-GPU Heterogeneous Computing Techniques”, accepted in ACM Computing Surveys, 2015. WWW
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability’ a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory (NVM), GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. We believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.
Sparsh Mittal, Jeffrey S Vetter, “A Survey of Techniques for Modeling and Improving Reliability of Computing Systems”, in IEEE TPDS, 2015. WWW
To minimize interference in LTE networks, several inter-cell interference coordination (ICIC) techniques have been introduced. Among them, semi-static ICIC offers a balanced trade-off between applicability and system performance. The power allocation per resource block and cell is adapted in the range of seconds according to the load in the system. An open issue in the literature is the question how fast the adaptation should be performed. This leads basically to a trade-off between system performance and feasible computation times of the associated power allocation problems. In this work, we close this open issue by studying the impact that different durations of update times of semi-static ICIC have on the system performance. We conduct our study on realistic scenarios considering also the mobility of mobile terminals. Secondly, we also consider the implementation aspects of a semi-static ICIC. We introduce a very efficient implementation on general purpose graphic processing units, harnessing the parallel computing capability of such devices. We show that the update periods have a significant impact on the performance of cell edge terminals. Additionally, we present a graphic processing unit (GPU) based implementation which speeds up existing implementations up to a factor of 92x.
Parruca, Donald and Aizaz, Fahad and Chantaraskul, Soamsiri and Gross, James. “Semi-static Interference Coordination in OFDMA/LTE Networks: Evaluation of Practical Aspects. In Proceedings of the 17th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp 87-94 2014.
Recent technological advances have greatly improved the performance and features of embedded systems. With the number of just mobile devices now reaching nearly equal to the population of earth, embedded systems have truly become ubiquitous. These trends, however, have also made the task of managing their power consumption extremely challenging. In recent years, several techniques have been proposed to address this issue. In this paper, we survey the techniques for managing power consumption of embedded systems. We discuss the need of power management and provide a classification of the techniques on several important parameters to highlight their similarities and differences. This paper also reviews those techniques which use GPU and FPGA to improve energy efficiency of embedded systems. This paper is intended to help the researchers and application-developers in gaining insights into the working of power management techniques and designing even more efficient high-performance embedded systems of tomorrow.
Sparsh Mittal, “A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems”, International Journal of Computer Aided Engineering and Technology (IJCAET), vol 6, no. 4, 2014. WWW
GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires deep understanding of the underlying architecture. The developer must struggle with complex index calculations and manual memory transfers. This article classifies memory access patterns used in most parallel algorithms, based on Berkeley’s Parallel “Dwarfs.” It then proposes the MAPS framework, a device-level memory abstraction that facilitates memory access on GPUs, alleviating complex indexing using on-device containers and iterators. This article presents an implementation of MAPS and shows that its performance is comparable to carefully optimized implementations of real-world applications.
Rubin, Eri, et al. [“MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction.”](http://dl.acm.org/citation.cfm?id=2680544) ACM Transactions on Architecture and Code Optimization (TACO) 11.4 (2014): 44.
Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as unique architecture of GPU, rise of CPU-GPU heterogeneous computing, etc., demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide the readers insights into cache management techniques for GPUs and motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
Sparsh Mittal, “A Survey Of Techniques for Managing and Leveraging Caches in GPUs”, Journal of Circuits, Systems, and Computers (JCSC), vol. 23, no. 8, 2014. WWW
The wide majority of current state-of-the-art compressed GPU volume renderers are based on block-transform coding, which is susceptible to blocking artifacts, particularly at low bit-rates. In this paper the authors address the problem for the first time, by introducing a specialized deferred filtering architecture working on block-compressed data and including a novel deblocking algorithm. The architecture efficiently performs high quality shading of massive datasets by closely coordinating visibility- and resolution-aware adaptive data loading with GPU-accelerated per-frame data decompression, deblocking, and rendering. A thorough evaluation including quantitative and qualitative measures demonstrates the performance of our approach on large static and dynamic datasets including a massive 512^4 turbulence simulation (256GB), which is aggressively compressed to less than 2 GB, so as to fully upload it on graphics board and to explore it in real-time during animation.
(Fabio Marton, José Antonio Iglesias Guitián, Jose Díaz and Enrico Gobbetti: “Real-time deblocked GPU rendering of compressed volumes”. Proc. 19th International Workshop on Vision, Modeling and Visualization (VMV), pp. 167-174, Oct. 2014. [WWW])