As the number of cores on a chip increase and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amount of data. In face of such challenges, data compression presents as a promising approach to increase effective memory system capacity and also provide performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help the researchers in gaining insight into the potential role of compression approach in memory components of future extreme-scale systems.
Sparsh Mittal and Jeffrey Vetter, “A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems”, IEEE TPDS 2015. WWW
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
Sparsh Mittal and Jeffrey Vetter, “A Survey of CPU-GPU Heterogeneous Computing Techniques”, accepted in ACM Computing Surveys, 2015. WWW
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability’ a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory (NVM), GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. We believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.
Sparsh Mittal, Jeffrey S Vetter, “A Survey of Techniques for Modeling and Improving Reliability of Computing Systems”, in IEEE TPDS, 2015. WWW
To minimize interference in LTE networks, several inter-cell interference coordination (ICIC) techniques have been introduced. Among them, semi-static ICIC offers a balanced trade-off between applicability and system performance. The power allocation per resource block and cell is adapted in the range of seconds according to the load in the system. An open issue in the literature is the question how fast the adaptation should be performed. This leads basically to a trade-off between system performance and feasible computation times of the associated power allocation problems. In this work, we close this open issue by studying the impact that different durations of update times of semi-static ICIC have on the system performance. We conduct our study on realistic scenarios considering also the mobility of mobile terminals. Secondly, we also consider the implementation aspects of a semi-static ICIC. We introduce a very efficient implementation on general purpose graphic processing units, harnessing the parallel computing capability of such devices. We show that the update periods have a significant impact on the performance of cell edge terminals. Additionally, we present a graphic processing unit (GPU) based implementation which speeds up existing implementations up to a factor of 92x.
Parruca, Donald and Aizaz, Fahad and Chantaraskul, Soamsiri and Gross, James. “Semi-static Interference Coordination in OFDMA/LTE Networks: Evaluation of Practical Aspects. In Proceedings of the 17th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp 87-94 2014.
We introduce a practical partitioning technique designed for parallelizing Position Based Dynamics, and exploiting the ubiquitous multi-core processors present in current commodity GPUs. The input is a set of particles whose dynamics is influenced by spatial constraints. In the initialization phase, we build a graph in which each node corresponds to a constraint and two constraints are connected by an edge if they influence at least one common particle. We introduce a novel greedy algorithm for inserting additional constraints (phantoms) in the graph such that the resulting topology is qˆ-colourable, where qˆ ≥ 2 is an arbitrary number. We color the graph, and the constraints with the same color are assigned to the same partition. Then, the set of constraints belonging to each partition is solved in parallel during the animation phase. We demonstrate this by using our partitioning technique; the performance hit caused by the GPU kernel calls is significantly decreased, leaving unaffected the visual quality, robustness and speed of serial position based dynamics.
(Fratarcangeli M and Pellacini F, Scalable Partitioning for Parallel Position Based Dynamics, Computer Graphics Forum (Special Issue of Eurographics 2015 Conference). Vol. 34(2) 2015)
Recent technological advances have greatly improved the performance and features of embedded systems. With the number of just mobile devices now reaching nearly equal to the population of earth, embedded systems have truly become ubiquitous. These trends, however, have also made the task of managing their power consumption extremely challenging. In recent years, several techniques have been proposed to address this issue. In this paper, we survey the techniques for managing power consumption of embedded systems. We discuss the need of power management and provide a classification of the techniques on several important parameters to highlight their similarities and differences. This paper also reviews those techniques which use GPU and FPGA to improve energy efficiency of embedded systems. This paper is intended to help the researchers and application-developers in gaining insights into the working of power management techniques and designing even more efficient high-performance embedded systems of tomorrow.
Sparsh Mittal, “A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems”, International Journal of Computer Aided Engineering and Technology (IJCAET), vol 6, no. 4, 2014. WWW
GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires deep understanding of the underlying architecture. The developer must struggle with complex index calculations and manual memory transfers. This article classifies memory access patterns used in most parallel algorithms, based on Berkeley’s Parallel “Dwarfs.” It then proposes the MAPS framework, a device-level memory abstraction that facilitates memory access on GPUs, alleviating complex indexing using on-device containers and iterators. This article presents an implementation of MAPS and shows that its performance is comparable to carefully optimized implementations of real-world applications.
Rubin, Eri, et al. ["MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction."](http://dl.acm.org/citation.cfm?id=2680544) ACM Transactions on Architecture and Code Optimization (TACO) 11.4 (2014): 44.
The cellular process responsible for providing energy for most life on Earth, namely, photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in an accompanying movie, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movie is the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movie, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.
Visualization of Energy Conversion Processes in a Light Harvesting Organelle at Atomic Detail. M. Sener, J. E. Stone, A. Barragan, A. Singharoy, I. Teo, K. L. Vandivort, B. Isralewitz, B. Liu, B. Goh, J. C. Phillips, L. F. Kourkoutis, C. N. Hunter, and K. Schulten. SC’14 Visualization and Data Analytics Showcase, 2014. Paper PDF
Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as unique architecture of GPU, rise of CPU-GPU heterogeneous computing, etc., demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide the readers insights into cache management techniques for GPUs and motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
Sparsh Mittal, “A Survey Of Techniques for Managing and Leveraging Caches in GPUs”, Journal of Circuits, Systems, and Computers (JCSC), vol. 23, no. 8, 2014. WWW
Recent years have witnessed a phenomenal growth in the computational capabilities and applications of GPUs. However, this trend has also led to dramatic increase in their power consumption. This paper surveys research works on analyzing and improving energy efficiency of GPUs. It also provides a classification of these techniques on the basis of their main research idea. Further, it attempts to synthesize research works which compare energy efficiency of GPUs with other computing systems, e.g. FPGAs and CPUs. The aim of this survey is to provide researchers with knowledge of state-of-the-art in GPU power management and motivate them to architect highly energy-efficient GPUs of tomorrow.
Sparsh Mittal, Jeffrey S Vetter, “A Survey of Methods for Analyzing and Improving GPU Energy Efficiency”, in ACM Computing Surveys, vol. 47, no. 2, pp. 19:1-19:23, 2014. [WWW]