As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression is a promising approach to increase the effective capacity of memory systems while also providing performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help researchers gain insight into the potential role of compression in the memory components of future extreme-scale systems.
Sparsh Mittal and Jeffrey Vetter, “A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems”, IEEE TPDS 2015.
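To make "effective capacity" concrete, here is a minimal sketch of one well-known idea from the cache-compression literature, base+delta encoding in the spirit of Base-Delta-Immediate compression: when the words of a cache line are numerically close (for example, nearby pointers), store one full base word plus narrow deltas. The 64-byte line, 8-byte words and single-base scheme are simplifying assumptions; this is not a specific design from the survey.

```python
import struct

LINE_BYTES = 64  # assumed cache-line size
WORD_BYTES = 8   # assumed word size (64-bit words)

def base_delta_size(line: bytes, delta_bytes: int = 1) -> int:
    """Compressed size of one cache line under a simplified base+delta scheme
    (one base, signed deltas of delta_bytes each), or LINE_BYTES if the line
    cannot be encoded with that delta width."""
    assert len(line) == LINE_BYTES
    # Interpret the line as signed little-endian 64-bit words.
    words = struct.unpack(f"<{LINE_BYTES // WORD_BYTES}q", line)
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)  # signed delta range is [-limit, limit)
    if all(-limit <= w - base < limit for w in words):
        # Base stored in full, every word stored as a narrow delta.
        return WORD_BYTES + len(words) * delta_bytes
    return LINE_BYTES  # not compressible: store the line uncompressed
```

For example, eight consecutive 8-byte pointers such as `struct.pack("<8q", *[0x7F0000001000 + 8 * i for i in range(8)])` have deltas 0..56, so the line shrinks from 64 to 16 bytes, letting four such lines occupy the space of one.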
As CPUs and GPUs are employed in a widening range of applications, it has been acknowledged that each of these processing units (PUs) has unique features and strengths, and hence CPU-GPU collaboration is inevitable for achieving high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs), such as workload partitioning, that enable using both the CPU and the GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler and application levels. Further, we review both discrete and fused CPU-GPU systems, and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide researchers with insights into the working and scope of application of HCTs, and motivate them to further harness the computational power of CPUs and GPUs to achieve the goal of exascale performance.
Sparsh Mittal and Jeffrey Vetter, “A Survey of CPU-GPU Heterogeneous Computing Techniques”, accepted in ACM Computing Surveys, 2015.
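Workload partitioning can be illustrated with its simplest static form: give each processing unit a share of the work proportional to its measured throughput, so that both finish at about the same time and neither sits idle. This is a hypothetical sketch; real HCTs often partition dynamically and must also account for CPU-GPU data-transfer costs, which it ignores.

```python
from typing import Tuple

def split_workload(n_items: int, cpu_rate: float, gpu_rate: float) -> Tuple[int, int]:
    """Split n_items between CPU and GPU in proportion to their measured
    throughputs (items per second). Returns (cpu_items, gpu_items)."""
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("throughputs must be positive")
    # Equal finish times: cpu_items / cpu_rate == gpu_items / gpu_rate.
    gpu_items = round(n_items * gpu_rate / (cpu_rate + gpu_rate))
    return n_items - gpu_items, gpu_items
```

With hypothetical rates of 1e6 items/s on the CPU and 4e6 items/s on the GPU, `split_workload(1_000_000, 1e6, 4e6)` returns `(200000, 800000)`, and each device needs about 0.2 s for its share.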
Geometric Algebra is a new, geometrically intuitive mathematical system. It provides simple algorithms for many application areas, such as computer graphics, computer vision, robotics and computer simulations. The HSA Foundation (Heterogeneous System Architecture Foundation) is a not-for-profit industry standards body, founded by companies such as AMD, ARM, Samsung and Texas Instruments, that focuses on making it dramatically easier to program heterogeneous computing devices such as GPUs.
Since Gaalop (Geometric Algebra Algorithms Optimizer) focuses precisely on optimizing and integrating Geometric Algebra for these kinds of new parallel computing architectures, this technology, together with the new Kalmar C++ AMP compiler, provides a solution for Math, Science & Engineering for HSA.
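As a small illustration of why Geometric Algebra algorithms can be short, the sketch below rotates a 2D vector with a rotor, using the sandwich product R v R̃. It is plain Python written for this note, not Gaalop output or a Gaalop API; it only shows the kind of GA expression that an optimizer such as Gaalop works on.

```python
import math
from typing import Tuple

# A 2D multivector as (scalar, e1, e2, e12), with e1*e1 = e2*e2 = 1 and e12*e12 = -1.
MV = Tuple[float, float, float, float]

def gp(a: MV, b: MV) -> MV:
    """Geometric product of two 2D multivectors."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,  # scalar part
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,  # e1 part
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,  # e2 part
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,  # e12 (bivector) part
    )

def rotate(x: float, y: float, theta: float) -> Tuple[float, float]:
    """Rotate (x, y) counterclockwise by theta with the rotor
    R = cos(theta/2) - sin(theta/2) e12 and its reverse ~R."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    rotor, rotor_rev = (c, 0.0, 0.0, -s), (c, 0.0, 0.0, s)
    _, rx, ry, _ = gp(gp(rotor, (0.0, x, y, 0.0)), rotor_rev)
    return rx, ry
```

For instance, `rotate(1.0, 0.0, math.pi / 2)` yields approximately `(0.0, 1.0)`; the same `gp` routine also composes rotations, since the product of two rotors is again a rotor.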
Developers have been using utility tools such as CPU-Z, GPU-Z, CUDA-Z and OpenCL-Z for a long time. These tools provide detailed platform and hardware information and help developers quickly understand the hardware capabilities.
Recently, as mobile GPUs gain more compute power, OpenCL has become supported on most of the latest mobile phones and tablets. OpenCL-Z Android can help developers quickly detect whether OpenCL is available on a device and get information about its OpenCL-capable platforms and devices.
In addition to detecting OpenCL capability and reporting device information, the OpenCL-Z Android can measure raw compute power in terms of peak ALU GFLOPS and memory bandwidth. These numbers are useful for developers who want to take advantage of the compute capability of modern GPUs: they can roughly predict the performance of a given algorithm on a specific platform, or compare raw compute performance across platforms.
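One common way to turn the two measured numbers (peak GFLOPS and memory bandwidth) into a rough per-algorithm prediction is the roofline model: a kernel is limited either by the ALUs or by how fast memory can feed them. The sketch below is a generic illustration, not part of OpenCL-Z, and the device figures in the usage note are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate of attainable performance for a kernel with the given
    arithmetic intensity (FLOPs per byte moved to/from memory)."""
    # Memory-bound kernels are capped by bandwidth * intensity,
    # compute-bound kernels by the peak ALU throughput.
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)
```

On a hypothetical device with 100 GFLOPS peak and 10 GB/s bandwidth, a float vector add (1 FLOP per 12 bytes) gets only `attainable_gflops(100, 10, 1 / 12)`, about 0.83 GFLOPS, while a kernel doing 20 FLOPs per byte reaches the 100 GFLOPS ceiling.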
The OpenCL-Z Android is free software and is now available on Google Play.
The major features of OpenCL-Z Android:
– detect OpenCL availability;
– detect OpenCL driver library;
– display detailed OpenCL platform information;
– display detailed OpenCL device information;
– measure the raw compute performance and memory system bandwidth;
– export OpenCL information to the SD card;
– share OpenCL information with other applications, such as e-mail clients, note applications, social media and so on.
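The first two features (detecting OpenCL availability and the driver library) can be approximated on a Linux-based system by trying to load a candidate library and calling the standard `clGetPlatformIDs` entry point. This Python/ctypes sketch is an assumption about how such detection could work, not OpenCL-Z's implementation; the candidate paths are locations commonly reported for Android vendor drivers and may differ per device.

```python
import ctypes
import os
from typing import Optional, Sequence, Tuple

# Assumed candidate locations (commonly reported, not exhaustive or authoritative).
DEFAULT_CANDIDATES = (
    "libOpenCL.so",                            # resolved via the loader search path
    "/system/vendor/lib/libOpenCL.so",
    "/system/lib/libOpenCL.so",
    "/system/vendor/lib/egl/libGLES_mali.so",
)

def detect_opencl(candidates: Sequence[str] = DEFAULT_CANDIDATES) -> Optional[Tuple[str, int]]:
    """Return (library, platform_count) for the first library that loads and
    exports clGetPlatformIDs, or None if no OpenCL library is found."""
    for name in candidates:
        if os.path.isabs(name) and not os.path.exists(name):
            continue  # skip absolute paths that are not present on this device
        try:
            lib = ctypes.CDLL(name)
            get_ids = lib.clGetPlatformIDs  # AttributeError if the symbol is missing
        except (OSError, AttributeError):
            continue
        # cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms,
        #                         cl_uint *num_platforms)
        get_ids.argtypes = [ctypes.c_uint32, ctypes.c_void_p,
                            ctypes.POINTER(ctypes.c_uint32)]
        get_ids.restype = ctypes.c_int32
        count = ctypes.c_uint32(0)
        status = get_ids(0, None, ctypes.byref(count))  # query the count only
        return name, (count.value if status == 0 else 0)  # 0 is CL_SUCCESS
    return None
```

Calling `detect_opencl()` returns `None` on a machine without an OpenCL library; otherwise it reports which library was loaded and how many platforms it exposes.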
The OpenCL-Z Android has been tested on mobile devices with Qualcomm Snapdragon 8064, 8974, 8084, 8994 chipsets (with Adreno 305, 320, 330, 420, 430 GPUs), Samsung Exynos 5420, 5433 chipsets (with Mali T628, T760 GPUs), MediaTek MT6752 chipset (with Mali T760 GPU), Rockchip RK3288 (with Mali T764 GPU).
The OpenCL-Z Android should be able to support other chipsets. If your device is known to have OpenCL support, but this tool fails to detect it, please contact the developer of OpenCL-Z.
The author of OpenCL-Z is also trying to compile a relatively complete list of mobile devices that support OpenCL; the list can be found at the OpenCL-Z official website. If you see a device that supports OpenCL but is not on that list, please email the author and help the list grow.
Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems. This has made ‘reliability’ a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving the resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, caches and main memory. In addition, we discuss techniques proposed for non-volatile memory (NVM), GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects and processor designers gain insights into the techniques for improving the reliability of computing systems.
Sparsh Mittal, Jeffrey S Vetter, “A Survey of Techniques for Modeling and Improving Reliability of Computing Systems”, in IEEE TPDS, 2015.
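A widely used metric for quantifying the vulnerability of a processor structure is the architectural vulnerability factor (AVF): the fraction of bit-cycles in which the structure holds ACE (architecturally correct execution) bits, i.e. bits whose corruption would change the program's visible outcome. The sketch computes it from a per-cycle count of ACE bits, which a real study would obtain from simulation; the trace in the usage note is made up, and this is not claimed to be how the survey formulates the metric.

```python
from typing import Sequence

def avf(ace_bits_per_cycle: Sequence[int], structure_bits: int) -> float:
    """Architectural vulnerability factor: ACE bit-cycles divided by the total
    bit-cycles (structure size times number of observed cycles)."""
    if structure_bits <= 0 or not ace_bits_per_cycle:
        raise ValueError("need a positive structure size and at least one cycle")
    total_bit_cycles = structure_bits * len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / total_bit_cycles
```

For an 8-bit structure observed for four cycles with 2, 4, 0 and 6 ACE bits, `avf([2, 4, 0, 6], 8)` gives 12 / 32 = 0.375, meaning that a random bit flip in that structure would corrupt the output with an estimated probability of 37.5%.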
Stanford, CA – 21 April 2015. The organisers of IWOCL (“eye-wok-ul”), the International Workshop on OpenCL, today announced that AMD and HP have sponsored the Advanced Hands-On OpenCL Tutorial that will kick off IWOCL 2015. The tutorial, which will focus on advanced OpenCL concepts, is an extension of the highly successful ‘Hands on OpenCL’ course, which has received over 3,000 downloads. Simon McIntosh-Smith, Senior Lecturer in High Performance Computing and Architectures at the University of Bristol and one of the authors of the original open-source course, will lead the tutorial.
The full-day Advanced Hands-On OpenCL tutorial takes place on Monday 11th May at the Li Ka Shing Center, Stanford University. Registration is $145. For additional information visit: http://www.iwocl.org/conf-2015/handsonopencl-tutorial/
To minimize interference in LTE networks, several inter-cell interference coordination (ICIC) techniques have been introduced. Among them, semi-static ICIC offers a balanced trade-off between applicability and system performance. The power allocation per resource block and cell is adapted on a time scale of seconds according to the load in the system. An open issue in the literature is the question of how fast the adaptation should be performed. This essentially leads to a trade-off between system performance and the feasible computation times of the associated power allocation problems. In this work, we address this open issue by studying the impact that different update periods of semi-static ICIC have on system performance. We conduct our study in realistic scenarios that also account for the mobility of mobile terminals. Second, we consider the implementation aspects of semi-static ICIC. We introduce a very efficient implementation on general-purpose graphics processing units, harnessing the parallel computing capability of such devices. We show that the update periods have a significant impact on the performance of cell-edge terminals. Additionally, we present a graphics processing unit (GPU)-based implementation that speeds up existing implementations by up to a factor of 92.
Parruca, Donald and Aizaz, Fahad and Chantaraskul, Soamsiri and Gross, James. “Semi-static Interference Coordination in OFDMA/LTE Networks: Evaluation of Practical Aspects”. In Proceedings of the 17th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 87-94, 2014.
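Why adapting the power per resource block (RB) helps cell-edge terminals can be seen from the signal-to-interference-plus-noise ratio (SINR) on a single RB: lowering a neighbour cell's transmit power on that RB shrinks the interference term. This generic sketch does not implement the paper's power allocation problem; the powers, path gains and noise level in the usage note are hypothetical.

```python
import math
from typing import Sequence, Tuple

def sinr_db(serving: Tuple[float, float],
            interferers: Sequence[Tuple[float, float]],
            noise_w: float) -> float:
    """SINR in dB on one resource block. Each tuple is (tx_power_w, path_gain)."""
    p, g = serving
    interference = sum(pi * gi for pi, gi in interferers)  # received interference power
    return 10 * math.log10(p * g / (interference + noise_w))
```

For a cell-edge terminal that hears its serving cell and one neighbour equally well (path gain 1e-10, noise 1e-12 W), `sinr_db((1.0, 1e-10), [(1.0, 1e-10)], 1e-12)` is roughly 0 dB; if the neighbour lowers its power on that RB to 0.1 W, the SINR rises to roughly 9.6 dB. How often such per-RB power masks are recomputed is exactly the update period studied in the paper.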
Culises significantly accelerates your OpenFOAM® application by using GPUs for the most computationally intensive tasks.
Its main features are:
- Library for GPU-based acceleration of OpenFOAM®
- Multi-GPU support, significantly reduced computing times
- Highly efficient state-of-the-art iterative solvers like AMG
- Quick and easy installation, no validation necessary
- Flexible interfaces to customer-specific software/engineering applications available
Culises accelerates the linear solver by more than 2x. The overall speedup depends on the type of application and the fraction of time spent in the linear solver. Culises may be tested on FluiDyna’s purpose-built workstation to determine the acceleration potential for your individual OpenFOAM® application. Find out more at: www.culises.com
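The dependence of the overall speedup on the time spent in the linear solver follows Amdahl's law: only the accelerated fraction of the runtime gets faster. The sketch is a generic calculation, not a Culises tool, and the solver fractions in the usage note are hypothetical.

```python
def overall_speedup(solver_fraction: float, solver_speedup: float) -> float:
    """Amdahl's law: overall application speedup when only the fraction of
    runtime spent in the linear solver is accelerated by solver_speedup."""
    if not 0.0 <= solver_fraction <= 1.0 or solver_speedup <= 0:
        raise ValueError("fraction must be in [0, 1] and speedup positive")
    # Unaccelerated part runs at the original speed; solver part shrinks.
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)
```

With a 2x faster solver, a case that spends 60% of its time in the solver speeds up by about 1.43x overall (`overall_speedup(0.6, 2.0)`), while a case that spends 90% there reaches about 1.82x.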
A new open-source CFD project, RapidCFD, has just been published. It uses NVIDIA CUDA for the entire calculation process, which significantly reduces computation time. Its main features:
- most incompressible and compressible solvers on static meshes are available
- all the calculations are done on the GPU
- no overhead for GPU-CPU memory copy
- can run in parallel on multiple GPUs
Visit the RapidCFD project page.
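The benefit of keeping all data on the GPU can be estimated with simple arithmetic: if field data had to cross the host-device link every time step, each step would pay a fixed copy cost on top of the kernel time. The sketch assumes one full round trip per step and a hypothetical link bandwidth; it is not RapidCFD code.

```python
def copy_overhead_fraction(bytes_per_step: float, link_gbs: float,
                           kernel_s: float) -> float:
    """Fraction of each time step spent copying if bytes_per_step were moved
    host->device and back every step over a link of link_gbs GB/s."""
    copy_s = 2 * bytes_per_step / (link_gbs * 1e9)  # one copy in each direction
    return copy_s / (copy_s + kernel_s)
```

For example, moving a hypothetical 400 MB of field data per step over a 12 GB/s link, with 50 ms of GPU computation per step, gives `copy_overhead_fraction(4e8, 12, 0.05)` of about 0.57: more than half of every step would be spent on transfers that an all-GPU design avoids.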
Mobile devices, such as phones and tablets, offer a plethora of media-rich applications such as photo and video recording and editing, natural user interfaces and computer vision. Other areas of embedded image systems are characterized by close-to-sensor processing, such as advanced driver assistance systems, mobile scanners, and smart devices used in medical and industrial imaging. This demands the highest computing capabilities under stringent resource and power budgets, as well as hard real-time constraints.
Future scaling of computing performance mandates dramatically improving the energy efficiency of image systems. One recognized trend is to use heterogeneous hardware, such as big.LITTLE cores, and accelerators such as DSPs, embedded GPUs, FPGAs or dedicated hardware. Another trend is to use new 3D integrated circuit technologies that allow for tighter integration of compute cores, memory and sensors to reduce communication latency and improve bandwidth, leading to lower energy consumption.
This calls for novel methodologies for designing heterogeneous hardware, as well as for shielding software developers from the growing complexity and allowing them to concentrate on algorithm development rather than on low-level implementation details.