Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent potential processing parallelism available in modern GPGPUs.
(S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, June/July 2010.)
Ocelot 2.0.969 brings CUDA 3.2 and Fermi support to a stable release. Ocelot is a BSD-licensed open source implementation of the CUDA runtime, a PTX emulator, and a mid-level PTX compiler.
Here is a feature list for 2.0.969:
- PTX 2.2 and Fermi device support: Floating point results should be within the ULP limits in the PTX ISA manual. Over 500 unit tests verify that the behaviour matches NVIDIA devices.
- Four target device types: A functional PTX emulator. A PTX to LLVM to x86/ARM JIT. A PTX to CAL JIT for AMD devices (beta). A PTX to PTX JIT for NVIDIA devices.
- A full-featured PTX 2.2 IR: An analysis/optimization pass interface over PTX (Control flow graph, dataflow graph, dominator/postdominator trees, structured control tree). Optimizations can be plugged in as modules.
- Correctness checking tools: A memory checker (detects unaligned and out of bounds accesses). A race detector. An interactive debugger (allows stepping through PTX instructions).
- An instruction trace analyzer interface: Allows user-defined modules to receive callbacks when PTX instructions are executed. Can be used to compute metrics over applications or perform correctness checks.
- A CUDA API frontend: Existing CUDA programs can be directly linked against Ocelot. Device pointers can be shared across host threads. Multiple devices can be controlled from the same host thread (cudaSetDevice can be called multiple times).
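The callback model behind the trace analyzer interface, and the kind of check a memory checker performs, can be illustrated with a small toy sketch. This is not Ocelot's API: the `Instruction` record, the `on_instruction` method, the `run_trace` loop and the sample trace are all hypothetical names and data invented for illustration. The sketch only shows the general pattern of an emulator loop notifying user-defined modules once per executed instruction, with one module computing a metric and another performing a correctness check.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instruction:
    """One executed instruction in a toy trace (hypothetical record layout)."""
    opcode: str                    # e.g. "add.f32" or "ld.global.f32"
    address: Optional[int] = None  # byte address for memory operations, else None
    size: int = 0                  # access width in bytes for memory operations


class OpcodeHistogram:
    """Metric observer: counts how often each opcode is executed."""

    def __init__(self):
        self.counts = Counter()

    def on_instruction(self, inst):
        self.counts[inst.opcode] += 1


class MemoryChecker:
    """Correctness observer: flags unaligned and out-of-bounds accesses
    against a single allocation [base, base + length)."""

    def __init__(self, base, length):
        self.base = base
        self.length = length
        self.errors = []

    def on_instruction(self, inst):
        if inst.address is None:
            return  # not a memory operation
        if inst.address % inst.size != 0:
            self.errors.append(("unaligned", inst))
        end = inst.address + inst.size
        if inst.address < self.base or end > self.base + self.length:
            self.errors.append(("out_of_bounds", inst))


def run_trace(instructions, observers):
    """Stand-in for an emulator loop: notify every observer per instruction."""
    for inst in instructions:
        for observer in observers:
            observer.on_instruction(inst)


# Made-up trace: one valid load, one arithmetic op, one unaligned store,
# and one load that starts just past the end of the allocation.
trace = [
    Instruction("ld.global.f32", address=0x1000, size=4),
    Instruction("add.f32"),
    Instruction("st.global.f32", address=0x1002, size=4),
    Instruction("ld.global.f32", address=0x1100, size=4),
]
histogram = OpcodeHistogram()
checker = MemoryChecker(base=0x1000, length=0x100)
run_trace(trace, [histogram, checker])

print(dict(histogram.counts))
print([kind for kind, _ in checker.errors])
```

Keeping the observers separate from the execution loop is what lets metrics and checks be added as independent modules without touching the emulator itself, which is the property the feature list describes.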
Ocelot is available under a BSD license at http://code.google.com/p/gpuocelot.
Barra, developed by Sylvain Collange, Marc Daumas, David Defour and David Parello from Université de Perpignan, simulates CUDA programs at the assembly language level (the native NVIDIA Tesla ISA). Its ultimate goal is to provide a 100% bit-accurate simulation, offering bug-for-bug compatibility with NVIDIA G80-based GPUs. It works directly with CUDA executables; neither source modification nor recompilation is required. Barra is primarily intended as a tool for research on computer architecture, although it can also be used to debug, profile and optimize CUDA programs at the lowest level. For more details and downloads, see the Barra wiki. A technical report is also available.
From the abstract:
Modern GPUs provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data-level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA’s CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA’s parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
Ali Bakhoda, George L. Yuan, Wilson W.L. Fung, Henry Wong and Tor M. Aamodt: “Analyzing CUDA Workloads Using a Detailed GPU Simulator”, 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). (GPGPU-Sim website)