As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
(Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Henri Bal: “A Detailed GPU Cache Model Based on Reuse Distance Theory”, in High Performance Computer Architecture (HPCA), 2014, [PDF])
From a recent press release:
The versatile Nema™ Platform for General-Purpose Computing on an embedded GPU (GPGPU) is designed by Think Silicon for excellent performance with ultra-low energy consumption and silicon footprint, and is available now from CAST, Inc.
Designed by graphics processing experts Think Silicon Ltd., the Nema GPU is a scalable, many-core, multi-threaded, state-of-the-art, data processing design blending both graphics rendering and general computing capabilities. It offers easy configuration, rapid programming, and straightforward system integration in a reusable soft IP core suitable for ASIC or FPGA implementation.
Read the rest of this entry »
Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program’s control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a “Thread Frontier”, which is a statically bounded region of the program
containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5-633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.
(Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu and Sudhakar Yalamanchili: “SIMD Re-convergence at Thread Frontiers”. 44th International Symposium on Microarchitecture (MICRO 44), 2011. [WWW])
For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs. Key insights of this article: Throughput-oriented processors tackle problems where parallelism is abundant, yielding design decisions different from more traditional latency oriented processors. Due to their design, programming throughput-oriented processors requires much more emphasis on parallelism and scalability than programming sequential processors. GPUs are the leading exemplars of modern throughput-oriented architecture, providing a ubiquitous commodity platform for exploring throughput-oriented programming.
(Michael Garland and David B. Kirk, “Understanding throughput-oriented architectures”, Commununications of the ACM 53(11), 58-66, Nov. 2010. [DOI])