Intel Releases Knights Corner

June 2nd, 2010

At ISC’10, Intel demonstrated their co-processor approach to HPC (formerly known as Larrabee, now codenamed Knights Corner). A prototype of the Intel Many Integrated Core (MIC) architecture with 32 in-order cores, each equipped with a 512-wide vector unit and connected via an on-chip coherent cache, delivered more than half a Teraflop performance for LU decomposition in a live demonstration during a keynote by Kirk Skaugen.

The full press release from ISC’10 is available here.

Analyzing CUDA Workloads Using a Detailed GPU Simulator

May 4th, 2009

From the abstract:

Modern GPUs provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data-level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA’s CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA’s parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of  the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

Ali Bakhoda, George L. Yuan, Wilson W.L. Fung, Henry Wong and Tor M. Aamondt: Analyzing CUDA Workloads Using a Detailed GPU Simulator, 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). (GPGPU-Sim website)

SIAM CSE’09: Scientific Computing on Emerging Many-Core Architectures

March 17th, 2009

Slides are now available for the minisymposium “Scientific Computing on Emerging Many-Core architectures”, held in conjunction with the SIAM Conference on Computational Science and Engineering 2009 (SIAM CSE’09, Miami, Florida). The minisymposium, organised by Mike Giles, Dominik Göddeke and Stefan Turek, focused on opportunities and challenges for scientific computing on novel many-core architectures, in particular IBM’s Cell processor and GPUs from NVIDIA, AMD and Intel. The talks covered a range of application areas, including the development of libraries and other tools to simplify the programming many-core processors. (Minisymposium: Scientific Computing on Emerging Many-Core architectures)

Workshop: Data-Parallel Programming Models for Many-Core Architectures

March 7th, 2007

Data-parallel programming models are emerging as an extremely attractive model for parallel programming, driven by several factors. Through deterministic semantics and constrained synchronization mechanisms, they provide race-free parallel-programming semantics. Furthermore, data-parallel programming models free programmers from reasoning about the details of the underlying hardware and software mechanisms for achieving parallel execution and facilitate effective compilation. Finally, efforts in the GPGPU movement and elsewhere have matured implementation technologies for streaming and data-parallel programming models to the point where high performance can be reliably achieved.

This workshop gathers commercial and academic researchers, vendors, and users of data-parallel programming platforms to discuss implementation experience for a broad range of many-core architectures and to speculate on future programming-model directions. Participating institutions include AMD, Electronic Arts, Intel, Microsoft, NVIDIA, PeakStream, RapidMind, and The University of New South Wales. (Link to Call for Participation, Data-Parallel Programming Models for Many-Core Architectures)