New Embedded GPU Platform for General-Purpose Computing Delivers the Highest Performance per Energy or Area

March 5th, 2014

From a recent press release:

The versatile Nema™ Platform for General-Purpose Computing on an embedded GPU (GPGPU) is designed by Think Silicon for excellent performance with ultra-low energy consumption and silicon footprint, and is available now from CAST, Inc.

Designed by graphics processing experts Think Silicon Ltd., the Nema GPU is a scalable, many-core, multi-threaded, state-of-the-art, data processing design blending both graphics rendering and general computing capabilities. It offers easy configuration, rapid programming, and straightforward system integration in a reusable soft IP core suitable for ASIC or FPGA implementation.

NVIDIA Kepler GK110 Architecture White Paper

May 20th, 2012

NVIDIA Kepler GK110 Die Shot

This white paper describes the new Kepler GK110 Architecture from NVIDIA.

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla® and the HPC market.

Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60‐65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

The paper describes features of the Kepler GK110 architecture, including

  • Dynamic Parallelism;
  • Hyper-Q;
  • Grid Management Unit;
  • NVIDIA GPUDirect™;
  • New SHFL instruction and atomic instruction enhancements;
  • New read-only data cache previously only accessible to texture;
  • Bindless Textures;
  • and much more.
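
To give a feel for what the SHFL (shuffle) instruction in the list above enables, here is a minimal sketch that emulates a shuffle-based warp reduction in plain Python. It is an illustration only, not NVIDIA's implementation: the names `shfl_down` and `warp_reduce_sum` are hypothetical, a list of 32 values stands in for the registers of the 32 threads in a warp, and the behaviour for out-of-range lanes follows the CUDA shuffle-down intrinsic as commonly documented. The point is that threads exchange register values directly, without going through shared memory.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def shfl_down(values, delta, width=WARP_SIZE):
    """Emulate a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source lane would fall outside the warp keep their own value.
    """
    return [values[i + delta] if i + delta < width else values[i]
            for i in range(width)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the full sum."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane adds the value held by the lane `offset` positions above it.
        shifted = shfl_down(lanes, offset)
        lanes = [own + other for own, other in zip(lanes, shifted)]
        offset //= 2
    return lanes[0]


if __name__ == "__main__":
    print(warp_reduce_sum(range(32)))  # prints 496
```

The loop runs five times (offsets 16, 8, 4, 2, 1), so a 32-element sum takes five exchange steps. Only lane 0 is guaranteed to hold the complete sum at the end; the upper lanes accumulate partial values that are simply ignored.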

An Analysis of the GPU Market

September 10th, 2011

From the abstract of a GPU market analysis whitepaper by Jon Peddie Research:

Computer graphics is hard work. Behind the images you see in games and movies, or while editing photos or video, some serious processing is taking place. All the processing power you can muster is needed to push and polish pixels. And this task is only going to get more demanding as these applications get more sophisticated. Graphics Processing Units (GPUs), which do the heavy lifting in computer graphics, range greatly in size, price and performance. They span from tiny cores inside an ARM processor (such as Nvidia’s Tegra or Qualcomm’s Snapdragon), to graphics integrated within an X86 processor (such as AMD’s Fusion, Intel’s Sandy Bridge), to a standalone discrete device, or dGPU (such as AMD’s Radeon, or Nvidia’s GeForce).

Understanding throughput-oriented architectures

November 24th, 2010


For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs. Key insights of this article: Throughput-oriented processors tackle problems where parallelism is abundant, yielding design decisions different from more traditional latency oriented processors. Due to their design, programming throughput-oriented processors requires much more emphasis on parallelism and scalability than programming sequential processors. GPUs are the leading exemplars of modern throughput-oriented architecture, providing a ubiquitous commodity platform for exploring throughput-oriented programming.

(Michael Garland and David B. Kirk, “Understanding throughput-oriented architectures”, Communications of the ACM 53(11), 58-66, Nov. 2010. [DOI])

State-of-the-Art in Heterogeneous Computing

May 13th, 2010


Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

(A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli: “State-of-the-Art in Heterogeneous Computing”, Scientific Programming, IOS Press, 18(1) (2010), pp. 1-33. Link to PDF)

NVIDIA Launches First Fermi GPUs, the GeForce GTX 400 series

March 31st, 2010

The first GPUs to feature NVIDIA’s new Fermi architecture, the GeForce GTX 480 and 470 GPUs have 480 and 448 CUDA cores, respectively.  From an NVIDIA press release:

SANTA CLARA, California—March 29, 2010—Hot off the heels of PAX East, the consumer gaming show held this past weekend in Boston, NVIDIA today officially launched its new flagship graphics processors, the NVIDIA® GeForce® GTX 480 and GeForce GTX 470.

The top-of-the line in a new family of enthusiast-class GPUs, the GeForce GTX 480 was designed from the ground up to deliver the industry’s most potent tessellation performance, which is the key component of Microsoft’s DirectX 11 development platform for PC games. Tessellation allows game developers to take advantage of the GeForce GTX 480 GPU’s ability to increase the geometric complexity of models and characters to deliver far more realistic and visually compelling gaming environments.

The GeForce GTX 480 is joined by the GeForce GTX 470 as the first products in NVIDIA’s Fermi line of consumer products. They will be available in mid-April, from the world’s leading add-in card partners and PC system builders. The remainder of the GeForce 400-series lineup will be announced in the coming months, filling out additional performance and price segments.

The GeForce GTX 480 and GTX 470 GPUs bring a host of new gaming features never before offered for the PC – including support for real-time ray tracing and NVIDIA 3D Vision™ Surround for truly immersive widescreen, stereoscopic 3D gaming.

NVIDIA Announces Next-Generation CUDA GPU Architecture – Codenamed “Fermi”

October 1st, 2009

On September 30th NVIDIA unveiled its latest GPU architecture, codenamed “Fermi”.  The first Fermi GPUs will contain 512 “CUDA Cores”, capable of more than 8x the double precision floating-point throughput of its predecessor, the GT200 GPU.  The GPU also incorporates error correcting (ECC) memories and caches, a new cache hierarchy, increased shared memory and register file sizes, and the ability to execute C++ programs.

From the press release:

SANTA CLARA, Calif. -Sep. 30, 2009- NVIDIA Corp. today introduced its next generation CUDA™ GPU architecture, codenamed “Fermi”. An entirely new ground-up design, the “Fermi”™ architecture is the foundation for the world’s first computational graphics processing units (GPUs), delivering breakthroughs in both graphics and GPU computing.

“NVIDIA and the Fermi team have taken a giant step towards making GPUs attractive for a broader class of programs,” said Dave Patterson, director Parallel Computing Research Laboratory, U.C. Berkeley and co-author of Computer Architecture: A Quantitative Approach. “I believe history will record Fermi as a significant milestone.”

Presented at the company’s inaugural GPU Technology Conference, in San Jose, California, “Fermi” delivers a feature set that accelerates performance on a wider array of computational applications than ever before. Joining NVIDIA’s press conference was Oak Ridge National Laboratory, who announced plans for a new supercomputer that will use NVIDIA® GPUs based on the “Fermi” architecture. “Fermi” also garnered the support of leading organizations including Bloomberg, Cray, Dell, HP, IBM and Microsoft.

ATI Radeon™ HD 5800 Series Announced By AMD

October 1st, 2009

AMD announced its latest ATI Radeon™ series of graphics cards on September 23rd. The new GPUs boast up to 2.72 TFLOP/s of single-precision floating point throughput, along with DirectX® 11 graphics (including DirectCompute) and OpenCL 1.0 support.

From the press release:

AMD (NYSE: AMD) today launched the most powerful processor ever created, found in its next-generation graphics cards, the ATI Radeon™ HD 5800 series graphics cards, and the world’s first and only to fully support Microsoft DirectX® 11, the new gaming and compute standard shipping shortly with Microsoft Windows® 7 operating system. Boasting up to 2.72 TeraFLOPS of compute power, the ATI Radeon™ HD 5800 series effectively doubles the value consumers can expect of their graphics purchases, delivering twice the performance-per-dollar of previous generations of graphics products. AMD will initially release two cards: the ATI Radeon HD 5870 and the ATI Radeon HD 5850, each with 1GB GDDR5 memory. With the ATI Radeon™ HD 5800 series of graphics cards, PC users can expand their computing experience with ATI Eyefinity multi-display technology, accelerate their computing experience with ATI Stream technology, and dominate the competition with superior gaming performance and full support of Microsoft DirectX® 11, making it a “must-have” consumer purchase just in time for Microsoft Windows® 7 operating system.
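
As a rough sanity check on the 2.72 TeraFLOPS figure quoted above, peak single-precision throughput is usually estimated as ALU count × clock rate × operations per ALU per cycle. The sketch below assumes the commonly published Radeon HD 5870 specifications of 1600 stream processors at 850 MHz, with one multiply-add (counted as 2 floating-point operations) per stream processor per cycle. None of these numbers appear in the press release excerpt, and the function name `peak_tflops` is hypothetical.

```python
def peak_tflops(alus, clock_mhz, flops_per_cycle=2):
    """Theoretical peak throughput in TFLOP/s.

    alus            -- number of ALUs (stream processors)
    clock_mhz       -- core clock in MHz
    flops_per_cycle -- floating-point ops per ALU per cycle (2 for a multiply-add)
    """
    return alus * clock_mhz * 1e6 * flops_per_cycle / 1e12


if __name__ == "__main__":
    # Assumed HD 5870 figures: 1600 stream processors at 850 MHz.
    print(f"{peak_tflops(1600, 850):.2f} TFLOP/s")  # prints 2.72 TFLOP/s
```

This is a theoretical ceiling that assumes every ALU issues a multiply-add on every cycle; sustained throughput on real workloads is lower.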

Larrabee: A Many-Core x86 Architecture for Visual Computing

August 12th, 2008


This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee’s potential for a broad range of parallel computation.

(Larrabee: A Many-Core x86 Architecture for Visual Computing. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. Proceedings of SIGGRAPH 2008.)


NVIDIA Tesla wins PC Magazine Technical Excellence Award

December 14th, 2007

NVIDIA’s new Tesla GPU Computing line of GPUs has won a PC Magazine Technical Excellence Award in the Component category. From the PC Magazine article:

Sure, you know GPUs, but have you heard of GPGPUs? The concept is simple: Use the massively parallel architecture of the graphics processor for general-purpose computing tasks. Because of that parallelism, ordinary calculations can be dramatically sped up. To create the Tesla, its powerful new entry into this market, NVIDIA has bundled multiple GPUs (without video connectors!) into either a board or a desk-side box that offers near-supercomputer levels of single-precision floating-point operations. The general-purpose GPU (thus the acronym GPGPU) is being used as a high-performance coprocessor for climate modeling, oil and gas exploration, and other applications—and it’s much cheaper than a supercomputer. The Tesla even comes complete with its own C compiler and tools.