Legacy GPGPU Frequently Asked Questions

This is the legacy “official GPGPU Frequently Asked Questions”.  This document was last updated years ago.  While much of it may still be applicable to GPGPU programming using graphics APIs, it is considerably out of date.

General-Purpose computation on Graphics Processing Units

OpenGL Resources:

Direct3D Resources:

You can get lots of information about the architecture of NVIDIA GeForce 6 series (NV4x) hardware from Chapter 30 of GPU Gems 2 and from 3DCenter’s “Inside NVIDIA NV40” article. NVIDIA GeForce 7 series (G70) hardware is similar to NV4x in terms of basic structure and capabilities.

Older NVIDIA GeForce FX series (NV3x) hardware is described in a number of articles, including 3DCenter’s “CineFX Inside” article.

TODO: Add information about ATI Radeon 9×00 and x800 GPUs.

Mark Harris’s chapter from GPU Gems 2, “Chapter 31: Mapping Computational Concepts to GPUs”, describes the various basic GPU programming constructs and relates them to CPU programming idioms with which you are likely already familiar. Look there for an excellent overview of the answers to this question.

Current GPGPU programming must use graphics APIs: OpenGL and Direct3D. OpenGL tends to be favored in the academic community due to the platform portability it allows and due to its extension mechanism, which lets vendors add new features to the API as soon as the hardware supports those features (rather than waiting on Microsoft to release a new version of DirectX). DirectX/Direct3D, on the other hand, tends to be favored in the computer game industry, where dependence on Windows is not a particular impediment. In practice, either API works perfectly well for GPGPU, and which one you use is simply a matter of personal preference.

A few GPGPU-friendly streaming languages, such as Brook, Sh, and Microsoft’s Accelerator, have been developed to insulate developers from the graphics APIs as much as possible. Brook is actively supported in the GPGPU.org Forums. Sh has evolved into a commercial effort (with a very unrestricted academic license) called RapidMind, targeting multicore CPUs, the Cell, and GPUs with one programming model.

NVIDIA’s CUDA (Compute Unified Device Architecture) is the GeForce 8 Series’ API for GPGPU programming. Because it is not a graphics-based API, it imposes fewer constraints on programs and allows for cleaner code. AMD’s CTM (Close To the Metal) is an approach that enables low-level, efficient GPU programming without any graphics overhead. The Brook compiler (the CVS version) has a CTM backend.

Please visit GPGPU.org Developer Resources.

Current GPUs generally operate on 32-bit floating point values which are stored in the IEEE 754 single-precision standard format. However, while the storage format is the same, the arithmetic operations performed by the GPU might not behave exactly as per the IEEE 754 spec. See the FAQ entry on GPU floating point precision for more information on this.

NVIDIA and ATI GPUs also provide a half-precision floating point (fp16) data storage format. Some NVIDIA GPUs provide specialized fp16 and fixed-precision (fx12) arithmetic in addition to the fp16 storage format. Use of these reduced-precision formats on either NVIDIA or ATI GPUs can provide speedups in certain situations (see the NVIDIA GPU Programming Guide for details of NVIDIA floating point performance).

No current GPUs provide double-precision floating point storage or arithmetic, though several research projects have sought to emulate double-precision.

General 32-bit integer storage formats and integer arithmetic have been added with the G80 and R600 chips.

…to read from?

Buffers used for reading on the GPU are referred to as textures. You can create one in OpenGL using glTexImage2D() (for 2D textures). To efficiently download data from the CPU into an existing 2D texture, use glTexSubImage2D(). Textures are limited to certain sizes (up to 4096 values in each dimension, additionally limited by GPU memory constraints), so most applications rely on a 1D to 2D mapping: A 1D array on the CPU of length N becomes a sqrt(N)xsqrt(N) texture on the GPU.

…to write (render) to (Render-to-texture)?

An OpenGL extension called EXT_framebuffer_object allows you to take a texture object (created with glTexImage2D() as above) and “attach” it to a framebuffer (a renderable entity). This provides a lightweight mechanism to “divert” the output of a rendering / computation pass away from the framebuffer for immediate display, but into a texture which can be used again as input for subsequent passes. See Aaron Lefohn’s framebuffer object sourcecode for a framebuffer object class and example application.

An older OpenGL extension called “pbuffers” provided a more heavy-weight way to accomplish all of this.

A program that uses Direct3D can create a texture, specifying that it will be used as a render target, and then render into it by calling SetRenderTarget().

… to both read and write?

Texture read operations are routed through a cache structure to allow efficient exploitation of reference locality. Read operations also factor into shader thread scheduling: a read that misses the cache blocks its thread and causes other ready threads to run while the memory read completes. Writes are routed through different logic that does some amount of buffering to maximize the efficiency of writes. Memory reads and writes are block-oriented, with each transaction typically involving 32 bytes per memory chip, so maximizing the number of valid bytes in a transaction improves system performance. Finally, supporting read-modify-write operations for depth buffering or blending requires careful pipelining and scheduling of memory operations to maximize the efficiency of the memory system.

In current GPUs the subsystems for texture reading and pixel writing are independent and meet only at the shared memory controller. Trying to read and write the same memory locations concurrently therefore involves caches and write buffers that are unsynchronized with each other, and the result depends unpredictably on the current state of those caches and buffers. In some cases it may appear to work, for example when a read is followed by a write at sequential addresses, since this is unlikely to create a conflicting state in the cache and buffers. Random rather than sequential accesses, however, will invariably produce unintended results.

To work around this problem, separate read and write targets are used, swapping the read and write target (so that the write target is read next) after completing each “pass” (ping-ponging).

… and where can I read more about this?

GPU memory model overview: Siggraph 2005 slides by Aaron Lefohn

As explained above, Render-to-texture (RTT) is the GPU equivalent of a feedback loop. In the first pass, the output of a computation is directed to a “write-only” texture, and in the subsequent pass, this texture is bound “read-only” as input for the computation. The process of alternately reading and writing from such textures (or a double-buffered pBuffer) is called ping-ponging.

In Direct3D, this is done with SetRenderTarget(texture), in OpenGL, glDrawBuffer() is used for a texture attached to a EXT_framebuffer_object.

This typically involves the following steps:

  • Setting up the viewport for a 1:1 pixel-to-texel mapping.
  • Creating and binding textures containing the input data, downloading some data into these textures.
  • Binding a fragment program that serves as the “computational kernel”.
  • Rendering some geometry (usually a screen-sized quad to make sure the “kernel” gets executed once per pixel aka texel).
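Putting the steps above together with ping-ponging, the skeleton of a multipass computation looks roughly like the following sketch. The GL calls are shown as comments because they need a live context; the attachment bookkeeping, which is the part that is easy to get wrong, is real code:

```c
/* Two textures attached to one FBO; alternate reading and writing.
   Returns the index of the texture holding the final result, starting
   with tex[0] as the first input. (All names here are illustrative.) */
int run_passes(int num_passes) {
    int read_idx = 0, write_idx = 1;
    for (int pass = 0; pass < num_passes; ++pass) {
        /* 1. glViewport(0, 0, width, height);   1:1 pixel-to-texel mapping */
        /* 2. bind tex[read_idx] as the input texture                       */
        /* 3. direct output to the write target:
              glDrawBuffer(GL_COLOR_ATTACHMENT0_EXT + write_idx);           */
        /* 4. bind the fragment program and draw a screen-sized quad        */

        /* swap so the result just written becomes the next pass's input */
        int tmp = read_idx;
        read_idx = write_idx;
        write_idx = tmp;
    }
    return read_idx;  /* after the final swap, this is the last write target */
}
```

Note that GL_COLOR_ATTACHMENT0_EXT + 1 equals GL_COLOR_ATTACHMENT1_EXT; the attachment enumerants are consecutive, which is what makes the index-based swap convenient.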

Most GPGPU results are computed through a Render-to-texture approach. Further processing depends on the application; generally, there are two ways to proceed:

  • Read back the texture that contains the results to the CPU using OpenGL’s glReadPixels() or Direct3D’s GetRenderTargetData(). Note that you must first attach the texture as a renderable surface before initiating the readback (or else use glGetTexImage() instead of glReadPixels()).
  • Use the texture to render a textured screen-sized quad for display.
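For the readback path, the CPU side mainly needs a correctly sized destination buffer; a sketch, with the GL calls commented out since they need a context (the helper names are illustrative):

```c
#include <stdlib.h>

/* Bytes needed to read back a w-by-h RGBA float render target. */
size_t readback_bytes(int w, int h) {
    return (size_t)w * (size_t)h * 4 * sizeof(float);  /* 4 channels */
}

float *alloc_readback(int w, int h) {
    return (float *)malloc(readback_bytes(w, h));
}

/* With the FBO bound and the result texture attached:
   glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);            // select the attachment
   glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, out);  // blocking transfer    */
```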

Please refer to this tutorial.

There are many possible reasons for this; basically “something is messed up.”

OpenGL silently stores errors internally until you call glGetError(), which returns the last error that happened (or GL_NO_ERROR). During development, it’s a good idea to call glGetError() after every chunk of code, since a bad parameter can mess up almost any call, which then has no effect aside from internally storing the error. This can be frustrating, because it’s tough to figure out what’s happening when part of your code is totally ignored.

Another common reason is that your texture objects are not “texture complete”, meaning that some of the GL state for them is incorrect. See glTexParameteri(). One common problem is that the default texture minification mode (GL_TEXTURE_MIN_FILTER) is GL_NEAREST_MIPMAP_LINEAR, so if you don’t allocate mipmaps for a texture it is by default texture incomplete.

Other common problems are in matrix setup (geometry is being drawn, but misses the screen), Z-buffer setup (fragments are outside valid depth range), alpha blending setup (many high-precision framebuffer formats do not support blending, so use glDisable(GL_BLEND)), scissor, stencil, or viewport tests.

In general it is useful to start very simple, get something on the screen, and then incrementally add features while keeping working backup copies. Then when something fails, you can compare the last-working and current versions. Also, changing your fragment shader to return a constant color (such as bright red, vec4(1,0,0,1)) can help determine if your code is even running.

This is a fairly standard concept in graphics called multitexturing. It works no differently in GPGPU than in regular graphics apps. In OpenGL, the functions of possible interest are glActiveTexture() and glMultiTexCoord*(). This was originally introduced by the ARB_multitexture extension and was later promoted into the core GL as of OpenGL 1.3.

You can accomplish this by calling glDrawBuffers(), which is a new function in OpenGL 2.0. It used to be called glDrawBuffersATI(), which was introduced by the ATI_draw_buffers OpenGL extension. This is called MRT (Multiple Render Targets), and is supported by ATI 9×00 and X8xx (and newer) hardware and NV4x (and newer) NVIDIA hardware.

Similar functionality is available in Direct3D9 using SetRenderTarget to set any of the MRTs.

The FBO extension makes it really easy and intuitive to write applications that write to multiple render targets. Unfortunately, FBO makes MRT so easy that people frequently forget that MRT must be explicitly enabled; simply attaching multiple color attachments to the FBO is not sufficient by itself.

A proper example combining FBO and MRT would be something like this:

Let’s assume you have textures attached to GL_COLOR_ATTACHMENT0_EXT and GL_COLOR_ATTACHMENT1_EXT:

// Set up tex0 and tex1 for render-to-texture
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fb);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_TEXTURE_2D, tex0, 0);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT1_EXT, GL_TEXTURE_2D, tex1, 0);

Then you just call glDrawBuffers(), passing in the color buffer names. Something like:

GLenum buffers[] = { GL_COLOR_ATTACHMENT0_EXT, GL_COLOR_ATTACHMENT1_EXT };
glDrawBuffers(2, buffers);

Then in your shader, just output to the first two color outputs (in Cg, those would be the ones bound to the COLOR0 and COLOR1 semantics).

Note that the COLOR0 and COLOR1 Cg semantics correspond to the first and second enumerants passed to glDrawBuffers, respectively. In this example it happens that COLOR0 maps to GL_COLOR_ATTACHMENT0_EXT and COLOR1 to GL_COLOR_ATTACHMENT1_EXT, but the correspondence comes from the order of the enumerants in buffers; COLOR0 is not inherently tied to GL_COLOR_ATTACHMENT0_EXT.

See also How do I write to more than one render target at once? above.

Current NVIDIA graphics hardware provides 32-bit (s23e8) floating point arithmetic that is very similar to the arithmetic specified by the IEEE 754 standard, but not quite the same. The storage format is the same (see above), but the arithmetic might produce slightly different results. For example, on NVIDIA hardware, some rounding is done slightly differently, and denormals are typically flushed to zero.

Current ATI hardware does all its floating point arithmetic at 24-bit precision (s15e8), even though it stores values in the IEEE standard 32-bit format.

Both NVIDIA and ATI provide a “half-precision” 16-bit (s10e5) floating point storage format; some NVIDIA GPUs can perform half-precision arithmetic more quickly than single-precision arithmetic.

No GPU currently provides double-precision storage or double-precision arithmetic natively in hardware. There are several ongoing efforts to emulate double precision through a single-double approach (doubling the mantissa) and CPU-GPU interplay.
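The building block behind such single-single (often called double-single) emulation is Knuth's exact two-sum, which recovers the rounding error of a floating point addition; a minimal sketch:

```c
/* Knuth's two-sum: computes s = fl(a + b) and the exact rounding error e,
   so that a + b == s + e exactly (assuming round-to-nearest, no overflow). */
void two_sum(float a, float b, float *s, float *e) {
    *s = a + b;
    float v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}
```

A pair (s, e) then represents a value with roughly twice the mantissa bits of a single float; a full emulation layers multiplication and renormalization on top of this primitive. Note that on the GPUs discussed here this trick is fragile precisely because their arithmetic is not strictly IEEE-compliant (see the rounding and denormal caveats above).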

For a look at the exact precision of various 32-bit GPU floating point operations (and an example program to test your own hardware), refer to Karl Hillesland’s whitepaper GPU Floating-Point Paranoia.

For a comprehensive guide to floating point arithmetic, refer to David Goldberg’s What Every Computer Scientist Should Know About Floating-Point Arithmetic.

The answer depends on which API it is (Direct3D or OpenGL) and on what kind of texture it is. Direct3D addresses texels at their top-left corner. OpenGL addresses texel centers.

In OpenGL, most texture coordinates are normalized to the range (0..1) in each dimension, with the exception that texture rectangles (GL_TEXTURE_RECTANGLE_ARB) use unnormalized coordinates ranging from (0..width, 0..height).

The centers of OpenGL texels along the diagonal of a GL_TEXTURE_RECTANGLE_ARB are therefore (0.5, 0.5), (1.5, 1.5), (2.5, 2.5), etc. (To see this try writing WPOS to the output of a shader on a screen-aligned quad.)

To get the corresponding texture coordinates for GL_TEXTURE_2D textures, you have to normalize them, meaning you divide by width and height. For a 4-by-4 GL_TEXTURE_2D, that would give us texel centers along the diagonal of (0.125,0.125), (0.375,0.375), (0.625,0.625), and (0.875,0.875).
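The general rule is that texel i of an N-texel dimension has its center at (i + 0.5)/N in normalized coordinates; a small helper (the name is illustrative):

```c
/* Normalized coordinate of the center of texel i in a dimension of n texels. */
float texel_center(int i, int n) {
    return (i + 0.5f) / (float)n;
}
```

For the 4-by-4 example above, this reproduces the diagonal centers 0.125, 0.375, 0.625, and 0.875.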

You need to apply the patches Andrew Wood posted on sourceforge.

  • Make sure to do enough iterations/passes of the bit of (rendering) code you want to clock, and take the average time.
  • Call glFinish() before you start the timer, and before you stop the timer.
  • Make sure to have the textures already in the GPU memory (e.g. by texturing a “dummy quad” beforehand).
  • Take special care of the data dependencies to make sure the optimizer inside the driver (which you cannot influence) doesn’t trick you.
  • Check the GPUbench source code for examples, most basic tests are already there.
  • If you are trying to time individual GL calls, be aware that most rendering calls are non-blocking, which can complicate your attempts to profile your app. Render time might be unfairly attributed to whichever call happens to fill up the render queue and therefore blocks while a flush of the queue occurs.