Ocelot, developed at Georgia Tech, seeks to develop a set of tools that enable the low level analysis of GPGPU applications as well a providing a JIT compiler for generic architectures. Ocelot currently provides an implementation of the NVIDIA CUDA runtime, capable of running the entire CUDA 2.2 and 2.1 SDKs.
Ocelot features include a memory checker similar to valgrind, detection mechanisms for non-coalesced memory accesses, full device emulation, and a number of useful debugging and performance tuning features. The Roadmap lists future developments.
Ocelot is available at google code, and a number of papers have been published.