Back Testing of HFT Strategies with Xcelerit and GPUs

July 26th, 2013

Algorithmic trading has become ever more popular in recent years – accounting for approximately half of all European and American stock trades placed in 2012. The trading strategies need to be back-tested regularly using historical market data for calibration and to check the expected return and risk. This is a computationally demanding process that can take hours to complete. However, back-testing the strategies frequently intra-day can significantly increase the profits for the trading institution.
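
As a rough illustration of what a single back-test pass computes, the sketch below runs a long-only moving-average crossover rule over a price series and reports the mean per-step return and its standard deviation as stand-ins for expected return and risk. The strategy, the function names, the window lengths and the synthetic price generator are all assumptions made for this example; none of them come from the Xcelerit work, and a production back-test would use real historical tick data and far more runs.

```python
import random
import statistics


def moving_average(prices, window, end):
    """Mean of the `window` prices ending at index `end` (inclusive)."""
    return sum(prices[end - window + 1 : end + 1]) / window


def backtest_crossover(prices, fast=5, slow=20):
    """Per-step returns of a long-only moving-average crossover (hypothetical strategy)."""
    returns = []
    for t in range(slow - 1, len(prices) - 1):
        # The signal only uses prices up to time t, so there is no look-ahead.
        in_market = moving_average(prices, fast, t) > moving_average(prices, slow, t)
        step_return = prices[t + 1] / prices[t] - 1.0
        returns.append(step_return if in_market else 0.0)
    return returns


def synthetic_prices(n, seed=0):
    """Illustrative random-walk prices; a real back-test replays historical data."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))
    return prices


prices = synthetic_prices(500)
step_returns = backtest_crossover(prices)
expected_return = statistics.mean(step_returns)  # average return per step
risk = statistics.pstdev(step_returns)           # volatility of per-step returns
print(f"steps={len(step_returns)} mean={expected_return:.6f} risk={risk:.6f}")
```

Each parameter combination (here `fast` and `slow`) is an independent run over the same data, which is why calibration sweeps of this kind parallelize well.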

CUDA 5 Production Release Now Available

October 15th, 2012

The CUDA 5 Production Release is now available as a free download at
This powerful new version of the pervasive CUDA parallel computing platform and programming model can be used to accelerate more applications using the following four (and many more) new features.

• CUDA Dynamic Parallelism brings GPU acceleration to new algorithms by enabling GPU threads to directly launch CUDA kernels and call GPU libraries.
• A new device code linker enables developers to link external GPU code and build libraries of GPU functions.
• NVIDIA Nsight Eclipse Edition enables you to develop, debug and optimize CUDA code all in one IDE for Linux and Mac OS.
• GPUDirect Support for RDMA provides direct communication between GPUs in different cluster nodes.

As a demonstration of the power of Dynamic Parallelism and device code linking, CUDA 5 includes a device-callable version of the CUBLAS linear algebra library, so threads already running on the GPU can invoke CUBLAS functions on the GPU.

Libra 1.2 includes new OpenCL back end

June 8th, 2010

GPU Systems has added an OpenCL back end implementation to its Libra Technology compiler and runtime architecture. Libra version 1.2 now supports x86/x64, OpenGL/OpenCL and CUDA compute back ends. The OpenCL back end generates dynamic code specifically for AMD GPUs. Also, the CUDA back end generator has been enhanced with Fermi capabilities, and this release brings full BLAS 1, 2 and 3 standard math library functionality (matrix, vector, dense, sparse, complex, single and double precision), accessible through a standard C programming interface and library. The high-level approach of the Libra API enables developers to easily extend existing high-level functionality from their favorite programming language.

CUDA 3.0 toolkit released

March 20th, 2010

NVIDIA has released version 3.0 of the CUDA Toolkit, providing developers with tools to prepare for the upcoming Fermi-based GPUs. Highlights of this release include:

  • Support for the new Fermi architecture, with:
    • Native 64-bit GPU support
    • Multiple Copy Engine support
    • ECC reporting
    • Concurrent Kernel Execution
    • Fermi HW debugging support in cuda-gdb
    • Fermi HW profiling support for CUDA C and OpenCL in Visual Profiler
  • C++ Class Inheritance and Template Inheritance support for increased programmer productivity
  • A new unified interoperability API for Direct3D and OpenGL, with support for:
    • OpenGL texture interop
    • Direct3D 11 interop support
    • CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime such as CUFFT and CUBLAS.

Intel acquires RapidMind

August 23rd, 2009

Intel has acquired RapidMind, the company behind the RapidMind (formerly Sh) programming environment targeting multicore CPUs, AMD and NVIDIA GPUs and the Cell processor. The RapidMind Platform continues to be available, including support. In the medium term RapidMind’s technology and products will be integrated with Intel’s data-parallel products, in particular Intel’s Ct technology.

This blog entry by James Reinders from Intel describes the acquisition and future plans in more detail.

Equalizer 0.9

August 17th, 2009

Equalizer 0.9, a framework for creating and deploying parallel, scalable OpenGL applications, has been released. The most notable new features in this release are:

  • Automatic cross-segment load-balancing for multidisplay installations
  • Dynamic Frame Resolution (DFR) for constant-framerate rendering
  • Compression Plugin API for runtime-loadable image compression engines

See the 0.9 release notes on the Equalizer website for a comprehensive list of new features, enhancements, optimizations and bug fixes. A paperback Equalizer Programming and User Guide is available. Commercial support, custom software development and porting services are available from Eyescale Software GmbH.

NVIDIA CUDA Toolkit and SDK version 2.3 Released

July 22nd, 2009

NVIDIA announced today it has released version 2.3 of the CUDA Toolkit and SDK for GPU Computing. This latest release supports several significant new features that deliver a major leap forward in getting the most performance out of NVIDIA’s massively parallel CUDA-enabled GPUs. This release of the CUDA Toolkit includes performance improvements and expanded support for the cuda-gdb hardware debugger.

Additional new features in CUDA Toolkit 2.3 include:

  • The CUFFT Library now supports double-precision transforms and includes significant performance improvements for single-precision transforms as well.  See the CUDA Toolkit release notes for details.
  • The CUDA-GDB hardware debugger and CUDA Visual Profiler are now included in the CUDA Toolkit installer, and the CUDA-GDB debugger is now available for all supported Linux distros.  (see below)
  • Each GPU in an SLI group is now enumerated individually, so compute applications can now take advantage of multi-GPU performance even when SLI is enabled for graphics.
  • The 64-bit versions of the CUDA Toolkit now support compiling 32-bit applications. (See the release notes for details, including changes to LD_LIBRARY_PATH on Linux)
  • New support for fp16 <-> fp32 conversion intrinsics allows storage of data in fp16 format with computation in fp32. The fp16 format is ideal for applications that require a higher numerical range than 16-bit integers but less precision than fp32, and it reduces memory space and bandwidth consumption.
  • The CUDA SDK has also been updated.
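
The fp16 storage trade-off mentioned in the list above can be illustrated on the host with a minimal sketch. It does not use the CUDA intrinsics themselves; it relies on the IEEE 754 half-precision format code `e` in Python's standard `struct` module, and the helper names and sample values are assumptions chosen for this example. Python floats are double precision, so they only stand in for the fp32 compute type here.

```python
import struct


def to_fp16_bytes(values):
    """Pack floats into half-precision storage, 2 bytes per value."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(buf):
    """Unpack half-precision storage into Python floats for computation."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


# 40000.0 lies beyond the signed 16-bit integer maximum (32767) yet fits in fp16.
samples = [0.1, 1.5, 40000.0]

half_storage = to_fp16_bytes(samples)                            # stored as fp16
single_storage = struct.pack(f"<{len(samples)}f", *samples)      # stored as fp32

restored = from_fp16_bytes(half_storage)   # widen before doing arithmetic
total = sum(restored)                      # arithmetic runs at higher precision

print(len(half_storage), len(single_storage))  # half the bytes of fp32 storage
print(restored)
```

Values such as 0.1 come back slightly rounded, which is the precision cost paid for halving the memory footprint and bandwidth.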

Libra SDK: C/C++ for both the CPU and GPU

June 24th, 2009

GPU Systems has announced the Libra SDK, a robustly equipped C/C++ developer kit for fast and easy cross CPU-GPU access suited for scientific computations. The Libra 1.1 SDK includes a C/C++ Matlab-style API, sample programs and documentation. A downloadable trial version of Libra is available from the GPU Systems website, and a Libra demo presentation is also available.

Message Passing on GPUs and Data-Parallel Architectures

March 11th, 2009


This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the “DCGN” API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU based MPI implementations while providing fully-dynamic communication.

(Jeff A. Stuart and John D. Owens, Message Passing on Data-Parallel Architectures, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium)
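
The sleep-based polling idea from the abstract can be sketched with ordinary CPU threads. This is not the DCGN API and does not run on a GPU: the `Mailbox` class, its method names, the rank numbers and the polling interval are all hypothetical, chosen only to show how a receiver can repeatedly check a shared message store and sleep between checks instead of blocking on a signal.

```python
import threading
import time


class Mailbox:
    """Shared message store keyed by (source rank, destination rank)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = {}

    def send(self, src, dst, payload):
        """Store a message; the sender never waits for the receiver."""
        with self._lock:
            self._messages.setdefault((src, dst), []).append(payload)

    def try_recv(self, src, dst):
        """Return (True, payload) if a message is waiting, else (False, None)."""
        with self._lock:
            pending = self._messages.get((src, dst))
            if pending:
                return True, pending.pop(0)
            return False, None

    def recv(self, src, dst, poll_interval=0.001, timeout=5.0):
        """Poll for a message, sleeping between attempts to limit busy-waiting."""
        deadline = time.monotonic() + timeout
        while True:
            found, payload = self.try_recv(src, dst)
            if found:
                return payload
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no message from rank {src} to rank {dst}")
            time.sleep(poll_interval)


def run_demo():
    """Rank 0 sends a value to rank 1, which doubles it and sends it back."""
    box = Mailbox()

    def rank1():
        value = box.recv(src=0, dst=1)
        box.send(src=1, dst=0, payload=value * 2)

    worker = threading.Thread(target=rank1)
    worker.start()
    box.send(src=0, dst=1, payload=21)
    reply = box.recv(src=1, dst=0)
    worker.join()
    return reply


print(run_demo())
```

The polling interval trades latency against wasted cycles: a shorter sleep delivers messages sooner but spends more time checking an empty store.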

GPU Programming For The Rest Of Us

March 11th, 2009

This article by Jeff Layton at ClusterMonkey summarizes the history of GPU Computing in terms of high-level programming languages and abstractions, from the early days of GPGPU programming using graphics APIs, to Stream, CUDA and OpenCL. The second half of the article provides an introduction to the PGI 8.0 Technology Preview, which allows the use of pragmas to automatically parallelize and run compute-intensive kernels in standard C and Fortran code on accelerators like GPUs. (GPU Programming For the Rest Of Us, Jeff Layton,