CULA Sparse Now Available

November 10th, 2011

EM Photonics has released CULA Sparse, a ready-to-integrate package for solving sparse linear systems. Features include:

  • Interfaces: C, C++, Fortran, Matlab, Python
  • Platforms: all CUDA platforms, including Linux, Windows, and OS X
  • Solvers: BiCG, BiCGStab, CG, GMRES, MINRES; preconditioners: Jacobi, ILU(0)
  • Data formats: COO, CSR, CSC in double precision real and complex floating point
  • No CUDA programming experience required.

More information is available at http://www.culatools.com/sparse.
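
To make the feature list concrete, here is a minimal sketch of what one of the listed solver and format combinations does: conjugate gradient (CG) with a Jacobi preconditioner on a matrix stored in CSR form. It is plain CPU Python written for illustration and is not the CULA Sparse API; the function names (`csr_matvec`, `csr_diagonal`, `jacobi_pcg`) and the small test matrix are assumptions made up for this example.

```python
# Illustrative sketch only: pure Python on the CPU, NOT the CULA Sparse API.
# All function names here are hypothetical.

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def csr_diagonal(values, col_idx, row_ptr):
    """Extract the main diagonal of a CSR matrix."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    for row in range(n):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if col_idx[k] == row:
                diag[row] = values[k]
    return diag


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(values, col_idx, row_ptr, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A (CSR) using CG
    with a Jacobi (diagonal) preconditioner."""
    n = len(b)
    inv_diag = [1.0 / d for d in csr_diagonal(values, col_idx, row_ptr)]
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x


# A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]] in CSR form.
values = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [6.0, 10.0, 8.0]          # equals A @ [1, 2, 3]
solution = jacobi_pcg(values, col_idx, row_ptr, b)
```

CSR keeps only the nonzero entries plus two index arrays, which is why the matrix-vector product touches each stored value exactly once. A GPU library would typically parallelize that product across rows.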

Call for papers: CIGPU 2012, Brisbane, Australia, 10-15 June 2012

November 10th, 2011

Submissions are invited for the fifth special session on Computational Intelligence on Consumer Games and Graphics Hardware (CIGPU-2012) to be held in Brisbane, Australia as part of the IEEE World Congress on Computational Intelligence, 10-15 June 2012. More information can be found at http://www.cs.ucl.ac.uk/staff/W.Langdon/cigpu/.

MIDeA: A Multi-Parallel Intrusion Detection Architecture

November 3rd, 2011

Abstract:

Network intrusion detection systems are faced with the challenge of identifying diverse attacks in extremely high-speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly complex per-packet and per-flow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high-speed networks. To cope with the increased processing throughput requirements, our system parallelizes network traffic processing and analysis at three levels, using multi-queue NICs, multiple CPUs, and multiple GPUs. The proposed design avoids locking, optimizes data transfers between the different processing units, and speeds up data processing by mapping different operations to the processing units where they are best suited. Our experimental evaluation shows that our prototype implementation based on commodity off-the-shelf equipment can reach processing speeds of up to 5.2 Gbit/s with zero packet loss when analyzing traffic in a real network, whereas the pattern matching engine alone reaches speeds of up to 70 Gbit/s, which is almost a fourfold improvement over prior solutions that use specialized hardware.

(Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis: “MIDeA: A Multi-Parallel Intrusion Detection Architecture”, Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), Oct. 2011. [PDF])

23rd International Symposium on Computer Architecture and High Performance Computing – SBAC-PAD’2011

November 2nd, 2011

SBAC-PAD is an annual international conference series, the first of which was held in 1987. Each conference has traditionally presented new developments in high performance applications, as well as the latest trends in computer architecture and parallel and distributed technologies. Authors are invited to submit original manuscripts on a wide range of high-performance computing areas, including computer architecture, systems software, languages and compilers, algorithms, and applications. More information: http://sbac-pad-2011.lsc.ic.unicamp.br/

Parallelization and Characterization of Pattern Matching using GPUs

October 29th, 2011

Abstract:

Pattern matching is a highly computationally intensive operation used in a plethora of applications. Unfortunately, due to the ever increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. In this paper, we explore how the highly parallel computational capabilities of commodity graphics processing units (GPUs) can be exploited for high-speed pattern matching. We present the design, implementation, and evaluation of a pattern matching library running on the GPU, which can be used transparently by a wide range of applications to increase their overall performance. The library supports both string searching and regular expression matching on the NVIDIA CUDA architecture. We have also explored the performance impact of different types of memory hierarchies, and present solutions to alleviate memory congestion problems. The results of our performance evaluation using off-the-shelf graphics processors demonstrate that GPU-based pattern matching can reach tens of gigabits per second on different workloads.

(Giorgos Vasiliadis, Michalis Polychronakis and Sotiris Ioannidis: “Parallelization and Characterization of Pattern Matching using GPUs”, Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). November 2011. [PDF])
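
For readers unfamiliar with multi-pattern string searching, the following is a minimal sequential sketch of matching one input against a whole set of patterns in a single pass. The abstract does not say which algorithm the library uses; Aho-Corasick is assumed here only because it is a common choice for this task, and the function names (`build_automaton`, `find_all`) are hypothetical. This is CPU Python, not the GPU library described above.

```python
# Illustrative sketch only: sequential Aho-Corasick on the CPU.
# The algorithm choice and all names are assumptions, not the paper's code.
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and per-state outputs."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link per state
    out = [[]]       # patterns that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


hits = find_all("ushers", ["he", "she", "his", "hers"])
```

Each input character triggers a small number of table lookups regardless of how many patterns are loaded. That property is one reason automaton-based matching is often considered a good fit for data-parallel hardware, where many input streams can be scanned independently.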

Physically based lighting for volumetric data with Exposure Render

October 27th, 2011

Exposure Render is a direct volume rendering application that applies progressive Monte Carlo ray tracing, coupled with physically based light transport, to heterogeneous volumetric data. Exposure Render enables the configuration of any number of arbitrarily shaped area lights, models a real-world camera, including its lens and aperture, and incorporates complex materials, while still maintaining interactive display updates. It features both surface and volumetric scattering, and applies noise reduction to remove the unwanted startup noise associated with progressive Monte Carlo rendering. The complete implementation is available in source and binary forms under a permissive free software license.

SIMD Re-convergence at Thread Frontiers: A new method for handling branch divergence on GPUs

October 24th, 2011

Abstract:

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program’s control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a “Thread Frontier”, which is a statically bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5-633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.

(Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu and Sudhakar Yalamanchili: “SIMD Re-convergence at Thread Frontiers”. 44th International Symposium on Microarchitecture (MICRO 44), 2011. [WWW])

Efficient Synchronization Primitives for GPUs

October 22nd, 2011

Abstract:

In this paper, we revisit the design of synchronization primitives—specifically barriers, mutexes, and semaphores—and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention.

(Jeff A. Stuart and John D. Owens: “Efficient Synchronization Primitives for GPUs”, submitted October 2011. [ARXIV]).

rCUDA 3.1 Released

October 20th, 2011

The new version 3.1 of rCUDA (Remote CUDA), the Open Source package that allows performing CUDA calls to remote GPUs, is now available. Release highlights:

  • Fully updated API to CUDA 4.0 (added support for modules “Peer Device Memory Access” and “Unified Addressing”).
  • Fixed low level Surface Reference management functions.

For further information, please visit the rCUDA webpage at http://www.gap.upv.es/rCUDA.

Symscape Releases Caedium v3.0 with GPU Support

October 20th, 2011

The latest release of Symscape’s Caedium (v3.0) now has support for CFD simulations using NVIDIA CUDA GPU devices on Windows and Linux. Caedium is an integrated simulation environment that targets Computational Fluid Dynamics (CFD). The GPU support is provided by Symscape’s ofgpu linear solver library for OpenFOAM®. For more details see:
http://www.symscape.com/news/hybrid-cfd-modeling-cloud-computing
