Integrating CUDA and GNU Autotools

November 17th, 2011 has released a step by step guide to integrating CUDA with GNU Autotools. The guide covers building stand alone CUDA binaries, static CUDA libraries, shared CUDA libraries and comes with an example tarball. For more information go to

Parallel Accelerating for Star Catalogue Retrieval Algorithm using GPUs

November 16th, 2011


A GPU-based parallel star retrieval method is proposed to improve the efficiency of searching stars from star catalogue in computer simulation, especially when the FOV (Field of View) is large. By the novel algorithm, the stars in catalogue are classified and stored in different zones using latitude and longitude zoning method firstly. Based on the easily accessible star catalogue, the star zones that FOV covers can be computed exactly by constructing a spherical triangle around the FOV. As a result, the searching scope is reduced effectively. Finally, we use CUDA computation architecture to run the process of star retrieving from those star zones parallel on GPU. Experimental results show that, in comparison with CPU-oriented implementation, the proposed algorithm achieves up to tens of times speedup, and the processing time is limited within a millisecond level in large FOV and wide star magnitude span. It meets the requirement of real-time simulation.

(Chao Li, Liqiang Zhang, Jiaze Wu, and Changwen Zheng, “Parallel Accelerating for Star Catalogue Retrieval Algorithm using GPUs”, Journal of Astronautics, 2012)

A fast algorithm of simulating star map for star sensor

November 16th, 2011


In order to test the function and performance of star sensor on the ground, a fast method for simulating star map is presented. The algorithm adopts instantanesous coordinate of star and improves the star searching efficiency by optimizing the zone partitioning method for star catalogue. We overcome the low accuracy of the latitude and longitude’s span that FOV overlays by proposing a new spherical right-angled triangle method and the searching scope is reduced highly; meanwhile, the simulation model for star brightness is also built based on adopted star catalogue. Simulation study is conducted for the demonstration of the algorithm. The proposed approach meets the requirement of wide magnitude range and short simulation period.

(Chao Li, Changwen Zheng, Jiaze Wu, and Liqiang Zhang, “A fast algorithm of simulating star map for star sensor”, Proceedings of the 3rd IEEE International Conferernce on Computer and Network Technology (IEEE ICCNT), 2011)

Accelerating GPU Kernels for Dense Linear Algebra

November 14th, 2011


Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimization techniques –allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration e!ect to higher level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.

(R. Nath, S. Tomov and J. Dongarra: “Accelerating GPU Kernels for Dense Linear Algebra”, VECPAR 2010. [PDF])

An Improved MAGMA GEMM For Fermi Graphics Processing Units

November 14th, 2011


We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.

(R. Nath and S. Tomov and J. Dongarra: “An Improved MAGMA GEMM For Fermi Graphics Processing Units”,  International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [DOI] [PREPRINT])

Call for Papers: GPGPU-5

November 13th, 2011

Paper submission is now open for GPGPU5 which will be held March 3, 2012 in London, UK, and co-located with ACM ASPLOS XVII. The goal of this workshop is to provide a forum to discuss new and emerging general-purpose purpose programming environments and platforms, as well as evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s work is particularly interested on new heterogeneous GPU platforms. For more information, visit:

CULA Sparse Now Available

November 10th, 2011

EM Photonics has released CULA Sparse, a ready-to-integrate package for solving sparse linear systems. Features include:

  • Interfaces: C, C++, Fortran, Matlab, Python
  • Platforms: all CUDA platforms. including Linux, Windows, and OS X
  • Solvers and preconditioners: BiCG, BiCGStab, CG, GMRES, MINRES and Jacobi, ILU(0)
  • Data formats: COO, CSR, CSC in double precision real and complex floating point
  • No CUDA programming experience required.

More information is available at

Call for papers: CIGPU 2012, Brisbane, Australia, 10-15 June 2012

November 10th, 2011

Submissions are invited for the fifth special session on Computational Intelligence on Consumer Games and Graphics Hardware (CIGPU-2012) to be held in Brisbane, Australia as part of the IEEE World Congress on Computational Intelligence, 10-15 June 2012. More information can be found at

MIDeA: A Multi-Parallel Intrusion Detection Architecture

November 3rd, 2011


Network intrusion detection systems are faced with the challenge of identifying diverse attacks, in extremely high speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly-complex per-packet and per-flow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high speed networks. To cope with the increased processing throughput requirements, our system parallelizes network traffic processing and analysis at three levels, using multi-queue NICs, multiple CPUs, and multiple GPUs. The proposed design avoids locking, optimizes data transfers between the different processing units, and speeds up data processing by mapping different operations to the processing units where they are best suited. Our experimental evaluation shows that our prototype implementation based on commodity off-the-shelf equipment can reach processing speeds of up to 5.2 Gbit/s with zero packet loss when analyzing traffic in a real network, whereas the pattern matching engine alone reaches speeds of up to 70 Gbit/s, which is an almost four times improvement over prior solutions that use specialized hardware.

(Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis: “MIDeA: A Multi-Parallel Intrusion Detection Architecture”, Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), Oct. 2011. [PDF])

23rd International Symposium on Computer Architecture and High Performance Computing – SBAC-PAD’2011

November 2nd, 2011

SBAC-PAD is an annual international conference series, the first of which was held in 1987. Each conference has traditionally presented new developments in high performance applications, as well as the latest trends in computer architecture and parallel and distributed technologies. Authors are invited to submit original manuscripts on a wide range of high-performance computing areas, including computer architecture, systems software, languages and compilers, algorithms, and applications. More information:

Page 30 of 111« First...1020...2829303132...405060...Last »