A Highly Efficient GPU Implementation for Variational Optic Flow Based on the Euler-Lagrange Framework

November 21st, 2010


The Euler-Lagrange (EL) framework is the most widely used strategy for solving variational optic flow methods. We present the first approach that solves the EL equations of state-of-the-art methods on sequences with 640×480 pixels in near-realtime on GPUs. This performance is achieved by combining two ideas: (i) We extend the recently proposed Fast Explicit Diffusion (FED) scheme to optic flow, and additionally embed it into a coarse-to-fine strategy. (ii) We parallelise our complete algorithm on a GPU, where a careful optimisation of global memory operations and an efficient use of on-chip memory guarantee good performance. Applying our approach to the variational ‘Complementary Optic Flow’ method (Zimmer et al. (2009)), we obtain highly accurate flow fields in less than a second. This currently constitutes the fastest method in the top 10 of the widely used Middlebury benchmark.

(Pascal Gwosdek, Henning Zimmer, Sven Grewenig, Andrés Bruhn and Joachim Weickert: “A Highly Efficient GPU Implementation for Variational Optic Flow Based on the Euler-Lagrange Framework”, Proceedings of the ECCV Workshop for Computer Vision with GPUs, Sep 2010.) [Project webpage with PDF, sources and additional information]
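
For readers unfamiliar with FED, here is a minimal sketch of how the varying step sizes of one FED cycle are commonly computed. It is not taken from the paper's sources: the function names `fed_step_sizes` and `fed_cycle_time` and the parameters `n` and `tau_max` are illustrative, and the step-size formula is the one usually given in the FED literature, assumed here rather than confirmed by the abstract above.

```python
import math

def fed_step_sizes(n, tau_max):
    """Step sizes of one FED cycle consisting of n explicit steps.

    Assumed formula (from the FED literature, not from this post):
        tau_i = tau_max / (2 * cos^2(pi * (2i + 1) / (4n + 2))),  i = 0..n-1
    tau_max is the stability limit of the ordinary explicit scheme.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        tau_max / (2.0 * math.cos(math.pi * (2 * i + 1) / (4 * n + 2)) ** 2)
        for i in range(n)
    ]

def fed_cycle_time(n, tau_max):
    """Total diffusion time covered by one cycle: tau_max * (n^2 + n) / 3."""
    return tau_max * (n * n + n) / 3.0

# Hypothetical values: 5 steps per cycle, stability limit 0.25.
steps = fed_step_sizes(5, 0.25)
print(steps)
print(sum(steps), fed_cycle_time(5, 0.25))
```

Under these assumptions, some individual steps are larger than `tau_max`, and the time covered by one cycle grows quadratically with the number of steps, which is where the speed-up over a plain explicit scheme comes from.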

Submit Applications for NVIDIA Graduate Fellowship

November 18th, 2010

The application period for the NVIDIA Graduate Fellowship Program is now open. We are currently accepting applications for the 2011-2012 academic year. The deadline to apply is 11:59PM PST on February 3, 2011.

NVIDIA has long believed that investing in university talent is beneficial to the industry and key to our continued growth and success. The NVIDIA Graduate Fellowship Program provides funding to Ph.D. students who are researching topics that will lead to major advances in the graphics and high-performance computing industries, and are investigating innovative ways of leveraging the power of the GPU. We select students each year who have the talent, aptitude and initiative to work closely with us early in their careers. Recipients not only receive crucial funding for their research, but are able to conduct groundbreaking work with access to NVIDIA products, technology and some of the most talented minds in the field.

For complete details including application instructions, requirements, benefits, and eligibility, visit the NVIDIA Graduate Fellowship website.

CFP: International Journal of Computer Science and Security (IJCSS)

November 17th, 2010

The International Journal of Computer Science and Security (IJCSS) is a refereed online journal that serves as a forum for publishing current research in computer science and computer security technologies. It considers any material dealing primarily with the technological aspects of computer science and computer security. The journal is aimed at academics, scholars, advanced students, practitioners, and those seeking an update on current experience and future prospects in all aspects of computer science in general and computer security in particular. Subjects covered include access control, computer security, cryptography, communications and data security, databases, electronic commerce, multimedia, bioinformatics, signal processing, and image processing.

CfP: 19th High Performance Computing Symposium (HPC 2011)

November 16th, 2010

The 2011 Spring Simulation Multiconference will feature the 19th High Performance Computing Symposium (HPC 2011), devoted to the impact of high performance computing and communications on computer simulations. Advances in multicore and many-core architectures, networking, high end computers, large data stores, and middleware capabilities are ushering in a new era of high performance parallel and distributed simulations. Along with these new capabilities come new challenges in computing and system modeling. The goal of HPC 2011 is to encourage innovation in high performance computing and communication technologies and to promote synergistic advances in modeling methodologies and simulation. It will promote the exchange of ideas and information between universities, industry, and national laboratories about new developments in system modeling, high performance computing and communication, and scientific computing and simulation.

Topics of interest include:

  • high performance/large scale application case studies,
  • GPU, multicore, and many-core analysis and applications,
  • power aware computing,
  • cloud, distributed, and grid computing,
  • asynchronous numerical methods and programming,
  • hybrid system modeling and simulation,
  • visualization and data management,
  • problem solving environments,
  • tools and environments for coupling parallel codes,
  • parallel algorithms and architectures,
  • high performance software tools,
  • resilience at the simulation level,
  • component technologies for high performance computing.

More information can be found on the webpage: http://www.cs.vt.edu/hpc2011/


November 16th, 2010

The goal of this workshop, held in conjunction with ASPLOS XVI (Newport Beach, CA, USA, March 5-6, 2011), is to provide a forum to discuss new and emerging general-purpose programming environments and platforms, as well as to evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s workshop is particularly interested in new heterogeneous GPU platforms. Papers are being sought on many aspects of GPUs, including (but not limited to):

  • GPU applications + GPU compilation
  • GPU programming environments + GPU power/efficiency
  • GPU architectures + GPU benchmarking/measurements
  • Multi-GPU systems + Heterogeneous GPU platforms

Paper Submission: Authors should submit an 8-page paper in ACM double-column style, following the directions on the conference website at http://www.ece.neu.edu/GPGPU.

Organizers: John Cavazos (University of Delaware) and David Kaeli (Northeastern University)

GPU-Accelerated Molecular Modeling Coming Of Age

November 16th, 2010


Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks.

(John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten: “GPU-Accelerated Molecular Modeling Coming of Age”, Journal of Molecular Graphics and Modelling, Volume 29, Issue 2, September 2010, Pages 116-125. [DOI])

Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark

November 16th, 2010


The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures by both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, the publication of studies claiming performance improvements in excess of 30,000x is misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration; studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare.

In this paper we present an analysis of a port of the NAS LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA) – the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices.

Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices to several processors including those from Intel, AMD and IBM.

In addition to this we utilise a recently developed performance model of LU. With this we predict the runtime performance of LU on large-scale distributed GPU clusters, which are predicted to become commonplace in future high-end HPC architectural solutions.

(S.J. Pennycook, S.D. Hammond, S.A. Jarvis and G.R. Mudalige: “Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held as part of Supercomputing 2010 (SC’10), New Orleans, LA, USA. [PDF])

Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU

October 27th, 2010


In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using operations involving dot-products and additions. We implement this algorithm on an NVIDIA Fermi GPU (Tesla 2050) using CUDA, and also manually parallelize it for the Intel Xeon X5680 (Westmere) and IBM Power7 multi-core processors. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version on the multi-core CPUs. A number of optimizations were performed for the GPU implementation on the Fermi, including blocking for Fermi’s configurable on-chip memory architecture. Experimental results illustrate that on a single multi-core processor, the manually parallelized versions of the correlation application perform only a small factor slower than the CUDA version executing on the Fermi: 1.005s on Power7, 3.49s on Intel X5680, and 465ms on Fermi. On a two-processor Power7 system, performance approaches that of the Fermi (650ms), while the Intel version runs in 1.78s. These results conclusively demonstrate that performance of the GPU memory subsystem is critical to effectively harness its computational capabilities. For the correlation application, a significantly higher amount of effort was put into developing the GPU version when compared to the CPU ones (several days against a few hours). Our experience presents compelling evidence that performance comparable to that of GPUs can be achieved with much greater productivity on modern multi-core CPUs.

(R. Bordawekar and U. Bondhugula and R. Rao: “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”, Technical Report, IBM T. J. Watson Research Center, 2010 [PDF])
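
To put the quoted runtimes in perspective, the short sketch below computes how many times slower each CPU configuration is than the Fermi. The runtimes are the ones stated in the abstract; the dictionary labels and variable names are only illustrative.

```python
# Runtimes in seconds, as quoted in the abstract above.
FERMI_SECONDS = 0.465
cpu_seconds = {
    "Power7, 1 processor": 1.005,
    "Intel X5680, 1 processor": 3.49,
    "Power7, 2 processors": 0.650,
    "Intel X5680, 2 processors": 1.78,
}

# Slowdown relative to the Fermi: CPU runtime divided by GPU runtime.
slowdown = {name: t / FERMI_SECONDS for name, t in cpu_seconds.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x the Fermi runtime")
```

By these figures, the single-processor CPU versions are roughly 2x (Power7) and 7.5x (Intel) slower than the GPU, and the two-processor Power7 system is within about 1.4x.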


A Fast GEMM Implementation on a Cypress GPU

October 12th, 2010


We present benchmark results of optimized dense matrix multiplication kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show 73% and 87% of the theoretical performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s, which is more than 200 times faster than DDP performance on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with a main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques. While conventional wisdom in GPU programming recommends heavy use of shared memory, we show that the texture cache is very effective on the Cypress architecture.

(N. Nakasato: “A Fast GEMM Implementation on a Cypress GPU”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10) November 2010. A sample program is available at http://github.com/dadeba/dgemm_cypress)
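
As a reminder of what such kernels compute and how their throughput is usually reported, here is a minimal reference sketch in plain Python. It is not the author's kernel and says nothing about the Cypress implementation: `gemm` follows the standard BLAS definition C ← αAB + βC, and `gemm_gflops` uses the conventional count of 2·m·n·k floating-point operations; both function names are illustrative.

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM: returns alpha * (A @ B) + beta * C for list-of-lists matrices."""
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    if len(C) != m or any(len(row) != n for row in C):
        raise ValueError("C must be m x n")
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            row.append(alpha * acc + beta * C[i][j])
        result.append(row)
    return result

def gemm_gflops(m, n, k, seconds):
    """Throughput in Gflop/s, counting one multiply and one add per inner-loop step."""
    return 2.0 * m * n * k / seconds / 1e9

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(1.0, A, B, 0.0, C))
# A hypothetical 1000 x 1000 x 1000 multiply that takes 2 seconds:
print(gemm_gflops(1000, 1000, 1000, 2.0))
```

Read this way, the quoted 31 Gflop/s in DDP being more than 200 times faster implies that the single-core CPU baseline ran at under about 0.155 Gflop/s.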

HOOMD-blue 0.9.1 release

October 12th, 2010

HOOMD-blue performs general-purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many cores on a fast cluster. Flexible and configurable, HOOMD-blue is currently being used for coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants, dissipative particle dynamics (DPD) simulations of polymers, and crystallization of metals.

HOOMD-blue 0.9.1 adds many new features. Highlights include:

  • 10 to 50 percent faster performance over 0.9.0
  • DPD (Dissipative Particle Dynamics) capability
  • EAM (Embedded Atom Method) capability
  • Removed limitation on number of exclusions
  • Support for compute 2.1 devices (such as the GTX 460)
  • Support for CUDA 3.1
  • and more

HOOMD-blue 0.9.1 is available for download under an open source license. Check out the quick start tutorial to get started, or check out the full documentation to see everything it can do.
