November 16th, 2010
The 2011 Spring Simulation Multiconference will feature the 19th High Performance Computing Symposium (HPC 2011), devoted to the impact of high performance computing and communications on computer simulations. Advances in multicore and many-core architectures, networking, high end computers, large data stores, and middleware capabilities are ushering in a new era of high performance parallel and distributed simulations. Along with these new capabilities come new challenges in computing and system modeling. The goal of HPC 2011 is to encourage innovation in high performance computing and communication technologies and to promote synergistic advances in modeling methodologies and simulation. It will promote the exchange of ideas and information between universities, industry, and national laboratories about new developments in system modeling, high performance computing and communication, and scientific computing and simulation.
Topics of interest include:
- high performance/large scale application case studies,
- GPU, multicore, and many-core analysis and applications,
- power aware computing,
- cloud, distributed, and grid computing,
- asynchronous numerical methods and programming,
- hybrid system modeling and simulation,
- visualization and data management,
- problem solving environments,
- tools and environments for coupling parallel codes,
- parallel algorithms and architectures,
- high performance software tools,
- resilience at the simulation level,
- component technologies for high performance computing.
More information can be found on the webpage: http://www.cs.vt.edu/hpc2011/
November 16th, 2010
The goal of this workshop, held in conjunction with ASPLOS XVI (Newport Beach, CA, USA, March 5-6, 2011), is to provide a forum to discuss new and emerging general-purpose programming environments and platforms, as well as to evaluate applications that have been able to harness the horsepower provided by these platforms. This year’s workshop is particularly interested in new heterogeneous GPU platforms. Papers are being sought on many aspects of GPUs, including (but not limited to):
- GPU applications
- GPU compilation
- GPU programming environments
- GPU power/efficiency
- GPU architectures
- GPU benchmarking/measurements
- Multi-GPU systems
- Heterogeneous GPU platforms
Paper Submission: Authors should submit an 8-page paper in ACM double-column style, following the directions on the conference website at http://www.ece.neu.edu/GPGPU.
Organizers: John Cavazos (University of Delaware) and David Kaeli (Northeastern University)
November 16th, 2010
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks.
(John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten: “GPU-Accelerated Molecular Modeling Coming of Age”, Journal of Molecular Graphics and Modelling, Volume 29, Issue 2, September 2010, Pages 116-125. [DOI])
October 27th, 2010
The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures by both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, published studies claiming performance improvements in excess of 30,000x are misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration; studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare.
In this paper we present an analysis of a port of the NAS LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA) – the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices.
Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices to several processors including those from Intel, AMD and IBM.
In addition to this we utilise a recently developed performance model of LU. With this we predict the runtime performance of LU on large-scale distributed GPU clusters, which are predicted to become commonplace in future high-end HPC architectural solutions.
(S.J. Pennycook, S.D. Hammond, S.A. Jarvis and G.R. Mudalige: “Implementation of the NAS-LU Benchmark”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held as part of Supercomputing 2010 (SC’10), New Orleans, LA, USA. [PDF])
October 12th, 2010
In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using operations involving dot-products and additions. We implement this algorithm on an NVIDIA Fermi GPU (Tesla 2050) using CUDA, and also manually parallelize it for the Intel Xeon X5680 (Westmere) and IBM Power7 multi-core processors. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version on the multi-core CPUs. A number of optimizations were performed for the GPU implementation on the Fermi, including blocking for Fermi’s configurable on-chip memory architecture. Experimental results illustrate that on a single multi-core processor, the manually parallelized versions of the correlation application perform only a small factor slower than the CUDA version executing on the Fermi: 1.005s on Power7, 3.49s on Intel X5680, and 465ms on Fermi. On a two-processor Power7 system, performance approaches that of the Fermi (650ms), while the Intel version runs in 1.78s. These results conclusively demonstrate that performance of the GPU memory subsystem is critical to effectively harness its computational capabilities. For the correlation application, a significantly higher amount of effort was put into developing the GPU version when compared to the CPU ones (several days against a few hours). Our experience presents compelling evidence that performance comparable to that of GPUs can be achieved with much greater productivity on modern multi-core CPUs.
(R. Bordawekar and U. Bondhugula and R. Rao: “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”, Technical Report, IBM T. J. Watson Research Center, 2010 [PDF])
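To make the "dot-products and additions" in the abstract above concrete, here is a minimal, unoptimized Python sketch of 2-D cross-correlation. It is only an illustration and not the report's implementation: the function names `cross_correlate` and `best_offset` and the toy matrices are assumptions made for this example, and none of the blocking, vectorization, or threading discussed in the report is shown.

```python
def cross_correlate(reference, image):
    """Slide `image` over `reference`; return the dot product at each offset.

    Both arguments are 2-D lists of floats (hypothetical stand-ins for the
    single-precision image matrices mentioned in the abstract).
    """
    ref_h, ref_w = len(reference), len(reference[0])
    img_h, img_w = len(image), len(image[0])
    scores = []
    for dy in range(ref_h - img_h + 1):
        row = []
        for dx in range(ref_w - img_w + 1):
            acc = 0.0
            # Dot product of the image with the reference window at (dy, dx).
            for i in range(img_h):
                for j in range(img_w):
                    acc += reference[dy + i][dx + j] * image[i][j]
            row.append(acc)
        scores.append(row)
    return scores


def best_offset(scores):
    """Return the (dy, dx) offset with the highest correlation score."""
    best = max(
        (value, dy, dx)
        for dy, row in enumerate(scores)
        for dx, value in enumerate(row)
    )
    return best[1], best[2]


# Toy data: the 2x2 image appears in the lower-right corner of the reference.
reference = [[0.0, 0.0, 0.0],
             [0.0, 1.0, 2.0],
             [0.0, 3.0, 4.0]]
image = [[1.0, 2.0],
         [3.0, 4.0]]
scores = cross_correlate(reference, image)
peak = best_offset(scores)
```

The four nested loops are independent across offsets, which is the property that makes the computation a natural fit for both GPU threads and multi-core CPU threads.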
October 12th, 2010
We present benchmark results of optimized dense matrix multiplication kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show 73% and 87% of the theoretical performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s, more than 200 times the DDP performance on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with a main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques. While conventional wisdom in GPU programming recommends heavy use of shared memory on GPUs, we show that the texture cache is very effective on the Cypress architecture.
(N. Nakasato: “A Fast GEMM Implementation on a Cypress GPU”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10) November 2010. A sample program is available at http://github.com/dadeba/dgemm_cypress)
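For readers unfamiliar with double-double precision, here is a minimal Python sketch of the usual idea: a value is stored as an unevaluated sum of two doubles `(hi, lo)`, and additions use error-free transformations so that rounding errors are kept, not discarded. This is an assumption about what "double-double" means in general, not a description of the paper's kernels. The helper names `two_sum`, `quick_two_sum`, `dd_add`, and `dd_sum` are chosen for this example, and multiplication (needed for a full GEMM inner product) is omitted.

```python
def two_sum(a, b):
    """Error-free addition: return (s, err) with s = fl(a + b), a + b = s + err."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def quick_two_sum(a, b):
    """Same as two_sum, but assumes |a| >= |b| (cheaper renormalization step)."""
    s = a + b
    err = b - (s - a)
    return s, err


def dd_add(x, y):
    """Add two double-double numbers, each given as a (hi, lo) tuple."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


def dd_sum(values):
    """Accumulate plain doubles into a double-double running total."""
    total = (0.0, 0.0)
    for v in values:
        total = dd_add(total, (v, 0.0))
    return total


# Ten tiny terms that plain double addition silently drops next to 1.0.
values = [1.0] + [1e-17] * 10
plain = 0.0
for v in values:
    plain += v
hi, lo = dd_sum(values)
```

Here the plain double accumulation stays at exactly 1.0, while the double-double total keeps the lost contribution in its `lo` component. Each double-double addition costs several ordinary floating-point operations, which is one reason DDP throughput is far below DGEMM throughput on any hardware.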
October 7th, 2010
HOOMD-blue performs general-purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many cores on a fast cluster. Flexible and configurable, HOOMD-blue is currently being used for coarse-grained molecular dynamics simulations of nano-materials, glasses, and surfactants, dissipative particle dynamics (DPD) simulations of polymers, and crystallization of metals.
HOOMD-blue 0.9.1 adds many new features. Highlights include:
- 10 to 50 percent faster performance over 0.9.0
- DPD (Dissipative Particle Dynamics) capability
- EAM (Embedded Atom Method) capability
- Removed limitation on number of exclusions
- Support for compute 2.1 devices (such as the GTX 460)
- Support for CUDA 3.1
- and more
HOOMD-blue 0.9.1 is available for download under an open source license. Check out the quick start tutorial to get started, or check out the full documentation to see everything it can do.
October 4th, 2010
Seattle, WA, 4 October, 2010 – Insilicos today announced the company has received a grant to apply GPU computing to the study of the role of epistasis in human disease. Funding comes from the National Human Genome Research Institute, part of the National Institutes of Health.
Epistasis refers to the interaction of two or more genes and is thought to play a major role in the genetics of susceptibility to disease. One way to detect epistasis is through computationally intensive statistical algorithms, such as those employed in data mining. Insilicos plans to exploit the concurrency inherent in these algorithms by using commodity graphics processors.
September 30th, 2010
The promise of exascale computing power is reinforced by many-core technology, which involves all-purpose CPUs and specialized computing devices such as FPGAs, DSPs, and GPUs. GPUs in particular, due in part to their wide market footprint, currently achieve one of the best core/cost ratios in that category. Relying on APIs provided by GPU vendors, the use of GPUs as general-purpose massively parallel computing devices (GPGPUs) is now routine in the scientific community. The increasing number of CPU cores per chip has driven the development and spread of cloud computing, which builds on consolidated technologies such as, but not limited to, grid computing and virtualization. In recent years the use of grid computing for applications in e-science that demand high performance has become commonplace. Elastic computing power and storage provided by a cloud infrastructure may be attractive, but it is still limited by poor communication performance and a lack of support for using GPGPUs within a virtual machine instance. The GPU Virtualization Service (gVirtuS) presented in this work tries to fill the gap between in-house hosted computing clusters equipped with GPGPU devices and pay-for-use high performance virtual clusters deployed via public or private computing clouds. gVirtuS allows an instanced virtual machine to access GPGPUs in a transparent way, with an overhead slightly greater than a real machine/GPGPU setup. gVirtuS is hypervisor independent and, even though it currently virtualizes NVIDIA CUDA-based GPUs, it is not limited to a specific brand technology. The performance of the components of gVirtuS is assessed through a suite of tests in different deployment scenarios, such as providing GPGPU power to cloud computing based HPC clusters and sharing remotely hosted GPGPUs among HPC nodes.
(G. Giunta, R. Montella, G. Agrillo, and G. Coviello: “A GPGPU transparent virtualization component for high performance computing clouds”. In P. D’Ambra, M. Guarracino, and D. Talia, editors, Euro-Par 2010 – Parallel Processing, volume 6271 of Lecture Notes in Computer Science, chapter 37, pages 379-391. Springer Berlin / Heidelberg, 2010. DOI. Link to project webpage with source code.)
The Second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC’11) is to be held in conjunction with the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA-17), co-located with the 16th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2011), February 13, 2011, San Antonio, Texas, USA.
This workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Topics of interest for workshop submissions include (but are not limited to):