As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: 10.1109/SASP.2009.5226333)
Molecular Workshop Series – Running and Developing MD Algorithms on GPUs with OpenMM and PyOpenMM + Intro to MD and Trajectory Analysis
Simbios is excited to announce its upcoming Molecular Dynamics (MD) Workshop Series, highlighting new capabilities within the recently released OpenMM 1.0 and introducing PyOpenMM for rapid MD code development with high performance:
Day 1: Running and Developing MD Algorithms on GPUs with OpenMM
Day 2: Introduction to MD and Trajectory Analysis with Markov State Models
When: March 1-2, 2010 (sign up for one or two days)
Where: Stanford University
RIKEN, one of the most prestigious research institutes in Japan, is the site of an upcoming computing workshop to be keynoted by NVIDIA CEO Jen–Hsun Huang. RIKEN conducts research across a wide range of fields, including physics, chemistry, medical science, biology, and engineering. The workshop will be held 1/28/10 – 1/29/10. See https://reg-nvidia.jp/public/seminar/view/3 for full details. In addition to keynote speeches by Jen-Hsun Huang and Professor Takayuki Aoki from Tokyo Institute of Technology, guest speakers at the event include Prof. Lorena Barba from Boston University, Mr. Mr. Eiji Fujii from Square ENIX, Dr. Mark Harris from NVIDIA (and GPGPU.org), and Dr. James Phillips from The University of Illinois at Urbana-Champaign.
From the workshop webpage:
“Accelerated Computing” is an old concept that is recently redefined in High-Performance Computing. It was started by dedicated machines like GRAPEs, but a great revolution has been occurring fueled by recent advancement in GPU Computing, both in hardware and in software such as CUDA C and OpenCL. This conference aims to review cutting edge technologies and scientific applications, as well as to discuss the future of the “Accelerator” approach in scientific and industrial HPC. Please join the conference for fruitful discussions on the future of HPC with highly-parallel processors.
We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.
Nitin Arora, Aashay Shringarpure, and Richard Vuduc. “Direct n-body kernels for multicore platforms.” In Proc. Int’l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009 (direct link to PDF).
This undergraduate thesis and poster by Kajuki Fujiwara and Naohito Nakasato from the University of Aizu approach a common problem in astrophysics: the many-body problem, with both brute-force and hierarchical data structures for solving it on ATI GPUs. Abstracts:
The gravitational many-body problem is a problem concerning the movement of bodies, which are interacting through gravity. However, solving the gravitational many-body problem with a CPU takes a lot of time due to O(N^2) computational complexity. In this paper, we show how to speed-up the gravitational many-body problem by using GPU. After extensive optimizations, the peak performance obtained so far is about 1 Tflops.
The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important since computational complexity of these problems are reduced from O(N^2) with a brute force method to O(N log N) with the tree method where N is a number of particles. In this paper, we present a parallel implementation of the tree method running on a graphic processor unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87×10^6 particles took 5.79 hours and executed 1.2×10^13 force evaluations in total. We obtained the sustained computing speed of 21.8 Gflops and the cost per Gflops of 41.6/Gflops that is two and half times better than the previous record in 2006.
Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted on GPGPU.org in a timely manner, please remember to use the news submission form.
Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/j.jcp.2008.01.047).
Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware /compiler designer, and analyze CUDA as one current representative (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008, DOI 10.1145/1365490.1365501).
BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM Siggraph 2008, August 2008, DOI 10.1145/1399504.1360618).
This workshop will be held in conjunction with CIT 2010, Bradford, UK, 29 June – 01 July, 2010. From the announcement:
We are undergoing a new revolution in parallel processor technologies, especially the Graphics Processing Units. GPUs have become widely used nowadays to accelerate a broad range of applications, including computational finance, numerical computing, image/video processing, engineering simulations, quantum chemistry, just to name a few.
The goal of this workshop is to provide a forum for researchers and practitioners to discuss and share their research and development experiences and outputs on the massively parallel GPU platforms, software development tools, optimization techniques, parallel algorithm design, and all kinds of successful applications. We solicit original and previously unpublished papers addressing research challenges and advances towards the design, implementation and evaluation of massively parallel GPU computing.
Following the tremendous success of HotPar ’09, the Second USENIX Workshop on Hot Topics in Parallelism (HotPar ’10) will once again bring together researchers and practitioners doing innovative work in the area of parallel computing. Multicore processors are the pervasive computing platform of the future. This trend is driven by limits on energy consumption in computer systems and the poor energy performance of conventional microprocessors. Parallel architectures can potentially mitigate these problems, but this new computer architecture will only be successful if languages, systems, and applications can take advantage of parallel hardware. Navigating this change will require new concurrency-friendly programming paradigms, new methods of application design, new structures for system software, and new models of interaction between applications, compilers, operating systems, and hardware.
We request submissions of position papers that propose new directions for research of products in these areas, advocate non-traditional approaches to the problems engendered by parallelism, or potentially generate controversy and discussion. We encourage submissions from practitioners as well as from researchers. Read the rest of this entry »
Multi- and many-core microprocessors are being deployed in a broad spectrum of applications including Clusters, Clouds and Grids. Both conventional multi- and many-core processors, such as Intel Nehalem and IBM Power7 processors, and unconventional many-core processors, such as NVIDIA Tesla and AMD FireStream GPUs, hold the promise of increasing performance through parallelism. However, GPU approaches in parallelism are distinctly different from those of conventional multi- and many-core processors, which raises new challenges: For example, how do we optimize applications for conventional multi- and many-core processors? How do we reengineer applications to take advantage of GPUs’ tremendous computing power in a reasonable cost-benefit ratio? What are effective ways of using GPUs as accelerators? The goals of this workshop are to discuss these and other issues and bring together developers of application algorithms and experts in utilizing multi- and many-core processors. Accepted papers will be published in the CCGRID proceedings. Selected papers will be published in a special issue of the Journal Concurrency and Computation: Practice and Experience.
Taking inspiration from genetic screening techniques, researchers from MIT and Harvard have demonstrated a way to build better artificial visual systems with the help of low-cost, high-performance gaming hardware.
The neural processing involved in visually recognizing even the simplest object in a natural environment is profound — and profoundly difficult to mimic. Neuroscientists have made broad advances in understanding the visual system, but much of the inner workings of biologically based systems remain a mystery.
Using Graphics Processing Units (GPUs) — the same technology video game designers use to render life-like graphics — MIT and Harvard researchers are now making progress faster than ever before. “We made a powerful computing system that delivers over hundred fold speed-ups relative to conventional methods,” said Nicolas Pinto, a PhD candidate in James DiCarlo’s lab at the McGovern Institute for Brain Research at MIT. “With this extra computational power, we can discover new vision models that traditional methods miss.” Pinto co-authored the PLoS study with David Cox of the Visual Neuroscience Group at the Rowland Institute at Harvard.