February 6th, 2010
Abstract:
Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.
(Florian Ries, Tommaso De Marco, Matteo Zivieri and Roberto Guerrieri, Triangular Matrix Inversion on Graphics Processing Units, Supercomputing 2009, DOI 10.1145/1654059.1654069)
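The divide-and-conquer idea behind recursive triangular inversion can be sketched in a few lines. This is a minimal CPU illustration in plain Python, not the authors' GPU implementation: the function names `matmul` and `invert_lower` are hypothetical, and the split point (half the matrix size) is an assumption. For a lower-triangular matrix split into blocks L11, L21 and L22, the inverse has diagonal blocks inv(L11) and inv(L22) and the off-diagonal block -inv(L22) * L21 * inv(L11).

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product (the step a GPU version would hand to a tuned GEMM)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def invert_lower(l: Matrix) -> Matrix:
    """Invert a non-singular lower-triangular matrix by recursive blocking."""
    n = len(l)
    if n == 1:
        return [[1.0 / l[0][0]]]
    h = n // 2  # assumed split point: halve the matrix at each level
    l11 = [row[:h] for row in l[:h]]
    l21 = [row[:h] for row in l[h:]]
    l22 = [row[h:] for row in l[h:]]
    inv11 = invert_lower(l11)  # the two diagonal blocks are independent
    inv22 = invert_lower(l22)
    # Off-diagonal block of the inverse: -inv(L22) * L21 * inv(L11)
    off = matmul(matmul(inv22, l21), inv11)
    top = [row + [0.0] * (n - h) for row in inv11]
    bottom = [[-x for x in off[i]] + inv22[i] for i in range(n - h)]
    return top + bottom


if __name__ == "__main__":
    for row in invert_lower([[2.0, 0.0, 0.0], [1.0, 4.0, 0.0], [3.0, 5.0, 8.0]]):
        print(row)
```

Most of the arithmetic ends up in the two matrix products, which is the kind of work that maps well onto a GPU; the two recursive calls are independent of each other and could in principle run concurrently.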
Posted in Research | Tags: Linear Algebra, NVIDIA CUDA, Papers
February 2nd, 2010
Abstract:
We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI’s libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI’s SSE backend, and an additional 3–4 and 4–16 times faster execution on the Cell and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.
(Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock, Dominik Göddeke and Carsten Gutwenger: HONEI: A collection of libraries for numerical computations targeting multiple processor architectures. Computer Physics Communications 180(12), pp. 2534-2543, December 2009. DOI 10.1016/j.cpc.2009.04.018)
Posted in Developer Resources, Research | Tags: Cell BE, Fluid Simulation, Meta-programming, Multicore, NVIDIA CUDA, Papers
February 2nd, 2010
Abstract:
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: 10.1109/SASP.2009.5226333)
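To make the idea of turning SPMD thread blocks into ordinary loops more concrete, here is a conceptual sketch in plain Python. It is not FCUDA's actual output and does not involve AutoPilot; the names `saxpy_thread` and `launch_as_loops` are hypothetical. It only shows how a per-thread kernel body can be wrapped in explicit loops over block and thread indices, where the outer loop carries the coarse-grained parallelism and the inner loop the fine-grained parallelism.

```python
from typing import Callable, List


def saxpy_thread(block_idx: int, thread_idx: int, block_dim: int,
                 a: float, x: List[float], y: List[float],
                 out: List[float]) -> None:
    """Body of one CUDA-style thread: computes a single output element."""
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # guard against threads past the end of the data
        out[i] = a * x[i] + y[i]


def launch_as_loops(kernel: Callable[..., None], grid_dim: int,
                    block_dim: int, *args) -> None:
    """Replace the implicit SPMD launch with explicit nested loops."""
    for block_idx in range(grid_dim):        # coarse-grained: independent blocks
        for thread_idx in range(block_dim):  # fine-grained: threads of one block
            kernel(block_idx, thread_idx, block_dim, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [10.0] * 5
    out = [0.0] * 5
    launch_as_loops(saxpy_thread, 2, 4, 2.0, x, y, out)
    print(out)
```

This simple wrapping is only valid for kernels without barrier synchronisation inside a block. A kernel that synchronises its threads would need its thread loop split at each barrier, which this sketch does not attempt.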
Posted in Research | Tags: FPGAs, NVIDIA CUDA, Papers, Parallel Programming, Static Program Analysis
January 24th, 2010
From the abstract:
We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.
Nitin Arora, Aashay Shringarpure, and Richard Vuduc. “Direct n-body kernels for multicore platforms.” In Proc. Int’l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009 (direct link to PDF).
Posted in Research | Tags: Astrophysics, NVIDIA CUDA, Papers, Physics Simulation
January 20th, 2010
This undergraduate thesis and poster by Kazuki Fujiwara and Naohito Nakasato from the University of Aizu address a common problem in astrophysics, the gravitational many-body problem, using both a brute-force approach and hierarchical data structures to solve it on ATI GPUs. Abstracts:
Fast Simulations of Gravitational Many-body Problem on RV770 GPU
Kazuki Fujiwara, Naohito Nakasato (University of Aizu)
Abstract:
The gravitational many-body problem concerns the movement of bodies that interact through gravity. However, solving the gravitational many-body problem with a CPU takes a lot of time due to its O(N^2) computational complexity. In this paper, we show how to speed up the gravitational many-body problem by using a GPU. After extensive optimizations, the peak performance obtained so far is about 1 Tflops.
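The O(N^2) cost mentioned in this abstract comes from evaluating every pair of bodies. The following is a minimal sketch in plain Python, not the authors' RV770 code: the function name `direct_accelerations`, the softening length `eps` and the choice of units with G = 1 are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def direct_accelerations(pos: List[Vec3], mass: List[float],
                         eps: float = 1e-3, g: float = 1.0) -> List[Vec3]:
    """All-pairs gravitational accelerations: N * (N - 1) pair interactions."""
    n = len(pos)
    acc: List[Vec3] = []
    for i in range(n):          # on a GPU, typically one thread per body i
        ax = ay = az = 0.0
        for j in range(n):      # each body accumulates the pull of all others
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dz = pos[j][2] - pos[i][2]
            # Softened squared distance (eps is an assumed parameter)
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            ax += g * mass[j] * dx * inv_r3
            ay += g * mass[j] * dy * inv_r3
            az += g * mass[j] * dz * inv_r3
        acc.append((ax, ay, az))
    return acc


if __name__ == "__main__":
    print(direct_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                               [1.0, 2.0], eps=0.0))
```

The outer loop iterations are independent of each other, which is why the brute-force method parallelises so readily, even though the total work still grows quadratically with the number of bodies.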
Oct-tree Method on GPU
N.Nakasato
Abstract:
The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N^2) with a brute-force method to O(N log N) with the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87×10^6 particles took 5.79 hours and executed 1.2×10^13 force evaluations in total. We obtained a sustained computing speed of 21.8 Gflops and a cost per Gflops of $41.6/Gflops, which is two and a half times better than the previous record in 2006.
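The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the abstract; the variable names are illustrative, and reading the quoted 41.6 per Gflops as US dollars is an assumption.

```python
# Figures taken from the abstract
particles = 2.87e6
wall_hours = 5.79
force_evals = 1.2e13
sustained_gflops = 21.8
cost_per_gflops = 41.6  # assumed to be in US dollars

seconds = wall_hours * 3600.0
evals_per_second = force_evals / seconds

# Floating-point operations implied per force evaluation
flops_per_eval = sustained_gflops * 1e9 / evals_per_second

# System cost implied by the cost-per-Gflops figure
implied_system_cost = cost_per_gflops * sustained_gflops

# Pair count of a single brute-force force computation, for comparison
brute_force_pairs = particles ** 2
run_vs_one_brute_force_step = force_evals / brute_force_pairs

if __name__ == "__main__":
    print(f"{evals_per_second:.3e} force evaluations per second")
    print(f"{flops_per_eval:.1f} flops per force evaluation")
    print(f"${implied_system_cost:.0f} implied system cost")
    print(f"{run_vs_one_brute_force_step:.2f} brute-force steps' worth of pairs")
```

The numbers hang together: the implied count is a little under 38 floating-point operations per force evaluation, and the implied system cost is about $907, which matches "roughly $900". The entire run also needed only about one and a half times the pair evaluations that a single brute-force force computation over all 2.87×10^6 particles would take, which shows how much work the tree method saves.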
Posted in Research | Tags: Astrophysics, Data Structures, Papers, Posters
January 17th, 2010
Occasionally, we receive news submissions pointing us to interesting older papers that somehow slipped by without our notice. This post collects a few of those. If you want your work to be posted on GPGPU.org in a timely manner, please remember to use the news submission form.
- Joshua A. Anderson, Chris D. Lorenz and Alex Travesset present and discuss molecular dynamics simulations and compare a single GPU against a 36-CPU cluster (General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics 227(10), May 2008, DOI 10.1016/j.jcp.2008.01.047).
- Wen-mei Hwu et al. derive and discuss goals and concepts of programming models for fine-grained parallel architectures, from the point of view of both a programmer and a hardware/compiler designer, and analyze CUDA as one current representative (Implicitly parallel programming models for thousand-core microprocessors, Proceedings of DAC’07, June 2007, DOI 10.1145/1278480.1278669).
- Jeremy Sugerman et al. present GRAMPS, a prototype implementation of future graphics hardware that allows pipelines to be specified as graphs in software (GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics 28(1), January 2009, DOI 10.1145/1477926.1477930).
- William R. Mark discusses concepts of future graphics architectures in this contribution to the 2008 ACM Queue special issue on GPUs (Future graphics architectures, ACM Queue 6(2), March/April 2008, DOI 10.1145/1365490.1365501).
- BSGP by Qiming Hou et al. is a new programming language for general purpose GPU computing that achieves the same efficiency as well-tuned CUDA programs but makes code much easier to read, develop and maintain (BSGP: bulk-synchronous GPU programming, ACM SIGGRAPH 2008, August 2008, DOI 10.1145/1399504.1360618).
- Finally, Che et al. and Garland et al. survey the field of GPU computing and discuss many different application domains. These articles are, in addition to the ones we have collected on the developer pages, recommended to GPGPU newcomers.
Posted in Research, Site News | Tags: Computer Architecture, Data-Parallel, Molecular Dynamics, NVIDIA CUDA, Papers, Programming Languages
December 8th, 2009
Abstract:
GPUs have recently evolved into very fast parallel coprocessors capable of executing general-purpose computations extremely efficiently. At the same time, the evolution of multicore CPUs has continued, and today’s CPUs have 4-8 cores. These two trends, however, have followed independent paths in the sense that we are aware of very few works that consider both devices cooperating to solve general computations. In this paper we investigate the coordinated use of CPU and GPU to improve efficiency of applications even further than using either device independently. We use the Anthill runtime environment, a data-flow oriented framework in which applications are decomposed into a set of event-driven filters, where for each event, the runtime system can use either GPU or CPU for its processing. For evaluation, we use a histopathology application that uses image analysis techniques to classify tumor images for neuroblastoma prognosis. Our experimental environment includes dual and octa-core machines, augmented with GPUs, and we evaluate our approach’s performance for standalone and distributed executions. Our experiments show that a pure GPU optimization of the application achieved a factor of 15 to 49 times improvement over the single-core CPU version, depending on the versions of the CPUs and GPUs. We also show that the execution time can be further reduced by a factor of about 2 by using our runtime system that effectively choreographs the execution to run cooperatively both on GPU and on a single core of CPU. We improve on that by adding more cores, all of which were previously neglected or used ineffectively. In addition, the evaluation on a distributed environment has shown near linear scalability to multiple hosts.
(George Teodoro, Rafael Sachetto, Olcay Sertel, Metin Gurcan, Wagner Meira Jr., Umit Catalyurek, and Renato Ferreira. Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications. IEEE Cluster 2009. New Orleans, LA, USA. Presentation. Paper.)
Posted in Research | Tags: Clusters, Papers
December 8th, 2009
Abstract:
This paper presents, to the author’s knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA’s Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude, depending on the hardware and software used, while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours, and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available under the GNU General Public License.
The program is freely available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/
(Jani Sainio. “CUDAEASY – a GPU Accelerated Cosmological Lattice Program”. Submitted to Computer Physics Communications (under review). November 2009.)
Posted in Research | Tags: Astrophysics, Cosmology, NVIDIA CUDA, Open Source, Papers
November 25th, 2009
This paper in the Proceedings of the Institution of Civil Engineers describes an application of GPGPU for flood risk modelling by a team based at JBA Consulting in the UK. The model described here has since been used to produce flood risk maps for several countries in Europe.
Abstract:
“Two-dimensional (2D) flood inundation modelling is now an important part of flood risk management practice. Research in the fields of computational hydraulics and numerical methods, allied with advances in computer technology and software design, have brought 2D models into mainstream use. Even so, the models are computationally demanding and can take a long time to run, especially for large areas and at high spatial resolutions (for instance 2 × 2 m or smaller grid cells). There is thus strong motivation to accelerate 2D model codes. This paper demonstrates the use of technology from the computer graphics industry to accelerate a 2D diffusion wave (non-inertial) floodplain model. Over the past decade the market for computer games has driven the development of very fast, relatively low-cost ‘graphical processing units’. In recent years there has been a growing interest in this high-performance graphics hardware for scientific and engineering applications. This work adapted a flood model algorithm to run on a commodity personal computer graphics card. The results of a benchmark urban flood simulation were reproduced and the model run time reduced from 18 h to 9·5 min.”
(Lamb, R., Crossley, A. and Waller, S. 2009. A fast two-dimensional floodplain inundation model. Proceedings of the Institution of Civil Engineers – Water Management, Volume 162, Issue 6, pages 363–370. DOI: 10.1680/wama.2009.162.6.363)
Posted in Research | Tags: Fluid Simulation, Papers
November 25th, 2009
Abstract:
Cellular-level agent based modelling is reliant on either sequential processing environments or expensive and largely unavailable PC grids. The GPU offers an alternative architecture for such systems; however, the steep learning curve associated with the GPU’s data parallel architecture has previously limited the uptake of this emerging technology. In this paper we demonstrate a template driven agent architecture which provides a mapping of XML model specifications and C language scripting to optimised Compute Unified Device Architecture (CUDA) for the GPU. Our work is validated through the implementation of a Keratinocyte model using limited range message communication with non-linear time simulation steps to resolve intercellular forces. The performance gain achieved over existing modelling techniques reduces simulation times from hours to seconds. The improvement of simulation performance allows us to present a real-time visualisation technique which was previously unobtainable.
(Richmond Paul, Coakley Simon, Romano Daniela (2009), Cellular Level Agent Based Modelling on the Graphics Processing Unit, (Best Student Paper) Proc. of HiBi09 – High Performance Computational Systems Biology, 14-16 October 2009, Trento, Italy)
Posted in Research | Tags: Agent-Based Modeling, Computational Biology, Papers