From a press release:
New Software Solution Reduces Dependency on CPUs
PORTLAND, Ore. - SC09 - Nov. 18, 2009 - NVIDIA Corporation (Nasdaq: NVDA) and Mellanox Technologies Ltd. today introduced new software that will increase cluster application performance by as much as 30% by reducing the latency that occurs when communicating over Mellanox InfiniBand to servers equipped with NVIDIA Tesla™ GPUs.
The system architecture of a GPU-CPU server requires the CPU to initiate and manage memory transfers between the GPU and the InfiniBand network. The new software solution will enable Tesla GPUs to transfer data to pinned system memory that a Mellanox InfiniBand solution is able to read and transmit over the network. The result is increased overall system performance and efficiency.
“NVIDIA Tesla GPUs deliver large increases in performance across each node in a cluster, but in our production runs on TSUBAME 1 we have found that network communication becomes a bottleneck when using multiple GPUs,” said Prof. Satoshi Matsuoka from Tokyo Institute of Technology. “Reducing the dependency on the CPU by using InfiniBand will deliver a major boost in performance in high performance GPU clusters, thanks to the work of NVIDIA and Mellanox, and will further enhance the architectural advances we will make in TSUBAME2.0.”
Read the rest of this entry »
The Portland Group has announced the general availability of its CUDA Fortran compiler for x64 and x86 processor-based systems running Linux, Mac OS X and Windows, including a 15-day trial license. From the press release:
Developed in collaboration with NVIDIA Corporation (Nasdaq: NVDA), the inventor of the graphics processing unit (GPU), PGI Release 2010 includes the first Fortran compiler compatible with the NVIDIA line of CUDA-enabled GPUs. A compiler is a software tool that translates applications from the high-level programming languages in which they are written by software developers into a binary form a computer can execute.
With developers taking advantage of the hundreds of cores and the relatively low cost of NVIDIA GPUs, programming to take advantage of the CUDA C compiler has become a popular means for accelerating the solution of complex computing problems. The PGI CUDA Fortran compiler is expected to accelerate GPU adoption even further in the High-Performance Computing (HPC) industry, where many important applications are written in Fortran. HPC is the field of technical computing engaged in the modeling and simulation of complex processes, such as ocean modeling, weather forecasting, environmental modeling, seismic analysis, bioinformatics and other areas.
The CUDA Fortran compiler is compatible with all NVIDIA GPUs that include Compute Capability 1.3 or higher, which includes most NVIDIA Quadro Professional Graphics solutions and all NVIDIA Tesla GPU Computing solutions. Developers are invited to download the PGI CUDA Fortran compiler from The Portland Group website at www.pgroup.com/support/downloads.php.
A 15-day trial license is available at no charge. In an effort to simplify adoption, NVIDIA has granted PGI rights to redistribute the relevant components of the CUDA Software Development Kit (SDK) as part of the PGI CUDA Fortran installation package.
In this work we describe a GPU implementation for an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors’ positions and orientations. This carries a very high computational cost in the so-called nearest neighbors search. By leveraging the GPU processing power and the new programming model called CUDA we implement an efficient framework that makes it possible to simulate the collective motion of high-density individual groups. In particular we present as a case study a simulation of the motion of millions of fish. We describe our implementation and present extensive experiments which demonstrate the effectiveness of our GPU implementation.
(Ugo Erra, Bernardino Frola, Vittorio Scarano, Iain Couzin, An efficient GPU implementation for large scale individual-based simulation of collective behavior. Proceedings of High Performance Computational Systems Biology (HiBi09). October 14-16, 2009, Trento, Italy.)
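The alignment rule described in the abstract can be illustrated with a minimal CPU sketch in Python. This is not the authors' GPU implementation: it assumes a 2D world, a fixed interaction radius, a brute-force O(n²) neighbor search, and orientation alignment only (no position averaging). The names `neighbors_within` and `step` and the parameters `radius` and `speed` are hypothetical, chosen for illustration. The brute-force search is exactly the step whose cost motivates a GPU approach.

```python
import math

def neighbors_within(positions, i, radius):
    """Brute-force neighbor search: indices j != i within `radius` of fish i."""
    xi, yi = positions[i]
    r2 = radius * radius
    return [j for j, (xj, yj) in enumerate(positions)
            if j != i and (xj - xi) ** 2 + (yj - yi) ** 2 <= r2]

def step(positions, headings, radius=1.0, speed=0.1):
    """One synchronous update: align each fish with its neighbors, then move.

    positions: list of (x, y) tuples; headings: list of angles in radians.
    Returns new (positions, headings) without modifying the inputs.
    """
    new_headings = []
    for i in range(len(positions)):
        # Include the fish itself so an isolated fish keeps its heading.
        group = neighbors_within(positions, i, radius) + [i]
        # Average unit direction vectors rather than raw angles,
        # which avoids wrap-around problems near +/- pi.
        sx = sum(math.cos(headings[j]) for j in group)
        sy = sum(math.sin(headings[j]) for j in group)
        new_headings.append(math.atan2(sy, sx))
    new_positions = [(x + speed * math.cos(h), y + speed * math.sin(h))
                     for (x, y), h in zip(positions, new_headings)]
    return new_positions, new_headings
```

Every fish scans every other fish, so the cost per step grows quadratically with the school size; spatial partitioning or a parallel search would be needed for the millions of individuals mentioned above.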
These webinars cover many topics including an introduction to C for CUDA, the OpenCL™ API, and performance optimization techniques, presented by NVIDIA DevTech Engineers with additional staff online to answer questions.
The full schedule and short abstracts can be viewed at: http://developer.nvidia.com/object/gpu_computing_online.html
From the press release:
NVIDIA Corp. today introduced NVIDIA® Nexus, the industry’s first development environment for massively parallel computing that is integrated into Microsoft Visual Studio, the world’s most popular development environment for Windows-based solutions and Web applications and services.
“NVIDIA Nexus is going to improve programmer productivity immediately,” said Tarek El Dokor at Edge 3 Technologies. “An integrated GPU and CPU development solution is something Edge 3 has needed for a long time. The fact that it’s integrated into the Visual Studio development environment drastically reduces the learning curve.”
NVIDIA Nexus radically improves productivity by enabling developers of GPU computing applications to use the popular Microsoft Visual Studio-based tools and workflow in a transparent manner, without having to create a separate version of the application that incorporates diagnostic software calls. NVIDIA Nexus also includes the ability to run the code remotely on a different computer. Nexus includes advanced tools for simultaneously analyzing efficiency, performance, and speed of both the graphics processing unit (GPU) and central processing unit (CPU) to give developers immediate insight into how co-processing affects their applications.
Nexus is composed of three components:
Read the rest of this entry »
nCore Design announces the immediate availability of the NCT-300 Programming GPU Processors course. Conceived with the experienced C/C++ programmer in mind, NCT-300 covers concepts and approaches related to programming GPU processors using both CUDA and OpenCL. The course covers GPU hardware, memories, data transport, CUDA and OpenCL APIs, programming methods and performance optimization. It will enable students to understand the fundamental aspects of GPU programming and become proficient in a relatively short time. Extensive hands-on laboratories demonstrate how to apply common numerical methods using both native APIs and open source libraries. Other topics covered in the course include integrating the Intel Threading Building Blocks (TBB) abstraction layer with native GPU software APIs in addition to a GPU debugging primer.
The class brochure is available for download. To register, schedule an on-site session or contact nCore Design, go to http://www.ncoredesign.com/company/contact_us.
On September 30th NVIDIA unveiled its latest GPU architecture, codenamed “Fermi”. The first Fermi GPUs will contain 512 “CUDA Cores” and deliver more than 8x the double-precision floating-point throughput of their predecessor, the GT200 GPU. The architecture also incorporates error correcting (ECC) memories and caches, a new cache hierarchy, increased shared memory and register file sizes, and the ability to execute C++ programs.
From the press release:
SANTA CLARA, Calif. - Sep. 30, 2009 - NVIDIA Corp. today introduced its next generation CUDA™ GPU architecture, codenamed “Fermi”. An entirely new ground-up design, the “Fermi”™ architecture is the foundation for the world’s first computational graphics processing units (GPUs), delivering breakthroughs in both graphics and GPU computing.
“NVIDIA and the Fermi team have taken a giant step towards making GPUs attractive for a broader class of programs,” said Dave Patterson, director Parallel Computing Research Laboratory, U.C. Berkeley and co-author of Computer Architecture: A Quantitative Approach. “I believe history will record Fermi as a significant milestone.”
Presented at the company’s inaugural GPU Technology Conference, in San Jose, California, “Fermi” delivers a feature set that accelerates performance on a wider array of computational applications than ever before. Joining NVIDIA’s press conference was Oak Ridge National Laboratory who announced plans for a new supercomputer that will use NVIDIA® GPUs based on the “Fermi” architecture. “Fermi” also garnered the support of leading organizations including Bloomberg, Cray, Dell, HP, IBM and Microsoft.
Read the rest of this entry »
General-purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA’s CUDA, OpenCL, and Intel’s Ct. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design.
This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications including the full NVIDIA CUDA SDK and UIUC’s Parboil Benchmark Suite covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture) – a machine model and low-level virtual ISA that is representative of ISAs for data-parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification, and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch re-convergence and the prevalence of sharing between threads, and highlight opportunities for additional parallelism.
(Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili, A Characterization and Analysis of PTX Kernels. International Symposium on Workload Characterization (IISWC). 2009.)
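To give a flavor of what a control-flow metric for a data-parallel workload might look like, here is a small Python sketch. It is an illustrative assumption, not a metric definition taken from the paper or its emulator: it treats an instruction trace as a list of per-warp active-lane bitmasks and computes the average fraction of lanes doing useful work. The function name `simd_efficiency` and the trace format are hypothetical.

```python
def simd_efficiency(active_masks, warp_size=32):
    """Average fraction of active lanes per issued warp instruction.

    active_masks: one integer bitmask per issued instruction, where
    bit k is set if lane k executed that instruction.
    A value of 1.0 means no divergence; lower values mean that
    divergent branches left lanes idle.
    """
    if not active_masks:
        raise ValueError("trace must contain at least one instruction")
    limit = 1 << warp_size
    total_active = 0
    for mask in active_masks:
        if mask < 0 or mask >= limit:
            raise ValueError("mask does not fit in the warp size")
        total_active += bin(mask).count("1")
    return total_active / (warp_size * len(active_masks))
```

For a 4-lane warp that executes one instruction fully, then each side of a divergent branch with half the lanes, then re-converges, `simd_efficiency([0b1111, 0b0011, 0b1100, 0b1111], warp_size=4)` evaluates to 0.75. Earlier re-convergence would shorten the half-occupied portion of the trace and raise the value.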
A public beta release of the CUDA-enabled Fortran Compiler from PGI enables programmers to write code in Fortran for NVIDIA CUDA GPUs. From a press release:
What: NVIDIA today announced that a public beta release of the PGI® CUDA-enabled Fortran compiler is now available. Developed in collaboration with The Portland Group®, it is the first Fortran compiler compatible with NVIDIA® CUDA™-enabled graphics processing units (GPUs).
A compiler is a software tool that translates applications from the high-level programming languages used by software developers into a binary form a computer can execute.
Why: GPU computing with the CUDA C-compiler has gained significant momentum in the High-Performance Computing (HPC) space as it enables developers to get transformative increases in performance with minimal coding required.
Fortran is particularly well suited to numeric computation and scientific computing and remains widely used in a wide range of applications such as weather modeling, computational fluid dynamics and seismic processing.
Where can I get it?: Read the rest of this entry »
OpenCurrent is an open source C++ library for solving Partial Differential Equations (PDEs) over regular grids using the CUDA platform from NVIDIA. It breaks down a PDE into three basic objects: “Grids,” “Solvers,” and “Equations.” “Grid” data structures efficiently implement regular 1D, 2D, and 3D arrays in both double and single precision. Grids support operations like computing linear combinations, managing host-device memory transfers, interpolating values at non-grid points, and performing array-wide reductions. “Solvers” use these data structures to calculate terms arising from discretizations of PDEs, such as finite-difference based advection and diffusion schemes, and a multigrid solver for Poisson equations. These computational building blocks can be assembled into complete “Equation” objects that solve time-dependent PDEs. One such Equation solver is an incompressible Navier-Stokes solver that uses a second-order Boussinesq model. This equation solver is fully validated, and has been used to study Rayleigh-Bénard convection under a variety of different regimes. Benchmarks show it to perform about 8 times faster than an equivalent Fortran code running on an 8-core Xeon.
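As a rough idea of the kind of term a finite-difference diffusion scheme computes on a regular grid, here is a minimal Python sketch. It does not use or mirror OpenCurrent's API: the function `diffusion_step`, its parameters, the 1D grid, the explicit forward-Euler time stepping, and the fixed boundary values are all assumptions made for illustration.

```python
def diffusion_step(u, alpha, dx, dt):
    """One explicit step of u_t = alpha * u_xx on a regular 1D grid.

    u: list of grid values; the first and last entries are held fixed
    (Dirichlet boundaries). Returns a new list and leaves `u` unchanged.
    """
    r = alpha * dt / (dx * dx)
    if r > 0.5:
        # Forward Euler with central differences is unstable beyond this limit.
        raise ValueError("time step too large: need alpha*dt/dx^2 <= 0.5")
    new_u = list(u)
    for i in range(1, len(u) - 1):
        # Second-order central difference approximation of u_xx.
        new_u[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new_u
```

Calling `diffusion_step([0, 0, 1, 0, 0], alpha=1.0, dx=1.0, dt=0.25)` spreads the central spike to `[0, 0.25, 0.5, 0.25, 0]`. Each interior point depends only on its immediate neighbors from the previous step, which is what makes this kind of stencil a natural fit for one-thread-per-cell GPU execution.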
Read the rest of this entry »