All talks from the 2010 GPU Technology Conference (as well as archived presentations from GTC 2009) are now available from NVIDIA.
For those who missed this year’s GPU Technology Conference (GTC), and for those who attended but had a hard time choosing between the concurrent sessions, NVIDIA has publicly released streamed recordings, video and slides from most GTC sessions.
There is content available for all types of programmers and developers. Those just getting started programming GPUs may want to take a look at the pre-conference tutorials, which provide an in-depth look at topics such as CUDA C, OpenCL, OpenGL and Parallel Nsight.
From a press release:
SANTA CLARA, CA — (Marketwire) — 10/28/2010 — Tianhe-1A, a new supercomputer revealed today at HPC 2010 China, has set a new performance record of 2.507 petaflops, as measured by the LINPACK benchmark, making it the fastest system in China and in the world today.
Tianhe-1A epitomizes modern heterogeneous computing by coupling massively parallel GPUs with multi-core CPUs, enabling significant achievements in performance, size and power. The system uses 7,168 NVIDIA® Tesla™ M2050 GPUs and 14,336 CPUs; it would require more than 50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone.
GPU Systems has released Matlab language bindings for the Libra SDK, its heterogeneous compute platform. Libra version 1.2, with its runtime compiler and environment, supports x86/x64, OpenGL, OpenCL and CUDA compute backends. This release brings a full BLAS 1, 2, 3 math library (matrix/vector, dense/sparse, real/complex, single/double) and extended functionality to the Matlab computing platform, executing on x86 CPUs and on GPUs from AMD and NVIDIA.
“Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU” October 27th, 2010
In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using operations involving dot-products and additions. We implement this algorithm on an NVIDIA Fermi GPU (Tesla 2050) using CUDA, and also manually parallelize it for the Intel Xeon X5680 (Westmere) and IBM Power7 multi-core processors. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version on the multi-core CPUs. A number of optimizations were performed for the GPU implementation on the Fermi, including blocking for Fermi’s configurable on-chip memory architecture. Experimental results illustrate that on a single multi-core processor, the manually parallelized versions of the correlation application perform only a small factor slower than the CUDA version executing on the Fermi: 1.005s on Power7, 3.49s on Intel X5680, and 465ms on Fermi. On a two-processor Power7 system, performance approaches that of the Fermi (650ms), while the Intel version runs in 1.78s. These results conclusively demonstrate that performance of the GPU memory subsystem is critical to effectively harness its computational capabilities. For the correlation application, a significantly higher amount of effort was put into developing the GPU version when compared to the CPU ones (several days against a few hours). Our experience presents compelling evidence that performance comparable to that of GPUs can be achieved with much greater productivity on modern multi-core CPUs.
(R. Bordawekar and U. Bondhugula and R. Rao: “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU”, Technical Report, IBM T. J. Watson Research Center, 2010 [PDF])
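To make the "dot-products and additions" in the abstract concrete, here is a minimal Python sketch of a correlation score between an image and a reference. It is not the report's implementation: the choice of a zero-shift normalized cross-correlation, and the names `dot` and `correlation_score`, are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equally sized 2-D matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def correlation_score(image, reference):
    """Zero-shift normalized cross-correlation (illustrative assumption).

    Returns a value in [-1, 1]; 1.0 means the image equals the reference
    up to a positive scale factor.
    """
    norm = math.sqrt(dot(image, image) * dot(reference, reference))
    if norm == 0.0:
        return 0.0  # an all-zero image carries no signal to correlate
    return dot(image, reference) / norm


reference = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0, 4.0], [6.0, 8.0]]       # same pattern, twice as bright
different = [[4.0, -3.0], [2.0, -1.0]]  # orthogonal to the reference

print(correlation_score(scaled, reference))
print(correlation_score(different, reference))
```

Each score costs three dot products over the full image, which is why the workload is FLOP-intensive and also why memory traffic matters: every pixel is read several times unless the computation is blocked to fit in on-chip memory.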
Researchers in industry and academia are invited to submit their latest research results to the “Reconfigurable and GPU Computing” track at the 9th ACS/IEEE (pending approval) International Conference on Computer Systems and Applications (AICCSA 2011). The conference website is http://www.aiccsa.org. The deadline for submission is Nov. 8, 2010.
Reconfigurable & GPGPU topics include:
- Algorithms and mathematical applications
- Languages and system software
- Hardware implementation and supporting technologies
- Theoretical models and performance estimation
- Simulation environments and prototyping
- Case studies and comparisons of real-life technologies
- Run time reconfiguration
- Energy efficiency
- Architectural issues and tradeoffs
- Hybrid GPU/reconfigurable systems
- Hardware accelerators
From a recent press release:
ACUSIM Software, Inc., a leader in computational fluid dynamics (CFD) technology and solutions, today announced the immediate availability of AcuSolve™ 1.8, the latest version of ACUSIM’s leading general-purpose, finite-element based CFD solver. ACUSIM will demonstrate AcuSolve 1.8 during two free webinars, taking place at 9:30 a.m. – 10:30 a.m. ET and 6:30 p.m. – 7:30 p.m. ET, on Oct. 26, 2010, at http://www.acusim.com/html/events.html.
Used by designers and research engineers with all levels of expertise, AcuSolve is highly differentiated by its accelerated speed, robustness, accuracy and multiphysics/multidisciplinary capabilities. Contributing to its robustness is the product’s Galerkin/Least-Square (GLS) finite element formulation and novel iterative linear equation solver for the fully coupled equation system. The combination of these two powerful technologies provides a highly stable and efficient solver, capable of handling unstructured meshes with tight boundary layers automatically generated from complex industrial geometries.
IMPETUS Afea is proud to announce the launch of IMPETUS Afea Solver (version 1.0).
The IMPETUS Afea Solver is a non-linear explicit finite element tool. It is developed to predict large deformations of structures and components exposed to extreme loading conditions. The tool is applicable to transient dynamics and quasi-static loading conditions. The primary focus of the IMPETUS Afea Solver is accuracy, robustness and simplicity for the user. The number of purely numerical parameters that the user has to provide as input is kept at a minimum. The IMPETUS Afea Solver is adapted to GPU technology; utilizing the computational force of a potent graphics card can considerably speed up your calculations.
For more information or requests please contact firstname.lastname@example.org
Michael Feldman of HPCWire writes:
MATLAB users with a taste for GPU computing now have a perfect reason to move up to the latest version. Release R2010b adds native GPGPU support that allows users to harness NVIDIA graphics processors for engineering and scientific computing. The new capability is provided within the Parallel Computing Toolbox and Distributed Computing Server.
[Editor's Note: as pointed out in the comments by John Melonakos (from AccelerEyes), it may be worth checking out how MATLAB 2010b GPU support currently compares to AccelerEyes Jacket.]
This webinar series is designed to help advance your OpenCL programming knowledge. Experts from AMD will cover both beginning and advanced topics starting with the basics of parallel and heterogeneous computing and an introduction to OpenCL, then progressing to more advanced topics such as performance optimization techniques and real world case studies.
This webinar describes how heterogeneous computing fits into the parallel computing paradigm, what problems it solves and what opportunities it presents.
We present benchmark results of optimized dense matrix multiplication kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show 73% and 87% of the theoretical performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s. This performance in DDP is more than 200 times faster than the performance in DDP on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with the main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques. While the conventional wisdom of GPU programming recommends heavy use of shared memory on GPUs, we show that the texture cache is very effective on the Cypress architecture.
(N. Nakasato: “A Fast GEMM Implementation on a Cypress GPU”, 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), November 2010. A sample program is available at http://github.com/dadeba/dgemm_cypress)
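For readers unfamiliar with double-double (DDP) precision, the idea is to represent one value as an unevaluated sum of two doubles, `hi + lo`, where `lo` holds the rounding error that `hi` cannot. The Python sketch below shows a simplified double-double addition built on Knuth's error-free `two_sum`. It is only an illustration of the number format; it is not taken from the paper's kernels or from mpack, and the function names are hypothetical.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x, y):
    """Add two double-double values given as (hi, lo) pairs (simplified variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]          # fold in the low-order parts
    hi = s + e                # renormalize so that |lo| stays tiny relative to hi
    lo = e - (hi - s)
    return hi, lo


# Adding 1e-20 to 1.0 ten times: a plain double loses every increment,
# while the double-double pair keeps them in its low word.
plain = 1.0
dd = (1.0, 0.0)
for _ in range(10):
    plain += 1e-20
    dd = dd_add(dd, (1e-20, 0.0))

print(plain)
print(dd)
```

Every double-double addition expands into roughly ten ordinary floating-point operations with no branches, which fits well with hardware that has high double-precision throughput.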