The following new webinars about NVIDIA Tesla K20 have been announced. During these live webinars, developers will be able to get answers directly from the presenters.
PARALUTION is a library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs. In particular, it incorporates fine-grained parallel preconditioners designed to expolit modern multi-/many-core devices. Based on C++, it provides a generic and flexible design and interface which allow seamless integration with other scientific software packages. The library is open source and released under GPL. Key features are:
- OpenMP, CUDA and OpenCL support
- No special hardware/library requirement
- Portable code and results across all hardware
- Many sparse matrix formats
- Various iterative solvers/preconditioners
- Generic and robust design
- Plug-in for the finite element package Deal.II
- Documentation: user manual (pdf), reports, doxygen
More information, including documentation and case studies, is available at http://www.paralution.com.
A free, pre-alpha release of Lab4241′s GPGPU profiler is now available at www.lab4241.com. It provides source-code-line performance profiling for C or C++ code and CUDA kernels in a non-intrusive way. The profiler enables the developer to a seamless evaluation of used GPU resources (execution counts, memory access, branch diversions, etc.) per source-line, along with result evaluation in a simple, intuitive GUI, similar as with known CPU profilers like Quantify or valgrind.
This class teaches the fundamentals of parallel computing with the GPU and the CUDA programming environment. Examples are based on a series of image processing algorithms, such as those in Photoshop or Instagram. Programming and running assignments on high-end GPUs is possible, even if you don’t own one yourself. The course started Monday 4th Feb 2013 so there is still time to join. More information and enrollment: https://www.udacity.com/course/cs344.
The paper presents techniques for generating very large finite-element matrices on a multicore workstation equipped with several graphics processing units (GPUs). To overcome the low memory size limitation of the GPUs, and at the same time to accelerate the generation process, we propose to generate the large sparse linear systems arising in finite-element analysis in an iterative manner on several GPUs and to use the graphics accelerators concurrently with CPUs performing collection and addition of the matrix fragments using a fast multithreaded procedure. The scheduling of the threads is organized in such a way that the CPU operations do not affect the performance of the process, and the GPUs are idle only when data are being transferred from GPU to CPU. This approach is verified on two workstations: the first consists of two 6-core Intel Xeon X5690 processors with two Fermi GPUs: each GPU is a GeForce GTX 590 with two graphics processors and 1.5 GB of fast RAM; the second workstation is equipped with two Tesla C2075 boards carrying 6 GB of RAM each and two 12-core Opteron 6174s. For the latter setup, we demonstrate the fast generation of sparse finite-element matrices as large as 10 million unknowns, with over 1 billion nonzero entries. Comparing with the single-threaded and multithreaded CPU implementations, the GPU-based version of the algorithm based on the ideas presented in this paper reduces the finite-element matrix-generation time in double precision by factors of 100 and 30, respectively.
(Dziekonski, A., Sypek, P., Lamecki, A. and Mrozowski, M.: “Generation of large finite-element matrices on multiple graphics processors”. International Journal on Numerical Methoths in Engineering, 2012, in press. [DOI])
In this paper, we present a scalable, numerically stable, high-performance tridiagonal solver. The solver is based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or 2-by-2 diagonal pivoting algorithm, which is also known to be numerically stable. Our paper makes two major contributions. First, our solver is the first numerically stable tridiagonal solver for GPUs. Our solver provides comparable quality of stable solutions to Intel MKL and Matlab, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. It is also scalable to multiple GPUs and CPUs. Second, we present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence.
(Chang, Li-Wen and Stratton, John A. and Kim, Hee-Seok and Hwu, Wen-mei W.: “A scalable, numerically stable, high-performance tridiagonal solver using GPUs”, Supercomputing 2012. [WWW])
AccelerEyes has released dates for their upcoming CUDA and OpenCL training courses.
- Feb 25-26, Houston, TX
- Mar 4-5, Washington D.C.
- Mar 25-26, Los Angeles, CA
- Apr 9-10, Seattle, WA
- Apr 15-16, San Francisco, CA
- Feb 27-28, Houston, TX
- Mar 6-7, Washington D.C.
- Mar 27-28, Los Angeles, CA
- Apr 11-12, Seattle, WA
- Apr 17-18, San Francisco, CA
More information can be found on the courses’ webpages.
Acceleware has recently announced four courses on parallel programming:
- OpenCL on AMD APU CPUs: Jan 29 to Feb 1, 2013, Chicago, IL and Apr 9 to Apr 12, 2013, Los Angeles, CAL
- 4 Day CUDA Course with an Oil and Gas focus: Mar 12 to Mar 15, 2013, Houston, TX
- 4 Day C++ AMP Training: Apr 23 to Apr 26, 2013, Seattle, WA
More information is available on the courses’ webpages.
We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.
This paper presents an efficient technique for fast generation of sparse systems of linear equations arising in computational electromagnetics in a finite element method using higher order elements. The proposed approach employs a graphics processing unit (GPU) for both numerical integration and matrix assembly. The performance results obtained on a test platform consisting of a Fermi GPU (1x Tesla C2075) and a CPU (2x twelve-core Opterons), indicate that the GPU implementation of the matrix generation allows one to achieve speedups by a factor of 81 and 19 over the optimized single-and multi-threaded CPU-only implementations, respectively.
(Adam Dziekonski et al., “Finite Element Matrix Generation on a GPU”, Progress In Electromagnetics Research 128:249-265, 2012. [PDF])