June 8th, 2014
June 8th, 2014
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, emote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several “heterogeneous” scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
(A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, “On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications”. Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, ENERGY 2014, Chamonix (France), pp. 57–62, 20 – 24 April 2014. [PDF])
June 3rd, 2014
We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.
(Yuki Sugimoto, Fumihiko Ino, and Kenichi Hagihara: “Improving Cache Locality for GPU-based Volume Rendering”. Parallel Computing 40(5/6): 59-69, May 2014. [DOI])
June 3rd, 2014
Folding@Home is a large-scale volunteer distributed computing project started in 2000 by Vijay Pande, Stanford. For over a decade, Professor Pande’s group has increased the computing power of Folding@Home through the development of new software algorithms and infrastructure, such as the incorporation of new hardware innovations like GPUs. That tremendous computing power has enabled significant advances in the simulation and understanding of diseases like Alzheimer’s Disease, malaria, various cancers, and other diseases at the molecular scale. Professor Pande will give a brief introduction to Folding@Home and the successes in the project so far. He will also discuss plans to greatly enhance Folding@Home capabilities through new initiatives. This webinar is planned for June 3rd, 2014 at 9.00 AM Pacific Time. Register at: http://bit.ly/FolHome
May 27th, 2014
The conference focuses on the application of GPUs in High Energy Physics (HEP), expanding on the trend of previous workshops on the topic and pointing to establishing a recurrent series. The emerging paradigm of the use of graphic processors as powerful accelerators in data- and computation-intensive applications found fertile ground in the computing challenges of the HEP community and is currently object of active investigations. This follows a long established trend which sees the increased use of cheap off-the-shelf commercial units to achieve unprecedented performances in parallel data processing, thus leveraging on a very strong commitment of hardware producers to the huge market of computer graphics and games. These hardware advances comes together with the continuous development of proprietary and free software to expose the raw computing power of GPUs for general-purpose applications and scientific computing in particular. All different applications of massively parallel computing in HEP will be addressed, from computational speed-ups in online and offline data selection and analysis to hard real-time applications in low-level triggering, to MonteCarlo simulations for lattice QCD. Both current activities and plans for foreseen experiments and projects will be discussed, together with perspectives on the evolution of the hardware and software.
The conference is held in Pisa (Italy), 10.9.2014 – 12.9.2014. More information: http://www.pi.infn.it/gpu2014
May 27th, 2014
Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful GPUs to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm3 brain template in 4–6 s, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from github: https://github.com/wanderine/BROCCOLI.
(A. Eklund, P. Dufort, M. Villani and S. LaConte: “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs”. Front. Neuroinform. 8:24, 2014. [DOI])
May 27th, 2014
This master’s thesis by Markus Konrad analyzes the potentials of GPGPU on mobile devices such as smartphones or tablets. The question was, if and how the GPU on such devices can be used to speed up certain algorithms especially in the fields of image processing. GPU computing technologies such as OpenCL, OpenGL shaders, and Android RenderScript are assessed in the thesis. The abstract reads as follows:
This thesis studies how certain popular algorithms in the field of image and audio processing can be accelerated on mobile devices by means of parallel execution on their graphics processing unit (GPU). Several technologies with which this can be achieved are compared in terms of possible performance improvements, hardware and software support, as well as limitations of their programming model and functionality. The results of this research are applied in a practical project, consisting of performance improvements for marker detection in an Augmented Reality application for mobile devices.
The PDF is available for download and the source-code for some Android application prototypes is published on github.
May 18th, 2014
PARALUTION is a library for sparse iterative methods which can be performed on various parallel devices, including multi-core CPU, GPU (CUDA and OpenCL) and Intel Xeon Phi. The new 0.7.0 version provides the following new features:
- Windows support – full windows support for all backends (CUDA, OpenCL, OpenMP)
- Assembling function – new OpenMP parallel assembling function for sparse matrices (includes an update function for time-dependent problems)
- Direct (dense) solvers (for very small problems)
- (Restricted) Additive Schwarz preconditioners
- MATLAB/Octave plug-in
To avoid OpenMP overhead for small sized problems, the library will compute in serial if the size of the matrix/vector is below a pre-defined threshold. Internally, the OpenCL backend has been modified for simplified cross platform compilation.
May 15th, 2014
Join the free webinar on May 20th devoted to accelerating orthorectification, atmospheric correction, and transformations for big data with GPUs. Learn how GPU capabilities can improve time for processing large imagery 50-100 times faster. Amanda O’Connor, a Senior Solutions Engineer at Exelis will walk you through implementation of GPU processing for large imagery datasets, operational use of GPU processing for orthorectification and share benchmarks against desktop algorithms. To register follow this link: https://www2.gotomeeting.com/register/665929994.
May 15th, 2014
Boost.Compute v0.2 has been released! Boost.Compute is a header-only C++ library for GPGPU and parallel-computing based on OpenCL. It is available on GitHub and instructions for getting started can be found in the documentation. Since version 0.1 (released almost two months ago) new algorithms including unique(), search() and find_end() have been added, along with several bug fixes. See the project page on GitHub for more information: https://github.com/kylelutz/compute
A new version of the GPU-profiler for CUDA software stack is available at www.lab4241.com. The GPU-profiler is able to deliver per C++ source-code ‘inside’ kernel performance information in a simple, intuitive way, similar to known CPU domain profilers, like Quantify or Valgrind. The new version, GPUPROF version 0.3 (beta), includes improved stability, refined memory tracing, temporal memory analysis, and CUDA API-driver call tracing.
Page 2 of 108«12345...102030...»Last »