<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Papers :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/papers/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Wed, 01 Feb 2012 07:56:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Using GPUs to Accelerate Installed Antenna Performance Simulations</title>
		<link>http://gpgpu.org/2012/01/09/installed-antenna-performance-simulations</link>
		<comments>http://gpgpu.org/2012/01/09/installed-antenna-performance-simulations#comments</comments>
		<pubDate>Mon, 09 Jan 2012 09:48:46 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[CEM]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Ray Tracing]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4359</guid>
		<description><![CDATA[Abstract: Savant is an asymptotic ray-tracing CEM tool used to predict the performance of antennas installed on electrically large platforms, including far-field antenna patterns, near-field distributions, and antenna-to-antenna coupling. Savant is based on the shooting and bouncing rays (SBR) formulation. While asymptotic solvers like Savant have significantly smaller computational and memory requirements for electrically large [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Savant is an asymptotic ray-tracing CEM tool used to predict the performance of antennas installed on electrically large platforms, including far-field antenna patterns, near-field distributions, and antenna-to-antenna coupling. Savant is based on the shooting and bouncing rays (SBR) formulation. While asymptotic solvers like Savant have significantly smaller computational and memory requirements for electrically large problems than full-wave techniques, the computation costs still increase significantly with frequency and simulation fidelity, and such solvers benefit greatly from parallelization techniques. Graphics processing units (GPUs) are throughput-oriented processing devices that are well suited for the mathematically intensive workloads found in CEM solvers. Current GPUs contain hundreds of processing units, leverage thousands of threads, and can execute over one trillion floating-point operations per second. A hybrid CPU and GPU parallelization approach has been developed for Savant, providing significant speedups compared to CPU-only implementations. Results from the execution of GPU-accelerated Savant on multiple case studies will be presented.</p></blockquote>
<p>(T. Courtney, J. E. Stone and R. Kipp, <em>“Using GPUs to Accelerate Installed Antenna Performance Simulations,”</em> Proc. Allerton Antenna Symposium, Sept. 2011, Monticello, IL. [<a title="direct link to PDF" href="http://www.delcross.com/publications/SavantGPU-Allerton2011.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/01/09/installed-antenna-performance-simulations/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures</title>
		<link>http://gpgpu.org/2011/12/14/acceleration-of-wavefront-applications</link>
		<comments>http://gpgpu.org/2011/12/14/acceleration-of-wavefront-applications#comments</comments>
		<pubDate>Wed, 14 Dec 2011 09:26:00 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[High-Performance Computing]]></category>
		<category><![CDATA[Linear Algebra]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4264</guid>
		<description><![CDATA[Abstract: In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P).</p>
<p>Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.</p></blockquote>
<p>(Pennycook, S.J., Hammond, S.D., Mudalige, G.R., Wright, S.A. and Jarvis, S.A.: <em>&#8220;On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures&#8221;</em>,  The Computer Journal (in press) [<a href="http://dx.doi.org/10.1093/comjnl/bxr073" target="_blank">DOI</a>] [<a href="http://eprints.dcs.warwick.ac.uk/787/" target="_blank">PREPRINT</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/12/14/acceleration-of-wavefront-applications/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Accelerating for Star Catalogue Retrieval Algorithm using GPUs</title>
		<link>http://gpgpu.org/2011/11/16/star-catalogue-retrieval</link>
		<comments>http://gpgpu.org/2011/11/16/star-catalogue-retrieval#comments</comments>
		<pubDate>Wed, 16 Nov 2011 11:39:14 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Astronomy]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4164</guid>
		<description><![CDATA[Abstract: A GPU-based parallel star retrieval method is proposed to improve the efficiency of searching stars from star catalogue in computer simulation, especially when the FOV (Field of View) is large. By the novel algorithm, the stars in catalogue are classified and stored in different zones using latitude and longitude zoning method firstly. Based on [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>A GPU-based parallel star retrieval method is proposed to improve the efficiency of searching stars from star catalogue in computer simulation, especially when the FOV (Field of View) is large. By the novel algorithm, the stars in catalogue are classified and stored in different zones using latitude and longitude zoning method firstly. Based on the easily accessible star catalogue, the star zones that FOV covers can be computed exactly by constructing a spherical triangle around the FOV. As a result, the searching scope is reduced effectively. Finally, we use CUDA computation architecture to run the process of star retrieving from those star zones parallel on GPU. Experimental results show that, in comparison with CPU-oriented implementation, the proposed algorithm achieves up to tens of times speedup, and the processing time is limited within a millisecond level in large FOV and wide star magnitude span. It meets the requirement of real-time simulation.</p></blockquote>
<p>(Chao Li, Liqiang Zhang, Jiaze Wu, and Changwen Zheng, <em>&#8220;Parallel Accelerating for Star Catalogue Retrieval Algorithm using GPUs&#8221;</em>, Journal of Astronautics, 2012)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/16/star-catalogue-retrieval/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A fast algorithm of simulating star map for star sensor</title>
		<link>http://gpgpu.org/2011/11/16/star-sensor</link>
		<comments>http://gpgpu.org/2011/11/16/star-sensor#comments</comments>
		<pubDate>Wed, 16 Nov 2011 11:34:53 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Astronomy]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4163</guid>
		<description><![CDATA[Abstract: In order to test the function and performance of star sensor on the ground, a fast method for simulating star map is presented. The algorithm adopts instantaneous coordinate of star and improves the star searching efficiency by optimizing the zone partitioning method for star catalogue. We overcome the low accuracy of the latitude and [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In order to test the function and performance of star sensor on the ground, a fast method for simulating star map is presented. The algorithm adopts instantaneous coordinate of star and improves the star searching efficiency by optimizing the zone partitioning method for star catalogue. We overcome the low accuracy of the latitude and longitude’s span that FOV overlays by proposing a new spherical right-angled triangle method and the searching scope is reduced highly; meanwhile, the simulation model for star brightness is also built based on adopted star catalogue. Simulation study is conducted for the demonstration of the algorithm. The proposed approach meets the requirement of wide magnitude range and short simulation period.</p></blockquote>
<p>(Chao Li, Changwen Zheng, Jiaze Wu, and Liqiang Zhang, <em>&#8220;A fast algorithm of simulating star map for star sensor&#8221;</em>, Proceedings of the 3rd IEEE International Conference on Computer and Network Technology (IEEE ICCNT), 2011)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/16/star-sensor/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Accelerating GPU Kernels for Dense Linear Algebra</title>
		<link>http://gpgpu.org/2011/11/14/accelerating-gpu-kernels-for-dense-linear-algebra</link>
		<comments>http://gpgpu.org/2011/11/14/accelerating-gpu-kernels-for-dense-linear-algebra#comments</comments>
		<pubDate>Mon, 14 Nov 2011 07:51:44 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Dense Linear Algebra]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4148</guid>
		<description><![CDATA[Abstract: Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building blocks of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building blocks of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimization techniques – allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20x faster than the currently available kernels. We present these kernels and also show their acceleration effect to higher level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.</p></blockquote>
<p>(R. Nath, S. Tomov and J. Dongarra: <em>&#8220;Accelerating GPU Kernels for Dense Linear Algebra&#8221;</em>, VECPAR 2010. [<a href="http://icl.cs.utk.edu/projectsfiles/magma/pubs/Rajib_Nath_VECPAR10.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/14/accelerating-gpu-kernels-for-dense-linear-algebra/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Improved MAGMA GEMM For Fermi Graphics Processing Units</title>
		<link>http://gpgpu.org/2011/11/14/magma-gemm-fermi</link>
		<comments>http://gpgpu.org/2011/11/14/magma-gemm-fermi#comments</comments>
		<pubDate>Mon, 14 Nov 2011 07:45:32 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Dense Linear Algebra]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[NVIDIA FERMI]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4147</guid>
		<description><![CDATA[Abstract: We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We present an improved matrix–matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.</p></blockquote>
<p>(R. Nath, S. Tomov and J. Dongarra: <em>&#8220;An Improved MAGMA GEMM For Fermi Graphics Processing Units&#8221;</em>, International Journal of High Performance Computing Applications. 24(4), 511-515, 2010. [<a href="http://dx.doi.org/10.1177/1094342010385729" target="_blank">DOI</a>] [<a href="http://www.netlib.org/lapack/lawnspdf/lawn227.pdf" target="_blank">PREPRINT</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/14/magma-gemm-fermi/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MIDeA: A Multi-Parallel Intrusion Detection Architecture</title>
		<link>http://gpgpu.org/2011/11/03/midea-a-multi-parallel-intrusion-detection-architecture</link>
		<comments>http://gpgpu.org/2011/11/03/midea-a-multi-parallel-intrusion-detection-architecture#comments</comments>
		<pubDate>Thu, 03 Nov 2011 09:35:41 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Network Intrusion Detection]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Pattern Matching]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4111</guid>
		<description><![CDATA[Abstract: Network intrusion detection systems are faced with the challenge of identifying diverse attacks, in extremely high speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly-complex per-packet and per-flow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high speed networks. To cope with the [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Network intrusion detection systems are faced with the challenge of identifying diverse attacks, in extremely high speed networks. For this reason, they must operate at multi-Gigabit speeds, while performing highly-complex per-packet and per-flow data processing. In this paper, we present a multi-parallel intrusion detection architecture tailored for high speed networks. To cope with the increased processing throughput requirements, our system parallelizes network traffic processing and analysis at three levels, using multi-queue NICs, multiple CPUs, and multiple GPUs. The proposed design avoids locking, optimizes data transfers between the different processing units, and speeds up data processing by mapping different operations to the processing units where they are best suited. Our experimental evaluation shows that our prototype implementation based on commodity off-the-shelf equipment can reach processing speeds of up to 5.2 Gbit/s with zero packet loss when analyzing traffic in a real network, whereas the pattern matching engine alone reaches speeds of up to 70 Gbit/s, which is an almost four times improvement over prior solutions that use specialized hardware.</p></blockquote>
<p>(Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis: <em>&#8220;MIDeA: A Multi-Parallel Intrusion Detection Architecture&#8221;</em>, Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), Oct. 2011. [<a href="http://dcs.ics.forth.gr/Activities/papers/midea.css11.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/03/midea-a-multi-parallel-intrusion-detection-architecture/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallelization and Characterization of Pattern Matching using GPUs</title>
		<link>http://gpgpu.org/2011/10/29/parallelization-and-characterization-of-pattern-matching-using-gpus</link>
		<comments>http://gpgpu.org/2011/10/29/parallelization-and-characterization-of-pattern-matching-using-gpus#comments</comments>
		<pubDate>Sat, 29 Oct 2011 09:35:02 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Pattern Matching]]></category>
		<category><![CDATA[String Matching]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4095</guid>
		<description><![CDATA[Abstract: Pattern matching is a highly computationally intensive operation used in a plethora of applications. Unfortunately, due to the ever increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. In this paper, we explore how the highly parallel computational capabilities [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Pattern matching is a highly computationally intensive operation used in a plethora of applications. Unfortunately, due to the ever increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. In this paper, we explore how the highly parallel computational capabilities of commodity graphics processing units (GPUs) can be exploited for high-speed pattern matching. We present the design, implementation, and evaluation of a pattern matching library running on the GPU, which can be used transparently by a wide range of applications to increase their overall performance. The library supports both string searching and regular expression matching on the NVIDIA CUDA architecture. We have also explored the performance impact of different types of memory hierarchies, and present solutions to alleviate memory congestion problems. The results of our performance evaluation using off-the-shelf graphics processors demonstrate that GPU-based pattern matching can reach tens of gigabits per second on different workloads.</p></blockquote>
<p>(Giorgos Vasiliadis, Michalis Polychronakis and Sotiris Ioannidis: <em>&#8220;Parallelization and Characterization of Pattern Matching using GPUs&#8221;</em>, Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). November 2011. [<a href="http://dcs.ics.forth.gr/Activities/papers/gpupattern.iiswc11.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/10/29/parallelization-and-characterization-of-pattern-matching-using-gpus/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIMD Re-convergence at Thread Frontiers: A new method for handling branch divergence on GPUs</title>
		<link>http://gpgpu.org/2011/10/24/simd-re-convergence</link>
		<comments>http://gpgpu.org/2011/10/24/simd-re-convergence#comments</comments>
		<pubDate>Mon, 24 Oct 2011 08:33:32 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Computer Architecture]]></category>
		<category><![CDATA[Hardware Design]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4071</guid>
		<description><![CDATA[Abstract: Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA,  OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program&#8217;s control flow is structured or unstructured. In this [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program&#8217;s control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a &#8220;Thread Frontier&#8221;, which is a statically bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5-633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.</p></blockquote>
<p>(Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu and Sudhakar Yalamanchili: <em>&#8220;SIMD Re-convergence at Thread Frontiers&#8221;</em>. 44th International Symposium on Microarchitecture (MICRO 44), 2011. [<a href="http://www.gdiamos.net/publications.php" target="_blank">WWW</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/10/24/simd-re-convergence/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Efficient Synchronization Primitives for GPUs</title>
		<link>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus</link>
		<comments>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus#comments</comments>
		<pubDate>Sat, 22 Oct 2011 10:38:43 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4066</guid>
		<description><![CDATA[Abstract: In this paper, we revisit the design of synchronization primitives&#8212;specifically barriers, mutexes, and semaphores&#8212;and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In this paper, we revisit the design of synchronization primitives&#8212;specifically barriers, mutexes, and semaphores&#8212;and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention.</p></blockquote>
<p>(Jeff A. Stuart and John D. Owens: <em>&#8220;Efficient Synchronization Primitives for GPUs&#8221;</em>, submitted October 2011. [<a href="http://arxiv.org/abs/1110.4623" target="_blank">ARXIV</a>]).</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

