<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Parallel Programming :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/parallel-programming/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Tue, 22 May 2012 08:44:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Efficient Synchronization Primitives for GPUs</title>
		<link>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus</link>
		<comments>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus#comments</comments>
		<pubDate>Sat, 22 Oct 2011 10:38:43 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4066</guid>
		<description><![CDATA[Abstract: In this paper, we revisit the design of synchronization primitives&#8212;specifically barriers, mutexes, and semaphores&#8212;and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In this paper, we revisit the design of synchronization primitives&#8212;specifically barriers, mutexes, and semaphores&#8212;and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention.</p></blockquote>
<p>(Jeff A. Stuart and John D. Owens: <em>&#8220;Efficient Synchronization Primitives for GPUs&#8221;</em>, submitted October 2011. [<a href="http://arxiv.org/abs/1110.4623" target="_blank">ARXIV</a>]).</p>
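<p>To make the abstract's main principle concrete (limit the number of atomic accesses per synchronization operation), here is a small CPU-side simulation in Python. It is only a sketch under stated assumptions, not the paper's CUDA code: the function <code>simulate_lock</code>, its parameters, and the lock-step timing model are all hypothetical, and the test-before-atomic strategy shown is the textbook test-and-test-and-set idea rather than an algorithm taken from the paper.</p>
<pre>
```python
def simulate_lock(num_threads, hold_steps, test_before_atomic):
    """Count memory operations while every thread acquires a spin lock once.

    Hypothetical lock-step model: in each time step all waiting threads
    observe the same lock value, then try to acquire the lock in index order.
    Returns a tuple (atomic_ops, plain_reads).
    """
    if hold_steps < 1:
        raise ValueError("hold_steps must be at least 1")

    locked = False
    hold_left = 0
    waiting = list(range(num_threads))
    atomic_ops = 0
    plain_reads = 0

    while waiting:
        # The current holder releases with a plain store when its time is up.
        if locked:
            hold_left -= 1
            if hold_left == 0:
                locked = False

        seen_locked = locked  # value every waiter observes in this step
        for tid in list(waiting):
            if test_before_atomic:
                plain_reads += 1      # cheap regular read
                if seen_locked:
                    continue          # keep spinning without an atomic
            atomic_ops += 1           # expensive atomic exchange attempt
            if not locked:
                locked = True         # this thread wins the lock
                hold_left = hold_steps
                waiting.remove(tid)

    return atomic_ops, plain_reads


if __name__ == "__main__":
    print("atomic only:       ", simulate_lock(4, 3, False))
    print("read, then atomic: ", simulate_lock(4, 3, True))
```
</pre>
<p>In this model, spinning on a regular read and issuing the atomic only when the lock looks free trades most atomic attempts for plain reads. Whether that trade pays off on a real GPU depends on its memory system, which is what the paper's benchmarks measure.</p>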
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/10/22/efficient-synchronization-primitives-for-gpus/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GMAC 0.0.20 Released</title>
		<link>http://gpgpu.org/2011/02/10/gmac-0-0-20-released</link>
		<comments>http://gpgpu.org/2011/02/10/gmac-0-0-20-released#comments</comments>
		<pubDate>Fri, 11 Feb 2011 03:52:11 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3250</guid>
		<description><![CDATA[GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerators&#8217; (GPUs&#8217;) memories. Moreover, the coherency of the data is automatically handled by the library. This removes the [...]]]></description>
			<content:encoded><![CDATA[<p>GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerators&#8217; (GPUs&#8217;) memories. Moreover, the coherency of the data is automatically handled by the library. This removes the necessity for manual memory transfers (cudaMemcpy) between the host and GPU memories. Furthermore, GMAC assigns a different &#8220;virtual GPU&#8221; to each host thread, and the virtual GPUs are evenly mapped to physical GPUs. This is especially useful for multi-GPU programs since each host thread can access the memory of all GPUs and simple GPU-to-GPU transfers can be performed with simple memcpy calls.<span id="more-3250"></span></p>
<p>GMAC is being developed by the Operating Systems Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois under the University of Illinois/NCSA Open Source License.</p>
<p>Release notes for GMAC 0.0.20:</p>
<ul>
<li>Complete rewrite of the code</li>
<li>Added unit testing to the code</li>
<li>Automatic PCIe and disk I/O transfer overlapping (interposition of the fread/fwrite functions)</li>
<li>Automatic hostToDevice and deviceToHost overlapping in GPU to GPU transfers</li>
<li>Optimized version of the MPI_Sendrecv/MPI_Send/MPI_Recv functions (through interposition)</li>
</ul>
<p>The project is hosted at http://code.google.com/p/adsm/. There you can find the source code and documentation.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/02/10/gmac-0-0-20-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>rCUDA 1.0 released</title>
		<link>http://gpgpu.org/2010/04/05/rcuda-1-0-released</link>
		<comments>http://gpgpu.org/2010/04/05/rcuda-1-0-released#comments</comments>
		<pubDate>Tue, 06 Apr 2010 00:38:12 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2234</guid>
		<description><![CDATA[The GAP (Universidad Politécnica de Valencia, Spain) and HPCA (Universidad Jaume I, Spain) research groups are proud to announce the public release of rCUDA 1.0. The rCUDA Framework enables the concurrent usage of CUDA-compatible devices remotely by employing the sockets API for communication between clients and servers. Thus, it can be useful in three different [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.gap.upv.es" target="_blank">GAP</a> (<a href="http://www.upv.es" target="_blank">Universidad Politécnica de Valencia</a>, Spain) and <a href="http://www.hpca.uji.es" target="_blank">HPCA</a> (<a href="http://www.uji.es" target="_blank">Universidad Jaume I</a>, Spain) research groups are proud to announce the public release of rCUDA 1.0. The rCUDA Framework enables the concurrent usage of CUDA-compatible devices remotely by employing the sockets API for communication between clients and servers. Thus, it can be useful in three different environments:</p>
<ul>
<li>Clusters. To reduce the number of GPUs installed in High Performance Clusters. This leads to energy savings, as well as other related savings like acquisition costs, maintenance, space, cooling, etc.</li>
<li>Academia. In low performance networks, to offer access to a few high performance GPUs concurrently to all the students.</li>
<li>Virtual Machines. To enable the access to the CUDA facilities on the physical machine.</li>
</ul>
<p>The current version of rCUDA (v1.0) implements all functions in the CUDA Runtime API version 2.3, excluding OpenGL and Direct3D interoperability. rCUDA 1.0 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides. The framework is free for any purpose under the terms and conditions of the GNU GPL/LGPL (where applicable) licenses.</p>
<p>For additional information, visit the <a href="http://www.hpca.uji.es/?q=node/36" target="_blank">rCUDA web page</a> or <a href="http://www.gap.upv.es/~apenya" target="_blank">Antonio Peña&#8217;s webpage</a>.</p>
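<p>The client/server idea described above (forwarding API calls over sockets to a machine that owns the device) can be sketched in a few lines of Python. This is an illustration only: the message format, the <code>vector_add</code> call, and the functions <code>serve_one_call</code>, <code>remote_call</code> and <code>demo</code> are hypothetical and say nothing about rCUDA's actual protocol. A local socket pair stands in for the network, and a list addition stands in for GPU work.</p>
<pre>
```python
import json
import socket
import threading


def serve_one_call(conn):
    """Hypothetical server side: read one request, execute it, reply."""
    with conn, conn.makefile("rw") as stream:
        request = json.loads(stream.readline())
        if request["call"] == "vector_add":
            a, b = request["args"]
            # Stand-in for work that would run on the remote device.
            result = [x + y for x, y in zip(a, b)]
        else:
            result = None  # unknown call
        stream.write(json.dumps({"result": result}) + "\n")
        stream.flush()


def remote_call(conn, call, *args):
    """Hypothetical client stub: forward the call and wait for the answer."""
    with conn, conn.makefile("rw") as stream:
        stream.write(json.dumps({"call": call, "args": args}) + "\n")
        stream.flush()
        return json.loads(stream.readline())["result"]


def demo():
    # A connected socket pair replaces a real client/server TCP connection.
    client_sock, server_sock = socket.socketpair()
    server = threading.Thread(target=serve_one_call, args=(server_sock,))
    server.start()
    result = remote_call(client_sock, "vector_add", [1, 2, 3], [10, 20, 30])
    server.join()
    return result


if __name__ == "__main__":
    print(demo())
```
</pre>
<p>The application only sees the stub function. Where the work actually runs is hidden behind the socket, which is the property that lets several clients share a few remote devices.</p>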
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/04/05/rcuda-1-0-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thrust v1.2 Released</title>
		<link>http://gpgpu.org/2010/03/23/thrust-v1-2-released</link>
		<comments>http://gpgpu.org/2010/03/23/thrust-v1-2-released#comments</comments>
		<pubDate>Tue, 23 Mar 2010 12:01:14 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[C/C++]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2218</guid>
		<description><![CDATA[Version 1.2 of Thrust, an open-source template library for developing CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. This version adds several new features, including: support for multicore CPUs via OpenMP; support for CUDA 3.0 and new GPUs based [...]]]></description>
			<content:encoded><![CDATA[<p>Version 1.2 of <a href="http://thrust.googlecode.com/" target="_blank">Thrust</a>, an open-source template library for developing CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. This version adds several new features, including:</p>
<ul>
<li><a href="http://code.google.com/p/thrust/wiki/DeviceBackends" target="_blank">support for multicore CPUs via OpenMP</a></li>
<li>support for CUDA 3.0 and new GPUs based on the Fermi architecture</li>
<li>support for the Ocelot virtual machine</li>
<li><a href="http://thrust.googlecode.com/svn/tags/1.2.0/doc/html/group__random.html" target="_blank">pseudo random number generation</a></li>
<li><a href="http://thrust.googlecode.com/svn/tags/1.2.0/doc/html/group__reductions.html" target="_blank">key-value reduction</a></li>
<li><a href="http://thrust.googlecode.com/svn/tags/1.2.0/doc/html/group__set__operations.html" target="_blank">set intersection</a></li>
<li><a href="http://code.google.com/p/thrust/source/browse/tags/1.2.0/CHANGELOG" target="_blank">and many more</a></li>
</ul>
<p>The <a href="http://thrust.googlecode.com/" target="_blank">Thrust web page</a> provides a <a href="http://code.google.com/p/thrust/wiki/QuickStartGuide" target="_blank">quick-start guide</a>, <a href="http://code.google.com/p/thrust/wiki/Documentation" target="_blank">online documentation</a>, many <a href="http://thrust.googlecode.com/files/" target="_blank">examples</a> and introductory slides. Thrust is open-source software distributed under the <a href="http://www.opensource.org/licenses/apache2.0.php" target="_blank">OSI-approved Apache License v2.0</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/03/23/thrust-v1-2-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs</title>
		<link>http://gpgpu.org/2010/02/02/fcuda-sasp</link>
		<comments>http://gpgpu.org/2010/02/02/fcuda-sasp#comments</comments>
		<pubDate>Wed, 03 Feb 2010 00:31:15 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[FPGAs]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Static Program Analysis]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2105</guid>
		<description><![CDATA[Abstract: As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore&#8217;s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore&#8217;s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application&#8217;s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.</p></blockquote>
<p>(Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong and Wen-Mei W. Hwu, <em>FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs</em>, Proceedings of the 7th Symposium on Application Specific Processors, pp.35-42, July 2009. DOI: <a href="http://dx.doi.org/10.1109/SASP.2009.5226333" target="_blank">10.1109/SASP.2009.5226333</a>)</p>
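<p>The abstract mentions turning SPMD thread blocks into C code. One common way to serialize such a block, shown here as a Python sketch, is to wrap the per-thread body in explicit loops over the thread index and to split those loops at each barrier. This illustrates the general thread-loop technique only; it is an assumption that FCUDA's generated code has a similar shape, and the function <code>block_as_thread_loops</code> and its toy kernel are hypothetical.</p>
<pre>
```python
def block_as_thread_loops(block_input):
    """Run one thread block of a toy SPMD kernel as explicit thread loops.

    Toy per-thread kernel being serialized (tid is the thread index):
        shared[tid] = 2 * block_input[tid]
        barrier()
        out[tid] = shared[tid] + shared[(tid + 1) % n]
    """
    n = len(block_input)
    shared = [0] * n
    out = [0] * n

    # Region before the barrier: one loop iteration per thread.
    for tid in range(n):
        shared[tid] = 2 * block_input[tid]

    # The barrier becomes a loop boundary: every thread has finished the
    # first region before any thread starts the second one.
    for tid in range(n):
        out[tid] = shared[tid] + shared[(tid + 1) % n]

    return out


if __name__ == "__main__":
    print(block_as_thread_loops([1, 2, 3, 4]))
```
</pre>
<p>Each loop has independent iterations, so a synthesis tool or compiler is free to unroll or parallelize it, while the loop boundary preserves the ordering that the barrier guaranteed.</p>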
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/02/02/fcuda-sasp/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NVIDIA Introduces Nexus Integrated GPU/CPU Development Environment for Microsoft Visual Studio</title>
		<link>http://gpgpu.org/2009/10/04/nvidia-nexus-integrated-development-environment</link>
		<comments>http://gpgpu.org/2009/10/04/nvidia-nexus-integrated-development-environment#comments</comments>
		<pubDate>Sun, 04 Oct 2009 22:51:39 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Debugging]]></category>
		<category><![CDATA[NVIDIA]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Profiling]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=1926</guid>
		<description><![CDATA[From the press release: NVIDIA Corp. today introduced NVIDIA® Nexus, the industry&#8217;s first development environment for massively parallel computing that is integrated into Microsoft Visual Studio, the world&#8217;s most popular development environment for Windows-based solutions and Web applications and services. &#8220;NVIDIA Nexus is going to improve programmer productivity immediately,&#8221; said Tarek El Dokor at Edge [...]]]></description>
			<content:encoded><![CDATA[<p>From the <a href="http://www.nvidia.com/object/pr_nexus_093009.html" target="_blank">press release</a>:</p>
<blockquote><p>NVIDIA Corp. today introduced NVIDIA® Nexus, the industry&#8217;s first development environment for massively parallel computing that is integrated into Microsoft Visual Studio, the world&#8217;s most popular development environment for Windows-based solutions and Web applications and services.</p>
<p>&#8220;NVIDIA Nexus is going to improve programmer productivity immediately,&#8221; said Tarek El Dokor at Edge 3 Technologies. &#8220;An integrated GPU and CPU development solution is something Edge 3 has needed for a long time. The fact that it&#8217;s integrated into the Visual Studio development environment drastically reduces the learning curve.&#8221;</p>
<p>NVIDIA Nexus radically improves productivity by enabling developers of GPU computing applications to use the popular Microsoft Visual Studio-based tools and workflow in a transparent manner, without having to create a separate version of the application that incorporates diagnostic software calls. NVIDIA Nexus also includes the ability to run the code remotely on a different computer. Nexus includes advanced tools for simultaneously analyzing efficiency, performance, and speed of both the graphics processing unit (GPU) and central processing unit (CPU) to give developers immediate insight into how co-processing affects their applications.</p>
<p>Nexus is composed of three components:</p>
<p><span id="more-1926"></span></p>
<ul>
<li>The Nexus Debugger is a source code debugger for GPU source code, such as CUDA C, HLSL and DirectCompute. It supports source breakpoints, data breakpoints and direct GPU memory inspection. All debugging is performed directly on the hardware.</li>
<li>The Nexus Analyzer is a system-wide performance tool for viewing GPU events (kernels, API calls, memory transfers) and CPU events (core allocation, threads and process events and waits), all on a single, correlated timeline.</li>
<li>The Nexus Graphics Inspector provides developers the ability to debug and profile frames rendered using APIs such as Direct3D. Developers can use the Graphics Inspector to scrub through draw calls, look at any textures, vertex buffers, and API state in the entire frame.</li>
</ul>
<p>The NVIDIA Nexus supports Windows 7 and Windows Vista operating systems and full integration within Visual Studio (2008 SP1 standard edition or later).</p>
<p>A BETA version of NVIDIA Nexus is scheduled to be available on Oct. 15. For more information on NVIDIA Nexus or to register as a developer, please visit: <a href="http://www.nvidia.com/nexus" target="_blank">www.nvidia.com/nexus</a>. Both standard and professional versions of NVIDIA Nexus will be available upon final release.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2009/10/04/nvidia-nexus-integrated-development-environment/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Converging Design Features in CPUs and GPUs</title>
		<link>http://gpgpu.org/2007/01/22/converging-design-features-in-cpus-and-gpus</link>
		<comments>http://gpgpu.org/2007/01/22/converging-design-features-in-cpus-and-gpus#comments</comments>
		<pubDate>Mon, 22 Jan 2007 23:46:00 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Press]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://www.gpgpu.org/cgi-bin/blosxom.cgi/Press/papakiposHPCWire06.html</guid>
		<description><![CDATA[This article at HPC Wire by Matthew Papakipos, CTO of PeakStream Technologies, discusses the convergence of CPU and GPU architectures, the programming challenges architecture changes pose, and possible solutions to these challenges.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hpcwire.com/hpc/1209133.html">This article</a> at <a href="http://www.hpcwire.com">HPC Wire</a> by Matthew Papakipos, CTO of PeakStream Technologies, discusses the convergence of CPU and GPU architectures, the programming challenges architecture changes pose, and possible solutions to these challenges.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2007/01/22/converging-design-features-in-cpus-and-gpus/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

