<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Parallel Algorithms :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/parallel-algorithms/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Mon, 06 Feb 2012 04:59:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>OpenCL Parallel Primitives Library</title>
		<link>http://gpgpu.org/2011/06/03/opencl-parallel-primitives-library</link>
		<comments>http://gpgpu.org/2011/06/03/opencl-parallel-primitives-library#comments</comments>
		<pubDate>Fri, 03 Jun 2011 11:10:24 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Sorting]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3627</guid>
		<description><![CDATA[clpp is an OpenCL library of data-parallel algorithm primitives such as parallel prefix sum (&#8220;scan&#8221;), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information, visit http://code.google.com/p/clpp.]]></description>
			<content:encoded><![CDATA[<p><a href="http://code.google.com/p/clpp/" target="_blank">clpp</a> is an OpenCL library of data-parallel algorithm primitives such as parallel prefix sum (&#8220;scan&#8221;), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information, visit <a href="http://code.google.com/p/clpp/" target="_blank">http://code.google.com/p/clpp</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/06/03/opencl-parallel-primitives-library/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CUDA 4.0 Release Aims to Make Parallel Programming Easier</title>
		<link>http://gpgpu.org/2011/03/01/cuda-4-0-release</link>
		<comments>http://gpgpu.org/2011/03/01/cuda-4-0-release#comments</comments>
		<pubDate>Tue, 01 Mar 2011 07:55:01 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Press]]></category>
		<category><![CDATA[High-Performance Computing]]></category>
		<category><![CDATA[Multi-GPU]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3309</guid>
		<description><![CDATA[Today NVIDIA announced the upcoming 4.0 release of CUDA.  While most of the major CUDA releases accompanied a new GPU architecture, 4.0 is a software-only release, but that doesn&#8217;t mean there aren&#8217;t a lot of new features.  With this release, NVIDIA is aiming to lower the barrier to entry to parallel programming on GPUs, with [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gpgpu.org/wp/wp-content/uploads/2011/01/NVLogo_2D-e1298965986472.jpg"><img class="alignright size-full wp-image-3194" title="NVLogo_2D" src="http://gpgpu.org/wp/wp-content/uploads/2011/01/NVLogo_2D-e1298965986472.jpg" alt="" width="150" height="111" /></a>Today NVIDIA announced the upcoming 4.0 release of CUDA.  While most of the major CUDA releases accompanied a new GPU architecture, 4.0 is a software-only release, but that doesn&#8217;t mean there aren&#8217;t a lot of new features.  With this release, NVIDIA is aiming to lower the barrier to entry to parallel programming on GPUs, with new features including easier multi-GPU programming, a unified virtual memory address space, the powerful Thrust C++ template library, and automatic performance analysis in the Visual Profiler tool.  Full details follow in the quoted press release below.</p>
<p><span id="more-3309"></span></p>
<blockquote><p>SANTA CLARA, CA &#8212; (Marketwire) &#8212; 02/28/2011 &#8211; NVIDIA today announced the latest version of the NVIDIA® CUDA® Toolkit for developing parallel applications using NVIDIA GPUs.</p>
<p>The NVIDIA CUDA 4.0 Toolkit was designed to make parallel programming easier, and enable more developers to port their applications to GPUs. This has resulted in three main features:</p>
<ul>
<li>NVIDIA GPUDirect™ 2.0 Technology &#8211; Offers support for peer-to-peer communication among GPUs within a single server or workstation. This enables easier and faster multi-GPU programming and application performance.</li>
<li>Unified Virtual Addressing (UVA) &#8211; Provides a single merged-memory address space for the main system memory and the GPU memories, enabling quicker and easier parallel programming.</li>
<li>Thrust C++ Template Performance Primitives Libraries &#8211; Provides a collection of powerful open source C++ parallel algorithms and data structures that ease programming for C++ developers. With Thrust, routines such as parallel sorting are 5X to 100X faster than with Standard Template Library (STL) and Threading Building Blocks (TBB).</li>
</ul>
<p>&#8220;Unified virtual addressing and faster GPU-to-GPU communication makes it easier for developers to take advantage of the parallel computing capability of GPUs,&#8221; said John Stone, senior research programmer, University of Illinois, Urbana-Champaign.</p>
<p>&#8220;Having access to GPU computing through the standard template interface greatly increases productivity for a wide range of tasks, from simple cashflow generation to complex computations with Libor market models, variable annuities or CVA adjustments,&#8221; said Peter Decrem, director of Rates Products at Quantifi. &#8220;The Thrust C++ library has lowered the barrier of entry significantly by taking care of low-level functionality like memory access and allocation, allowing the financial engineer to focus on algorithm development in a GPU-enhanced environment.&#8221;</p>
<p>The CUDA 4.0 architecture release includes a number of other key features and capabilities, including:</p>
<ul>
<li>MPI Integration with CUDA Applications &#8211; Modified MPI implementations automatically move data from and to the GPU memory over Infiniband when an application does an MPI send or receive call.</li>
<li>Multi-thread Sharing of GPUs &#8211; Multiple CPU host threads can share contexts on a single GPU, making it easier to share a single GPU by multi-threaded applications.</li>
<li>Multi-GPU Sharing by Single CPU Thread &#8211; A single CPU host thread can access all GPUs in a system. Developers can easily coordinate work across multiple GPUs for tasks such as &#8220;halo&#8221; exchange in applications.</li>
<li>New NPP Image and Computer Vision Library &#8211; A rich set of image transformation operations that enable rapid development of imaging and computer vision applications.</li>
<li>New and Improved Capabilities
<ul>
<li>Auto performance analysis in the Visual Profiler</li>
<li>New features in cuda-gdb and added support for MacOS</li>
<li>Added support for C++ features like new/delete and virtual functions</li>
<li>New GPU binary disassembler</li>
</ul>
</li>
</ul>
<p>A release candidate of CUDA Toolkit 4.0 will be available free of charge beginning March 4, 2011, by enrolling in the CUDA Registered Developer Program at: <a href="http://www.nvidia.com/paralleldeveloper" target="_blank">www.nvidia.com/paralleldeveloper</a>. The CUDA Registered Developer Program provides a wealth of tools, resources, and information for parallel application developers to maximize the potential of CUDA.</p>
<p>For more information on the features and capabilities of the CUDA Toolkit and on GPGPU applications, please visit: <a href="http://www.nvidia.com/cuda" target="_blank">www.nvidia.com/cuda</a>.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/03/01/cuda-4-0-release/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Thrust v1.3 release</title>
		<link>http://gpgpu.org/2010/10/07/thrust-v1-3-release</link>
		<comments>http://gpgpu.org/2010/10/07/thrust-v1-3-release#comments</comments>
		<pubDate>Fri, 08 Oct 2010 01:25:16 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2840</guid>
		<description><![CDATA[Thrust v1.3, an open-source template library for CUDA applications, has been released. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.3 adds several new features, including: a state-of-the-art sorting implementation, recently featured on Slashdot. performance improvements to stream compaction and reduction robust [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gpgpu.org/wp/wp-content/uploads/2010/10/thrust_logo-e1286501346306.png"><img class="alignright size-full wp-image-2841" title="thrust_logo" src="http://gpgpu.org/wp/wp-content/uploads/2010/10/thrust_logo-e1286501346306.png" alt="" width="200" height="79" /></a><a href="http://thrust.googlecode.com">Thrust</a> v1.3, an open-source template library for CUDA applications, has been released.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing.</p>
<p>Version 1.3 adds several new features, including:</p>
<ul>
<li>a state-of-the-art sorting implementation, recently <a href="http://developers.slashdot.org/story/10/08/30/0133203/Sorting-Algorithm-Breaks-Giga-Sort-Barrier-With-GPUs">featured</a> on Slashdot.</li>
<li>performance improvements to stream compaction and reduction</li>
<li>robust error reporting and failure detection</li>
<li>support for CUDA 3.2 and gf104-based GPUs</li>
<li>search algorithms</li>
<li>and <a href="http://code.google.com/p/thrust/source/browse/CHANGELOG?r=2444d6c2eb30fea369b0417940d2306f8d03040c">more</a>!</li>
</ul>
<p>Get started with Thrust today!  First <a href="http://thrust.googlecode.com/files/thrust-v1.3.0.zip">download Thrust v1.3</a> and then follow the online <a href="http://code.google.com/p/thrust/wiki/QuickStartGuide">quick-start guide</a>.  Refer to the <a href="http://code.google.com/p/thrust/wiki/Documentation">online documentation</a> for a complete list of features.  Many <a href="http://thrust.googlecode.com/files/examples-v1.3.zip">concrete examples</a> and a set of <a href="http://code.google.com/p/thrust/downloads/list">introductory slides</a> are also available.<span id="more-2840"></span></p>
<p>Thrust is open-source software distributed under the <a href="http://www.opensource.org/licenses/apache2.0.php">OSI-approved</a> Apache License v2.0.</p>
<p>Acknowledgments</p>
<ul>
<li>Thanks to Duane Merrill for contributing a fast radix sort implementation</li>
<li>Thanks to Erich Elsen for contributing an implementation of find_if</li>
<li>Thanks to Andrew Corrigan for contributing changes which enable OpenMP in the absence of nvcc</li>
<li>Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports</li>
<li>Thanks to Cliff Woolley for help with testing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/10/07/thrust-v1-3-release/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid</title>
		<link>http://gpgpu.org/2010/03/03/cyclic-reduction-multigrid</link>
		<comments>http://gpgpu.org/2010/03/03/cyclic-reduction-multigrid#comments</comments>
		<pubDate>Wed, 03 Mar 2010 06:30:37 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Numerics]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Scientific Computing]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2171</guid>
		<description><![CDATA[Abstract: We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.</p></blockquote>
<p>(Dominik Göddeke and Robert Strzodka: <em>&#8220;Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid&#8221;</em>, accepted in: IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators, Mar. 2010. <a href="http://www.mathematik.tu-dortmund.de/~goeddeke/pubs/index.html#Goeddeke_2010_CRT" target="_blank">Link</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/03/03/cyclic-reduction-multigrid/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CUDPP Users: Please Complete This Survey!</title>
		<link>http://gpgpu.org/2010/02/11/cudpp-survey</link>
		<comments>http://gpgpu.org/2010/02/11/cudpp-survey#comments</comments>
		<pubDate>Fri, 12 Feb 2010 00:56:00 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Surveys]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2145</guid>
		<description><![CDATA[The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the CUDPP Survey.  This survey will help the CUDPP Team prioritize new development and support for existing and new features.]]></description>
			<content:encoded><![CDATA[<p>The developers of the CUDPP (CUDA Data-Parallel Primitives) Library request that users (past and current) of the CUDPP Library fill out the <a href="http://gd.is/TTJ3" target="_blank">CUDPP Survey</a>.  This survey will help the CUDPP Team prioritize new development and support for existing and new features.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/02/11/cudpp-survey/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Thrust 1.1 Released</title>
		<link>http://gpgpu.org/2009/09/11/thrust-1-1-released</link>
		<comments>http://gpgpu.org/2009/09/11/thrust-1-1-released#comments</comments>
		<pubDate>Fri, 11 Sep 2009 05:51:01 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Sorting]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=1856</guid>
		<description><![CDATA[Thrust (v1.1) is  an open-source template library for developing CUDA applications.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including: fancy iterators binary search algorithms pair and tuple types segmented scan (experimental) pinned memory support (experimental) and more! [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Thrust" href="http://thrust.googlecode.com/" target="_blank">Thrust</a> (v1.1) is  an open-source template library for developing CUDA applications.  Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing. Version 1.1 adds several new features, including:</p>
<ul>
<li> <a title="fancy iterators" href="http://thrust.googlecode.com/svn/tags/1.1.0/doc/html/group__fancyiterator.html" target="_blank">fancy iterators</a></li>
<li> <a title="binary search algorithms" href="http://thrust.googlecode.com/svn/tags/1.1.0/doc/html/group__binary__search.html" target="_blank">binary search algorithms</a></li>
<li> <a title="pair and tuple types" href="http://thrust.googlecode.com/svn/tags/1.1.0/doc/html/group__utility.html" target="_blank">pair and tuple types</a></li>
<li> <a title="segmented scan (experimental)" href="http://thrust.googlecode.com/svn/tags/1.1.0/doc/html/group__segmentedprefixsums.html" target="_blank">segmented scan (experimental)</a></li>
<li> <a title="pinned memory support (experimental)" href="http://thrust.googlecode.com/svn/tags/1.1.0/doc/html/group__memory__management__classes.html" target="_blank">pinned memory support (experimental)</a></li>
<li> and <a title="more" href="http://code.google.com/p/thrust/source/browse/tags/1.1.0/CHANGELOG" target="_blank">more</a>!</li>
</ul>
<p>To get started with Thrust, first <a title="Download Thrust" href="http://code.google.com/p/thrust/downloads/list" target="_blank">download</a> Thrust and then follow the online <a href="http://code.google.com/p/thrust/wiki/Tutorial" target="_blank">tutorial</a>.  Refer to the <a title="online documentation" href="http://code.google.com/p/thrust/wiki/Documentation" target="_blank">online documentation</a> for a complete list of features.
Many <a href="http://thrust.googlecode.com/files/examples.zip" target="_blank">concrete examples</a> and a set of <a href="http://code.google.com/p/thrust/downloads/list" target="_blank">introductory slides</a> are also available. As the following code example shows, Thrust programs are concise and readable. <span id="more-1856"></span></p>
<pre>#include &lt;thrust/host_vector.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/generate.h&gt;
#include &lt;thrust/sort.h&gt;
#include &lt;cstdlib&gt;

int main(void)
{
    // generate twenty random numbers on the host
    thrust::host_vector&lt;int&gt; h_vec(20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector&lt;int&gt; d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}</pre>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2009/09/11/thrust-1-1-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Efficient parallel scan algorithms for GPUs</title>
		<link>http://gpgpu.org/2009/06/24/sengupta-segscan</link>
		<comments>http://gpgpu.org/2009/06/24/sengupta-segscan#comments</comments>
		<pubDate>Thu, 25 Jun 2009 01:19:46 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[CUDPP]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=1696</guid>
		<description><![CDATA[This NVIDIA technical report by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs.   This paper describes the primitives included in the latest release of the CUDPP library. Abstract: Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan [...]]]></description>
			<content:encoded><![CDATA[<p>This <a href="http://mgarland.org/papers.html#segscan-tr" target="_blank">NVIDIA technical report</a> by Sengupta, Harris, and Garland describes the design of new parallel algorithms for scan and segmented scan on GPUs.   This paper describes the primitives included in the latest release of the <a href="http://gpgpu.org/developer/cudpp">CUDPP</a> library.</p>
<p>Abstract:</p>
<blockquote><p>Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms.  Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages.  In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs.  Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines.  We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts.  These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.</p></blockquote>
<p>(S. Sengupta, M. Harris, and M. Garland. <a href="http://mgarland.org/papers.html#segscan-tr" target="_blank"><em>Efficient parallel scan algorithms for GPUs</em></a>.     NVIDIA Technical Report NVR-2008-003, December 2008)</p>
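<p>For readers unfamiliar with segmented scan, the following sequential C++ sketch shows what the primitive computes. It is only a reference for the semantics, not the warp-based CUDA design from the report and not the CUDPP API; the function name and the head-flag convention (1 marks the first element of a segment) are assumptions made for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive segmented sum scan (sequential reference).
// head[i] == 1 marks the first element of a segment; the running sum
// restarts there, so each segment is scanned independently of the others.
std::vector<int> segmented_inclusive_scan(const std::vector<int>& values,
                                          const std::vector<int>& head) {
    assert(values.size() == head.size());
    std::vector<int> out(values.size());
    int running = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (head[i] == 1) running = 0;  // a new segment starts here
        running += values[i];
        out[i] = running;
    }
    return out;
}
```

<p>With values 1, 2, 3, 4, 5, 6 and head flags 1, 0, 0, 1, 0, 1, the segments are (1, 2, 3), (4, 5) and (6), and the result is 1, 3, 6, 4, 9, 6.</p>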
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2009/06/24/sengupta-segscan/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast and Scalable List Ranking on the GPU</title>
		<link>http://gpgpu.org/2009/04/28/fast-and-scalable-list-ranking-on-the-gpu</link>
		<comments>http://gpgpu.org/2009/04/28/fast-and-scalable-list-ranking-on-the-gpu#comments</comments>
		<pubDate>Wed, 29 Apr 2009 03:27:38 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[List Ranking]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>

		<guid isPermaLink="false">http://gpgpu.org/2009/04/28/fast-and-scalable-list-ranking-on-the-gpu</guid>
		<description><![CDATA[Abstract from the paper by Rehman et al.: General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract from the <a href="http://research.iiit.ac.in/~rehman/Papers/ics152-rehman.pdf" target="_blank">paper by Rehman et al.</a>:</p>
<p>General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. While GPUs are usually used to tackle regular problems that can be easily parallelized, we describe two implementations of List Ranking—a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. In our best implementation, we introduce a GPU-optimized, recursive version of the Helman-JaJa algorithm. Our implementation can rank a random list of 8 million elements in just over 100 milliseconds, and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss some practical issues that come to the fore when working with massively multi-threaded architectures, especially for algorithms with highly irregular memory access patterns. (M. Suhail Rehman, K. Kothapalli, P.J. Narayanan. <a href="http://research.iiit.ac.in/~rehman/Papers/ics152-rehman.pdf" target="_blank">Fast and Scalable List Ranking on the GPU</a>. 23rd International Conference on Supercomputing (ICS). New York, USA, June 2009. (To Appear))</p>
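<p>As background on the list ranking problem itself, here is a small C++ sketch of pointer jumping (Wyllie's classic technique), simulated sequentially. This is deliberately not the Helman-JaJa variant used in the paper. The successor-array representation, the <code>kNil</code> marker and the function name are assumptions for illustration, and the input is assumed to be acyclic.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNil = -1;  // hypothetical marker: node has no successor (tail)

// Returns, for every node, its distance to the end of the list.
// succ[i] is the index of the node after i, or kNil for the tail.
// Each loop iteration models one parallel round of pointer jumping.
std::vector<int> list_rank(std::vector<int> succ) {
    const std::size_t n = succ.size();
    std::vector<int> rank(n);
    for (std::size_t i = 0; i < n; ++i) {
        assert(succ[i] == kNil ||
               (succ[i] >= 0 && static_cast<std::size_t>(succ[i]) < n));
        rank[i] = (succ[i] == kNil) ? 0 : 1;
    }
    bool changed = true;
    while (changed) {
        changed = false;
        // Double buffering: every node reads the state of the previous round.
        std::vector<int> next_rank = rank;
        std::vector<int> next_succ = succ;
        for (std::size_t i = 0; i < n; ++i) {
            const int s = succ[i];
            if (s == kNil) continue;
            const std::size_t j = static_cast<std::size_t>(s);
            next_rank[i] = rank[i] + rank[j];  // absorb the successor's distance
            next_succ[i] = succ[j];            // jump over the successor
            changed = true;
        }
        rank.swap(next_rank);
        succ.swap(next_succ);
    }
    return rank;
}
```

<p>Each round halves the remaining chain length, so a list of n nodes needs about log2(n) rounds. The scattered reads of <code>succ[j]</code> are the kind of irregular memory access the abstract refers to.</p>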
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2009/04/28/fast-and-scalable-list-ranking-on-the-gpu/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Designing Efficient Sorting Algorithms for Manycore GPUs</title>
		<link>http://gpgpu.org/2009/03/01/designing-efficient-sorting-algorithms-for-manycore-gpus</link>
		<comments>http://gpgpu.org/2009/03/01/designing-efficient-sorting-algorithms-for-manycore-gpus#comments</comments>
		<pubDate>Sun, 01 Mar 2009 22:25:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>
		<category><![CDATA[Sorting]]></category>

		<guid isPermaLink="false">http://www.gpgpu.org/newgpgpu/?p=1187</guid>
		<description><![CDATA[This IPDPS 2009 paper by Nadathur Satish, Mark Harris, and Michael Garland describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by NVIDIA CUDA. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based [...]]]></description>
			<content:encoded><![CDATA[<p>This <a href="http://www.ipdps.org/" target="_blank">IPDPS</a> 2009 paper by <a href="http://www.eecs.berkeley.edu/~nrsatish/" target="_blank">Nadathur Satish</a>, <a href="http://www.markmark.net" target="_blank">Mark Harris</a>, and <a href="http://mgarland.org/home.html" target="_blank">Michael Garland</a> describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by <a href="http://www.nvidia.com/cuda" target="_blank">NVIDIA CUDA</a>. The radix sort described is the fastest GPU sort and the merge sort described is the fastest comparison-based GPU sort reported in the literature. The radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, the authors carefully design the algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. They exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors. (N. Satish, M. Harris, and M. Garland. <a href="http://mgarland.org/papers.html" target="_blank">Designing efficient sorting algorithms for manycore GPUs</a>. Proc. 23rd IEEE Int’l Parallel &amp; Distributed Processing Symposium, May 2009. To appear.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2009/03/01/designing-efficient-sorting-algorithms-for-manycore-gpus/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ph.D. Dissertation: Glift Generic GPU Data Structures, by Aaron Lefohn</title>
		<link>http://gpgpu.org/2007/01/18/phd-dissertation-glift-generic-gpu-data-structures-by-aaron-lefohn</link>
		<comments>http://gpgpu.org/2007/01/18/phd-dissertation-glift-generic-gpu-data-structures-by-aaron-lefohn#comments</comments>
		<pubDate>Thu, 18 Jan 2007 13:44:00 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Data Structures]]></category>
		<category><![CDATA[Dissertations]]></category>
		<category><![CDATA[Parallel Algorithms]]></category>

		<guid isPermaLink="false">http://www.gpgpu.org/cgi-bin/blosxom.cgi/DataParallelAlgorithms/lefohnThesis06.html</guid>
		<description><![CDATA[This Ph.D. dissertation by Aaron Lefohn at the University of California, Davis describes the Glift GPU data structure abstraction and its application to both GPU-based data-parallel and interactive rendering algorithms. The applications include octree 3D painting, adaptive shadow maps, resolution matched shadow maps, heat-diffusion depth-of-field, and a GPU-based direct solver for tridiagonal linear systems. While [...]]]></description>
			<content:encoded><![CDATA[<p>This <a href="http://graphics.cs.ucdavis.edu/~lefohn/work/dissertation/">Ph.D. dissertation</a> by <a href="http://graphics.cs.ucdavis.edu/~lefohn/" title="Aaron Lefohn" target="_blank">Aaron Lefohn</a> at the<a href="http://graphics.cs.ucdavis.edu/"> University of California, Davis</a> describes the Glift GPU data structure abstraction and its application to both GPU-based data-parallel and interactive rendering algorithms. The applications include octree 3D painting, adaptive shadow maps, resolution matched shadow maps, heat-diffusion depth-of-field, and a GPU-based direct solver for tridiagonal linear systems. While much of this work has been posted previously, this dissertation contains a more in-depth discussion of the Glift data structure library and introduces several GPGPU and rendering algorithms that are not yet published. This dissertation demonstrates that a data structure abstraction for GPUs can simplify the description of new and existing data structures, stimulate development of complex GPU algorithms, and perform equivalently to hand-coded implementations. The dissertation also presents a case that future interactive rendering solutions will be an inseparable mix of general-purpose, data-parallel algorithms and traditional graphics programming. (<a href="http://graphics.cs.ucdavis.edu/~lefohn/">Aaron Lefohn</a>, <a href="http://graphics.cs.ucdavis.edu/~lefohn/work/dissertation/" target="_blank" title="Glift Dissertation">&#8220;Glift: Generic Data Structures for Graphics Hardware&#8221;</a>, Ph.D. dissertation, Computer Science Department, <a href="http://graphics.cs.ucdavis.edu/" target="_blank" title="UC Davis">University of California Davis</a>, September 2006.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2007/01/18/phd-dissertation-glift-generic-gpu-data-structures-by-aaron-lefohn/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

