<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Multicore :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/multicore/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Tue, 22 May 2012 08:44:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs &#8212; The Power(q)-pattern Method</title>
		<link>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners</link>
		<comments>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners#comments</comments>
		<pubDate>Fri, 08 Jul 2011 07:19:38 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3741</guid>
		<description><![CDATA[Abstract: Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems &#8212; as already exemplified by recent graphics [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems &#8212; as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In this way, parallelism can be exploited on the block level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.</p>
<p>We propose a new method for anticipating the fill-in pattern of ILU(p) schemes which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix A subject to a predetermined pattern given by the matrix power |A|<sup>p+1</sup> and its associated multi-coloring permutation pi. We prove that the obtained sparsity pattern is a superset of our modified ILU(p) factorization applied to pi A pi<sup>-1</sup>. As a result, this modified ILU(p) applied to the multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU(p) sweeps.</p>
<p>In addition, we describe the integration of the preconditioners into the HiFlow<sup>3</sup> open-source finite element package that provides a portable software solution across diverse hardware platforms. On this basis, we conduct a performance analysis across a variety of test problems on multi-core CPUs and GPUs that proves the efficiency, scalability and flexibility of our approach. Our preconditioners accelerate the solver by factors of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are up to a factor of 4 faster than an OpenMP parallel version on eight cores.</p></blockquote>
<p>(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: &#8220;Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs &#8212; The Power(q)-pattern Method&#8221;, EMCL Preprint Series, number 08, July 2011 [<a href="http://www.emcl.kit.edu/preprints/emcl-preprint-2011-08.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Workshop on GPU Programming for Molecular Modeling, August 6-8, 2010, University of Illinois</title>
		<link>http://gpgpu.org/2010/06/18/workshop-gpu-molecular-modeling</link>
		<comments>http://gpgpu.org/2010/06/18/workshop-gpu-molecular-modeling#comments</comments>
		<pubDate>Fri, 18 Jun 2010 23:01:45 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Molecular Dynamics]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[Tutorials & Courses]]></category>
		<category><![CDATA[Workshops]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2469</guid>
		<description><![CDATA[The Theoretical and Computational Biophysics Group, NIH Resource for Macromolecular Modeling and Bioinformatics (www.ks.uiuc.edu) at the University of Illinois at Urbana-Champaign, presents a Workshop on GPU Programming for Molecular Modeling to be held August 6-8, 2010, at the Beckman Institute for Advanced Science and Technology, on the University of Illinois campus in Urbana, Illinois, USA. [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_2477" class="wp-caption alignright" style="width: 218px"><a href="http://gpgpu.org/wp/wp-content/uploads/2010/06/riboions-small.jpg"><img class="size-full wp-image-2477 " title="riboions-small" src="http://gpgpu.org/wp/wp-content/uploads/2010/06/riboions-small.jpg" alt="GPU-Accelerated Ion Placement" width="208" height="181" /></a><p class="wp-caption-text">GPU-Accelerated Ion Placement</p></div>
<p>The Theoretical and Computational Biophysics Group, NIH Resource for Macromolecular Modeling and Bioinformatics (www.ks.uiuc.edu) at the University of Illinois at Urbana-Champaign, presents a <a href="http://www.ks.uiuc.edu/Training/Workshop/GPU_Aug2010/" target="_blank">Workshop on GPU Programming for Molecular Modeling</a> to be held August 6-8, 2010, at the Beckman Institute for Advanced Science and Technology, on the University of Illinois campus in Urbana, Illinois, USA. Application, selection, and notification of participants is on-going through July 29, 2010.</p>
<p>Note:  Participants are encouraged to attend the multi-site <a href="https://www.vscse.org/summerschool/2010/manycore.html" target="_blank">&#8220;Proven Algorithmic Techniques for Many-core Processors&#8221; workshop</a> the preceding week (August 2-6) at the location of their choice. Registration for this workshop is required for participants without equivalent GPU-programming training or experience.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/06/18/workshop-gpu-molecular-modeling/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Believe It or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!&#8221;</title>
		<link>http://gpgpu.org/2010/05/30/ibm-rc24982</link>
		<comments>http://gpgpu.org/2010/05/30/ibm-rc24982#comments</comments>
		<pubDate>Sun, 30 May 2010 21:40:08 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Image Processing]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2339</guid>
		<description><![CDATA[Abstract: In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an nVidia GTX [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>Abstract:</p>
<p>In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.</p>
<p>(Rajesh Bordawekar, Uday Bondhugula and Ravi Rao: <em>&#8220;Believe It or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!&#8221;</em>. <a href="http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/9192e6536facfcef85257720005a0265!OpenDocument&#038;Highlight=0,Bordawekar" target="_blank">Technical Report RC24982</a>, IBM Thomas J. Watson Research Center, Apr. 2010.)</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/05/30/ibm-rc24982/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors</title>
		<link>http://gpgpu.org/2010/02/28/lattice-boltzmann-shallow-water-equations</link>
		<comments>http://gpgpu.org/2010/02/28/lattice-boltzmann-shallow-water-equations#comments</comments>
		<pubDate>Mon, 01 Mar 2010 00:17:49 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Cell BE]]></category>
		<category><![CDATA[Fluid Simulation]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2162</guid>
		<description><![CDATA[Abstract: We present an efficient method for the simulation of laminar fluid flows with free surfaces including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We present an efficient method for the simulation of laminar fluid flows with free surfaces including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We show that our code scales well on an MPI-based cluster; that an eightfold speedup can be achieved using modern GPUs in contrast to multithreaded CPU code and, finally, that it is possible to solve fluid-structure interaction scenarios with high resolution at interactive rates.</p></blockquote>
<p>(Markus Geveler, Dirk Ribbrock, Dominik Göddeke and Stefan Turek: <em>&#8220;Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors&#8221;</em>, Accepted in: Facing the Multicore Challenge, Heidelberg, Germany, Mar. 2010. <a href="http://www.mathematik.tu-dortmund.de/~goeddeke/pubs/index.html#Geveler_2010_LBS" target="_blank">Link</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/02/28/lattice-boltzmann-shallow-water-equations/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>HONEI: A collection of libraries for numerical computations targeting multiple processor architectures</title>
		<link>http://gpgpu.org/2010/02/02/honei-cpc</link>
		<comments>http://gpgpu.org/2010/02/02/honei-cpc#comments</comments>
		<pubDate>Wed, 03 Feb 2010 00:44:45 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Cell BE]]></category>
		<category><![CDATA[Fluid Simulation]]></category>
		<category><![CDATA[Meta-programming]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2108</guid>
		<description><![CDATA[Abstract: We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI&#8217;s libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI&#8217;s SSE backend, and an additional 3&#8211;4 and 4&#8211;16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.</p></blockquote>
<p>(Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock, <a href="http://www.mathematik.tu-dortmund.de/~goeddeke" target="_blank">Dominik Göddeke</a> and Carsten Gutwenger: <em>HONEI: A collection of libraries for numerical computations targeting multiple processor architectures</em>. Computer Physics Communications 180(12), pp. 2534-2543, December 2009. DOI <a href="http://dx.doi.org/10.1016/j.cpc.2009.04.018" target="_blank">10.1016/j.cpc.2009.04.018</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/02/02/honei-cpc/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JVSP Special Issue on Multicore Enabled Multimedia Applications &amp; Architectures</title>
		<link>http://gpgpu.org/2007/07/17/jvsp-special-issue-on-multicore-enabled-multimedia-applications-architectures</link>
		<comments>http://gpgpu.org/2007/07/17/jvsp-special-issue-on-multicore-enabled-multimedia-applications-architectures#comments</comments>
		<pubDate>Tue, 17 Jul 2007 19:16:00 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Chip Multiprocessors]]></category>
		<category><![CDATA[Journals]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[Parallel Processing]]></category>

		<guid isPermaLink="false">http://www.gpgpu.org/cgi-bin/blosxom.cgi/Conferences/vlsiMultiCore07.html</guid>
		<description><![CDATA[The trend toward multicore processors brings a paradigm shift in application development. Traditionally, increasing clock frequency was one of the main ways for conventional processors to achieve higher performance. Application developers used to improve the performance of their applications by just waiting for faster processor platforms. Today, increasing clock frequency has reached a [...]]]></description>
			<content:encoded><![CDATA[<p>The trend toward multicore processors brings a paradigm shift in application development. Traditionally, increasing clock frequency was one of the main ways for conventional processors to achieve higher performance. Application developers used to improve the performance of their applications by just waiting for faster processor platforms. Today, increasing clock frequency has reached a point of diminishing returns, and even negative returns if power is taken into account. Multicore processors, also known as chip multiprocessors (CMPs), promise a power-efficient way to increase performance and are becoming more prevalent in vendors&#8217; solutions, for example, IBM CELL Broadband Engine processors, Intel Core 2 Duo processors, and so on. However, the application or algorithm development process must change significantly in order to fully exploit the potential of multicore processors. This special issue of the Journal of VLSI Signal Processing Systems discusses related challenges, issues, case studies, and solutions, especially focusing on multimedia-related applications, architectures, and programming environments, for example, understanding the complexity of developing a new application or porting an existing application onto a multicore processor. (<a href="http://www.geocities.com/ykchen913/jvsp_multicore.htm" title="JVSP Call for Papers">Call for papers</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2007/07/17/jvsp-special-issue-on-multicore-enabled-multimedia-applications-architectures/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Workshop: Data-Parallel Programming Models for Many-Core Architectures</title>
		<link>http://gpgpu.org/2007/03/07/workshop-this-weekend-data-parallel-programming-models-for-many-core-architectures</link>
		<comments>http://gpgpu.org/2007/03/07/workshop-this-weekend-data-parallel-programming-models-for-many-core-architectures#comments</comments>
		<pubDate>Wed, 07 Mar 2007 17:57:00 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Data-Parallel]]></category>
		<category><![CDATA[Many-core]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Workshops]]></category>

		<guid isPermaLink="false">http://www.gpgpu.org/cgi-bin/blosxom.cgi/Conferences/cgo2007.html</guid>
		<description><![CDATA[Data-parallel programming models are emerging as an extremely attractive model for parallel programming, driven by several factors. Through deterministic semantics and constrained synchronization mechanisms, they provide race-free parallel-programming semantics. Furthermore, data-parallel programming models free programmers from reasoning about the details of the underlying hardware and software mechanisms for achieving parallel execution and facilitate effective compilation. [...]]]></description>
			<content:encoded><![CDATA[<p>Data-parallel programming models are emerging as an extremely attractive model for parallel programming, driven by several factors. Through deterministic semantics and constrained synchronization mechanisms, they provide race-free parallel-programming semantics. Furthermore, data-parallel programming models free programmers from reasoning about the details of the underlying hardware and software mechanisms for achieving parallel execution and facilitate effective compilation. Finally, efforts in the GPGPU movement and elsewhere have matured implementation technologies for streaming and data-parallel programming models to the point where high performance can be reliably achieved.</p>
<p>This workshop gathers commercial and academic researchers, vendors, and users of data-parallel programming platforms to discuss implementation experience for a broad range of many-core architectures and to speculate on future programming-model directions. Participating institutions include AMD, Electronic Arts, Intel, Microsoft, NVIDIA, PeakStream, RapidMind, and The University of New South Wales. (<a href="http://groups.google.com/group/dataparallel/web/cfp-cgo-workshop-2007?hl=en" target="_blank">Link to Call for Participation, Data-Parallel Programming Models for Many-Core Architectures</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2007/03/07/workshop-this-weekend-data-parallel-programming-models-for-many-core-architectures/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

