<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Sparse Linear Systems :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/sparse-linear-systems/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Tue, 22 May 2012 08:44:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU</title>
		<link>http://gpgpu.org/2012/04/01/adaptive-row-grouped-csr-sparse-matrices</link>
		<comments>http://gpgpu.org/2012/04/01/adaptive-row-grouped-csr-sparse-matrices#comments</comments>
		<pubDate>Mon, 02 Apr 2012 02:27:53 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Linear Algebra]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4607</guid>
		<description><![CDATA[Abstract: We present a new adaptive format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We present a new adaptive format for storing sparse matrices on GPU. We compare it with several other formats including CUSPARSE which is today probably the best choice for processing of sparse matrices on GPU in CUDA. Contrary to CUSPARSE which works with common CSR format, our new format requires conversion. However, multiplication of sparse-matrix and vector is significantly faster for many matrices. We demonstrate it on a set of 1600 matrices and we show for what types of matrices our format is profitable.</p></blockquote>
<p>(Heller M., Oberhuber T., <em>“Adaptive Row-Grouped CSR Format For Storing of Sparse Matrices on GPU”</em>, preprint on arXiv.org, 2012 [<a href="http://arxiv.org/pdf/1203.5737">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/04/01/adaptive-row-grouped-csr-sparse-matrices/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods</title>
		<link>http://gpgpu.org/2012/03/18/image-segmentation-cuda-runge-kutta-merson-gmres</link>
		<comments>http://gpgpu.org/2012/03/18/image-segmentation-cuda-runge-kutta-merson-gmres#comments</comments>
		<pubDate>Mon, 19 Mar 2012 00:30:55 +0000</pubDate>
		<dc:creator>Mark Harris</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[Image Processing]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4588</guid>
		<description><![CDATA[Abstract: Modern GPUs are well suited for performing image processing tasks. We utilize their high computational performance and memory bandwidth for image segmentation purposes. We segment cardiac MRI data by means of numerical solution of an anisotropic partial differential equation of the Allen-Cahn type. We implement two different algorithms for solving the equation on the CUDA architecture. One of [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Modern GPUs are well suited for performing image processing tasks. We utilize their high computational performance and memory bandwidth for image segmentation purposes. We segment cardiac MRI data by means of numerical solution of an anisotropic partial differential equation of the Allen-Cahn type. We implement two different algorithms for solving the equation on the CUDA architecture. One of them is based on the Runge-Kutta-Merson method for the approximation of solutions of ordinary differential equations, the other uses the GMRES method for the numerical solution of systems of linear equations. In our experiments, the CUDA implementations of both algorithms are about 3–9 times faster than corresponding 12-threaded OpenMP implementations.</p></blockquote>
<p>(Oberhuber T., Suzuki A., Vacata J., Žabka V., <em>&#8220;Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods&#8221;</em>, Journal of Math-for-Industry, 2011, vol. 3, pp. 73–79 [<a href="http://geraldine.fjfi.cvut.cz/~oberhuber/data/vyzkum/publikace/11-oberhuber-suzuki-vacata-zabka-image-segmentation-in-cuda.pdf">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/03/18/image-segmentation-cuda-runge-kutta-merson-gmres/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Compressed Multiple-Row Storage Format</title>
		<link>http://gpgpu.org/2012/03/16/compressed-multiple-row-storage-format</link>
		<comments>http://gpgpu.org/2012/03/16/compressed-multiple-row-storage-format#comments</comments>
		<pubDate>Fri, 16 Mar 2012 06:11:39 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4578</guid>
		<description><![CDATA[Abstract: A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is determined for over 140 sparse matrices on two Fermi-class graphics processing units (GPUs) and the efficiency of the kernel, which peaks at 36 and 25 GFLOPS at single and double precision, respectively, is compared with that of five existing generic algorithms and industrial implementations. The efficiency of the new format is also measured as a function of the mean (mu) and of the standard deviation (sigma) of the number of matrix nonzero elements per row. The largest speedup is found for matrices with mu &gt; 20 and mu &gt; sigma &gt; 1.5 and can be as high as 43%.</p></blockquote>
<p>(Zbigniew Koza, Maciej Matyka, Sebastian Szkoda, Łukasz Mirosław: <em>&#8220;Compressed Multiple-Row Storage Format&#8221;</em>, Preprint, 2012. [<a title="Link to paper on arXiv.org" href="http://arxiv.org/abs/1203.2946" target="_blank">arXiv</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/03/16/compressed-multiple-row-storage-format/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA</title>
		<link>http://gpgpu.org/2012/03/14/row-grouped-csr-sparse-matrices-cuda</link>
		<comments>http://gpgpu.org/2012/03/14/row-grouped-csr-sparse-matrices-cuda#comments</comments>
		<pubDate>Wed, 14 Mar 2012 07:05:21 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4568</guid>
		<description><![CDATA[Abstract: A new format for storing sparse matrices is suggested. It is designed to perform well mainly on GPU devices. Its implementation in CUDA is presented. Its performance is tested on 1600 different types of matrices. This format is compared in detail with a hybrid format, and strong and weak points of both formats are [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>A new format for storing sparse matrices is suggested. It is designed to perform well mainly on GPU devices. Its implementation in CUDA is presented. Its performance is tested on 1600 different types of matrices. This format is compared in detail with a hybrid format, and strong and weak points of both formats are shown.</p></blockquote>
<p>(Oberhuber T., Suzuki A., Vacata J.: <em>&#8220;New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA&#8221;</em>, Acta Technica 56: 447-466, 2011 [<a href="http://geraldine.fjfi.cvut.cz/~oberhuber/data/vyzkum/publikace/11-oberhuber-suzuki-vacata-rgcsr-format-in-cuda.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/03/14/row-grouped-csr-sparse-matrices-cuda/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms &#8212; Parallel Solvers and Preconditioners</title>
		<link>http://gpgpu.org/2012/03/02/lukarski-phd</link>
		<comments>http://gpgpu.org/2012/03/02/lukarski-phd#comments</comments>
		<pubDate>Fri, 02 Mar 2012 06:52:27 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Dissertations]]></category>
		<category><![CDATA[Iterative Solvers]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4545</guid>
		<description><![CDATA[Abstract: Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and many-core processors lead to a gap in the progression of algorithms and programming environments for these platforms &#8212; the parallel models for large clusters do not fully utilize the performance capability of the multi-core CPUs and especially of the GPUs. Software stack needs to run adequately on the next generation of computing devices in order to exploit the potential of these new systems. Moving numerical software from one platform to another becomes an important task since every parallel device has its own programming model and language. The greatest challenge is to provide new techniques for solving (non-)linear systems that combine scalability, portability, fine-grained parallelism and flexibility across the assortment of parallel platforms and programming models. The goal of this thesis is to provide new fine-grained parallel algorithms embedded in advanced sparse linear algebra solvers and preconditioners on the emerging multi-core and many-core technologies.</p>
<p>With respect to the mathematical methods, we focus on efficient iterative linear solvers. Here, we consider two types of solvers &#8212; out-of-the-box solvers such as preconditioned Krylov subspace solvers (e.g. CG, BiCGStab, GMRES), and problem-aware solvers such as geometric matrix-based multi-grid methods. Clearly, the majority of the solvers can be written in terms of sparse matrix-vector and vector-vector operations which can be performed in parallel. Our aim is to provide parallel, generic and portable preconditioners which are suitable for multi-core and many-core devices. We focus on additive (e.g. Gauss-Seidel, SOR), multiplicative (ILU factorization with or without fill-ins) and approximate inverse preconditioners. The preconditioners can also be used as smoothing schemes in the multi-grid methods via a preconditioned defect correction step. We treat the additive splitting schemes by a multi-coloring technique to provide the necessary level of parallelism. For controlling the fill-in entries for the ILU factorization we propose a novel method which we call the power(q)-pattern method. We prove that this algorithm produces a new matrix structure with diagonal blocks containing only diagonal entries. This approach provides higher degrees of parallelism in comparison with the level-scheduling/topological sort algorithm. With these techniques we can perform the forward and backward substitution of the preconditioning step in parallel. By formulating the algorithm in block-matrix form we can execute the sweeps in parallel only by performing matrix-vector multiplications. Thus, we can express the data-parallelism in the sweeps without any specification of the underlying hardware or programming models.</p>
<p>In object-oriented languages, an abstraction separates the object behavior from its implementation. Based on this abstraction, we have developed a linear algebra toolbox which supports several platforms such as multi-core CPUs, GPUs and accelerators. The various backends (sequential, OpenMP, CUDA, OpenCL) consist of optimized and platform-specific matrix and vector routines. Using unified interfaces across all platforms, the library allows users to build linear solvers and preconditioners without any information about the underlying hardware. With this technique, we can write our solvers and preconditioners in a single source code for all platforms. Furthermore, we can extend the library by adding new platforms without modifying the existing solvers and preconditioners.</p>
<p>In our tests we consider two scenarios – preconditioned Krylov subspace methods and matrix-based multi-grid methods. We demonstrate speed ups in two directions: first, the preconditioners/smoothers reduce the total solution time by decreasing the number of iterations, and second, the preconditioning/smoothing phase is efficiently executed in parallel providing good scalability across several parallel architectures. We present numerical experiments and performance analysis on several platforms such as multi-core CPU and GPU devices. Furthermore, we show the viability and benefit of the proposed preconditioning schemes and software approach.</p>
</blockquote>
<p>(Dimitar Lukarski: <em>&#8220;Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms : Parallel Solvers and Preconditioners&#8221;</em>, PhD thesis, Fakultät für Mathematik, KIT Karlsruhe, Germany, 2012 [<a href="http://digbib.ubka.uni-karlsruhe.de/volltexte/1000026568" target="_blank">WWW</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/03/02/lukarski-phd/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Performance of SpMV in CUSPARSE, CUSP and SpeedIT</title>
		<link>http://gpgpu.org/2012/01/14/performance-of-spmv-in-cusparse-cusp-and-speedit</link>
		<comments>http://gpgpu.org/2012/01/14/performance-of-spmv-in-cusparse-cusp-and-speedit#comments</comments>
		<pubDate>Sat, 14 Jan 2012 12:43:31 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Benchmarks]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4384</guid>
		<description><![CDATA[The SpeedIt team recently benchmarked the SpMV performance of CUSPARSE 4.0, CUSP 0.2.0 and SpeedIT 2.0 on 23 randomly chosen matrices from the University of Florida Sparse Matrix Collection. Comparisons were done on a Tesla C2050 in single and double precision. The full report is available at http://wp.me/p1ZihD-1.]]></description>
			<content:encoded><![CDATA[<p>The SpeedIt team recently benchmarked the SpMV performance of CUSPARSE 4.0, CUSP 0.2.0 and SpeedIT 2.0 on 23 randomly chosen matrices from the University of Florida Sparse Matrix Collection. Comparisons were done on a Tesla C2050 in single and double precision. The full report is available at <a title="full benchmarking report" href="http://wp.me/p1ZihD-1" target="_blank">http://wp.me/p1ZihD-1</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2012/01/14/performance-of-spmv-in-cusparse-cusp-and-speedit/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CULA Sparse Now Available</title>
		<link>http://gpgpu.org/2011/11/10/cula-sparse-now-available</link>
		<comments>http://gpgpu.org/2011/11/10/cula-sparse-now-available#comments</comments>
		<pubDate>Thu, 10 Nov 2011 09:09:48 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=4131</guid>
		<description><![CDATA[EM Photonics has released CULA Sparse, a ready-to-integrate package for solving sparse linear systems. Features include: Interfaces: C, C++, Fortran, Matlab, Python Platforms: all CUDA platforms, including Linux, Windows, and OS X Solvers and preconditioners: BiCG, BiCGStab, CG, GMRES, MINRES and Jacobi, ILU(0) Data formats: COO, CSR, CSC in double precision real and complex floating [...]]]></description>
			<content:encoded><![CDATA[<p>EM Photonics has released CULA Sparse, a ready-to-integrate package for solving sparse linear systems. Features include:</p>
<ul>
<li>Interfaces: C, C++, Fortran, Matlab, Python</li>
<li>Platforms: all CUDA platforms, including Linux, Windows, and OS X</li>
<li>Solvers and preconditioners: BiCG, BiCGStab, CG, GMRES, MINRES and Jacobi, ILU(0)</li>
<li>Data formats: COO, CSR, CSC in double precision real and complex floating point</li>
<li>No CUDA programming experience required.</li>
</ul>
<p>More information is available at <a href="http://www.culatools.com/sparse/" target="_blank">http://www.culatools.com/sparse</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/11/10/cula-sparse-now-available/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods</title>
		<link>http://gpgpu.org/2011/08/04/algebraic-multigrid</link>
		<comments>http://gpgpu.org/2011/08/04/algebraic-multigrid#comments</comments>
		<pubDate>Thu, 04 Aug 2011 07:00:52 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Iterative Solvers]]></category>
		<category><![CDATA[Multigrid]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3807</guid>
		<description><![CDATA[Abstract: Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2x in the setup phase and around 6x in the cycling phase when compared to a representative CPU implementation.</p></blockquote>
<p>(Nathan Bell, Steven Dalton and Luke Olson: <em>&#8220;Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods&#8221;</em>, NVIDIA Technical Report NVR-2011-002, June 2011 [<a href="http://research.nvidia.com/publication/exposing-fine-grained-parallelism-algebraic-multigrid-methods" target="_blank">PDF and Sources</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/08/04/algebraic-multigrid/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs &#8212; The Power(q)-pattern Method</title>
		<link>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners</link>
		<comments>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners#comments</comments>
		<pubDate>Fri, 08 Jul 2011 07:19:38 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3741</guid>
		<description><![CDATA[Abstract: Application demands and grand challenges in numerical simulation require for both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way for multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems &#8212; as already exemplified by recent graphics [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Application demands and grand challenges in numerical simulation require for both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way for multi- and manycore processors with increasing fine-grained parallelism and deeply nested hierarchical memory systems &#8212; as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are diagonal itself. In such a way, parallelism can be exploited on the block-level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.</p>
<p>We propose a new method for anticipating the fill-in pattern of ILU(p) schemes which we call the power(q)-pattern method. This method is based on an incomplete factorization of the system matrix A subject to a predetermined pattern given by the matrix power |A|<sup>p+1</sup> and its associated multi-coloring permutation pi. We prove that the obtained sparsity pattern is a superset of our modified ILU(p) factorization applied to pi A pi<sup>-1</sup>. As a result, this modified ILU(p) applied to multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU(p) sweeps.</p>
<p>In addition, we describe the integration of the preconditioners into the HiFlow<sup>3</sup> open-source finite element package that provides a portable software solution across diverse hardware platforms. On this basis, we conduct performance analysis across a variety of test problems on multi-core CPUs and GPUs that proves efficiency, scalability and flexibility of our approach. Our preconditioners achieve a solver acceleration by a factor of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are by a factor of up to 4 faster than an OpenMP parallel version on eight cores.</p></blockquote>
<p>(Vincent Heuveline, Dimitar Lukarski and Jan-Philipp Weiss: <em>&#8220;Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs &#8212; The Power(q)-pattern Method&#8221;</em>, EMCL Preprint Series, number 08, July 2011 [<a href="http://www.emcl.kit.edu/preprints/emcl-preprint-2011-08.pdf" target="_blank">PDF</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/07/08/parallel-ilup-based-preconditioners/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Solution of Sparse Triangular Linear Systems</title>
		<link>http://gpgpu.org/2011/06/26/parallel-solution-of-sparse-triangular-linear-systems</link>
		<comments>http://gpgpu.org/2011/06/26/parallel-solution-of-sparse-triangular-linear-systems#comments</comments>
		<pubDate>Sun, 26 Jun 2011 23:22:32 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Numerical Algorithms]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=3680</guid>
		<description><![CDATA[Abstract: A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel. The numerical experiments are also presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods, using the parallel sparse triangular solve algorithm, can achieve on average more than 2x speedup on graphical processing units (GPUs) over their CPU implementation.</p></blockquote>
<p>(Maxim Naumov: <em>&#8220;Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU&#8221;</em>, NVIDIA Technical Report, June 2011. [<a href="http://research.nvidia.com/publication/parallel-solution-sparse-triangular-linear-systems-preconditioned-iterative-methods-gpu" target="_blank">WWW</a>])</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2011/06/26/parallel-solution-of-sparse-triangular-linear-systems/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

