<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPGPU &#187; Tag: Papers :: GPGPU.org</title>
	<atom:link href="http://gpgpu.org/tag/papers/feed" rel="self" type="application/rss+xml" />
	<link>http://gpgpu.org</link>
	<description>General-Purpose Computation on Graphics Hardware</description>
	<lastBuildDate>Fri, 30 Jul 2010 02:59:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>A complete modular resultant algorithm targeted for realization on graphics hardware</title>
		<link>http://gpgpu.org/2010/07/29/modular-resultant-algorithm</link>
		<comments>http://gpgpu.org/2010/07/29/modular-resultant-algorithm#comments</comments>
		<pubDate>Fri, 30 Jul 2010 01:13:23 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Computer Algebra]]></category>
		<category><![CDATA[Modular Arithmetic]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2625</guid>
		<description><![CDATA[Abstract: This paper presents a complete modular approach to computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials, the algorithm first maps them to a prime field for sufficiently many primes, and then processes each modular image individually. We evaluate each polynomial at several points and compute a set of univariate resultants [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>This paper presents a complete modular approach to computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials, the algorithm first maps them to a prime field for sufficiently many primes, and then processes each modular image individually. We evaluate each polynomial at several points and compute a set of univariate resultants for each prime in parallel on the GPU. The remaining &#8220;combine&#8221; stage of the algorithm comprising polynomial interpolation and Chinese remaindering is also executed on the graphics processor. The GPU algorithm returns coefficients of the resultant as a set of Mixed Radix (MR) digits. Finally, the large integer coefficients are recovered from the MR representation on the host machine. With the approach of displacement structure and efficient modular arithmetic we have been able to achieve more than 100x speed-up over a CPU-based resultant algorithm from Maple 13.</p></blockquote>
<p>(Pavel Emeliyanenko: &#8220;A complete modular resultant algorithm targeted for realization on graphics hardware&#8221;, Proceedings of the 4th International Workshop on Parallel and Symbolic Computation (PASCO2010), pages 35-43, Grenoble, France, July 2010. <a href="http://dx.doi.org/10.1145/1837210.1837219" target="_blank">DOI link</a>.  <a href="http://www.mpi-inf.mpg.de/~emeliyan/p35-emeliyanenko.pdf" target="_blank">Direct PDF link</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/29/modular-resultant-algorithm/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters</title>
		<link>http://gpgpu.org/2010/07/29/qymsym</link>
		<comments>http://gpgpu.org/2010/07/29/qymsym#comments</comments>
		<pubDate>Fri, 30 Jul 2010 01:08:55 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Astrophysics]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2622</guid>
		<description><![CDATA[Abstract: We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU accelerated.</p></blockquote>
<p>(Alexander Moore and Alice C. Quillen: &#8220;QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters&#8221;. <a href="http://arxiv.org/abs/1007.3458" target="_blank">preprint on arXiv</a>, <a href="http://astro.pas.rochester.edu/~aquillen/qymsym/" target="_blank">available code</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/29/qymsym/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SMVM on GPU</title>
		<link>http://gpgpu.org/2010/07/29/smvm-on-gpu</link>
		<comments>http://gpgpu.org/2010/07/29/smvm-on-gpu#comments</comments>
		<pubDate>Fri, 30 Jul 2010 01:07:42 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Electromagnetics]]></category>
		<category><![CDATA[Linear Algebra]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Sparse Linear Systems]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2617</guid>
		<description><![CDATA[From the paper&#8217;s abstract: A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture and programmability make them very attractive platforms for [...]]]></description>
			<content:encoded><![CDATA[<p>From the paper&#8217;s abstract:</p>
<blockquote><p>A wide class of finite element electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.</p></blockquote>
<p>From the paper&#8217;s conclusion:</p>
<blockquote><p>We have introduced several efficient techniques to accelerate the execution of the sparse matrix vector multiplication (SMVM) on NVIDIA graphic processing units. The proposed methods increased the performance of the SMVM kernel on GT 8800 up to 18.8 times compared to the quad core CPU and 3 times compared to previous work by Bell and Garland on accelerating SMVM for GPUs.</p></blockquote>
<p>(M. Mehri Dehnavi, D. Fernandez and D. Giannacopoulos: <em>“Finite element sparse matrix vector multiplication on GPUs”</em>. IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, August 2010. DOI <a href="http://dx.doi.org/10.1109/TMAG.2010.2043511" target="_blank">10.1109/TMAG.2010.2043511</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/29/smvm-on-gpu/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems</title>
		<link>http://gpgpu.org/2010/07/29/ocelot-pact2010</link>
		<comments>http://gpgpu.org/2010/07/29/ocelot-pact2010#comments</comments>
		<pubDate>Fri, 30 Jul 2010 01:01:08 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Developer Resources]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Heterogeneous Computing]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Ocelot]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2612</guid>
		<description><![CDATA[Abstract: Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gpgpu.org/wp/wp-content/uploads/2010/07/ocelot.png"><img class="alignright size-full wp-image-2639" title="ocelot" src="http://gpgpu.org/wp/wp-content/uploads/2010/07/ocelot.png" alt="" width="225" height="243" /></a>Abstract:</p>
<blockquote><p>Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmark, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications.</p>
<p>This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.</p>
<p>This paper identifies several key areas of research and open problems for optimizing the performance of data parallel programs (such as CUDA and OpenCL) that were encountered when designing a binary translator from PTX to LLVM/x86.  The complete implementation of Ocelot is available open-source under the new BSD license at <a href="http://code.google.com/p/gpuocelot" target="_blank">http://code.google.com/p/gpuocelot</a>.  Ongoing work involves translating PTX to AMD&#8217;s IL allowing CUDA programs to be executed on AMD GPUs, developing parallel-aware PTX to PTX optimizations, and exploring new programming and execution models that are layered on PTX.</p></blockquote>
<p>(Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili and Nathan Clark: <em>&#8220;Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems&#8221;</em>. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT2010), September 2010.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/29/ocelot-pact2010/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU</title>
		<link>http://gpgpu.org/2010/07/04/debunking-the-100x-myth</link>
		<comments>http://gpgpu.org/2010/07/04/debunking-the-100x-myth#comments</comments>
		<pubDate>Mon, 05 Jul 2010 00:30:41 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Speedup]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2532</guid>
		<description><![CDATA[Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today&#8217;s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an NVIDIA GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.</p></blockquote>
<p>(Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey: <em>&#8220;Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU&#8221;</em>, SIGARCH Computer Architecture News 38(3), pp. 451-460, June 2010. <a href="http://doi.acm.org/10.1145/1816038.1816021" target="_blank">DOI Link</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/04/debunking-the-100x-myth/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GrAVity: A Massively Parallel Antivirus Engine</title>
		<link>http://gpgpu.org/2010/07/04/gravity-antivirus-engine</link>
		<comments>http://gpgpu.org/2010/07/04/gravity-antivirus-engine#comments</comments>
		<pubDate>Mon, 05 Jul 2010 00:28:53 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Pattern Matching]]></category>
		<category><![CDATA[Virus Detection]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2513</guid>
		<description><![CDATA[Abstract: In the ongoing arms race against malware, antivirus software is at the forefront, as one of the most important defense tools in our arsenal. Antivirus software is flexible enough to be deployed from regular users desktops, to corporate e-mail proxies and file servers. Unfortunately, the signatures necessary to detect incoming malware number in the [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In the ongoing arms race against malware, antivirus software is at the forefront, as one of the most important defense tools in our arsenal. Antivirus software is flexible enough to be deployed from regular users desktops, to corporate e-mail proxies and file servers. Unfortunately, the signatures necessary to detect incoming malware number in the tens of thousands. To make matters worse, antivirus signatures are a lot longer than signatures in network intrusion detection systems. This leads to extremely high computation costs necessary to perform matching of suspicious data against those signatures. In this paper, we present GrAVity, a massively parallel antivirus engine. Our engine utilized the compute power of modern graphics processors, that contain hundreds of hardware microprocessors. We have modified ClamAV, the most popular open source antivirus software, to utilize our engine. Our prototype implementation has achieved end-to-end throughput in the order of 20 Gbits/s, 100 times the performance of the CPU-only ClamAV, while almost completely offloading the CPU, leaving it free to complete other tasks. Our micro-benchmarks have measured our engine to be able to sustain throughput in the order of 40 Gbits/s. The results suggest that modern graphics cards can be used effectively to perform heavy-duty anti-malware operations at speeds that cannot be matched by traditional CPU based techniques.</p></blockquote>
<p>(Giorgos Vasiliadis and Sotiris Ioannidis. <em>&#8220;GrAVity: A Massively Parallel Antivirus Engine&#8221;</em>. In Proceedings of the 13th International Symposium On Recent Advances In Intrusion Detection (RAID). September 2010, Ottawa, Canada. <a href="http://www.ics.forth.gr/dcs/Activities/papers/gravity-raid10.pdf" target="_blank">Link to PDF</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/04/gravity-antivirus-engine/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Regular Expression Matching on Graphics Hardware for Intrusion Detection</title>
		<link>http://gpgpu.org/2010/07/04/regular-expression-matching-for-intrusion-detection</link>
		<comments>http://gpgpu.org/2010/07/04/regular-expression-matching-for-intrusion-detection#comments</comments>
		<pubDate>Mon, 05 Jul 2010 00:28:03 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Intrusion Detection]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Pattern Matching]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2514</guid>
		<description><![CDATA[Abstract: The expressive power of regular expressions has been often exploited in network intrusion detection systems, virus scanners, and Spam filtering applications. However, the flexible pattern matching functionality of regular expressions in these systems comes with significant overheads in terms of both memory and CPU cycles, since every byte of the inspected input needs to [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>The expressive power of regular expressions has been often exploited in network intrusion detection systems, virus scanners, and Spam filtering applications. However, the flexible pattern matching functionality of regular expressions in these systems comes with significant overheads in terms of both memory and CPU cycles, since every byte of the inspected input needs to be processed and compared against a large set of regular expressions.</p>
<p>In this paper we present the design, implementation and evaluation of a regular expression matching engine running on graphics processing units (GPUs). The significant spare computational power and data parallelism capabilities of modern GPUs permits the efficient matching of multiple inputs at the same time against a large set of regular expressions. Our evaluation shows that regular expression matching on graphics hardware can result to a 48 times speedup over traditional CPU implementations and up to 16 Gbit/s in processing throughput. We demonstrate the feasibility of GPU regular expression matching by implementing it in the popular Snort intrusion detection system, which results to a 60% increase in the packet processing throughput.</p></blockquote>
<p>(Giorgos Vasiliadis, Michalis Polychronakis, Spiros Antonatos, Evangelos P. Markatos and Sotiris Ioannidis: <em>&#8220;Regular Expression Matching on Graphics Hardware for Intrusion Detection&#8221;</em>. In Proceedings of the 12th International Symposium On Recent Advances In Intrusion Detection (RAID). September 2009, Saint-Malo, France. <a href="http://www.ics.forth.gr/dcs/Activities/papers/gnort-regexp.raid09.pdf" target="_blank">Link to PDF</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/07/04/regular-expression-matching-for-intrusion-detection/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster</title>
		<link>http://gpgpu.org/2010/06/23/high-order-finite-element-seismic-wave</link>
		<comments>http://gpgpu.org/2010/06/23/high-order-finite-element-seismic-wave#comments</comments>
		<pubDate>Thu, 24 Jun 2010 03:19:58 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Finite Element Methods]]></category>
		<category><![CDATA[High-Performance Computing]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Scientific Computing]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2491</guid>
		<description><![CDATA[Abstract: We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing very optimized implementation in C language and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.</p></blockquote>
<p>(<a href="http://www.univ-pau.fr/~dkomati1" target="_blank">Dimitri Komatitsch</a>, <a href="http://www.sc.fsu.edu/~erlebach" target="_blank">Gordon Erlebacher</a>, <a href="http://www.mathematik.tu-dortmund.de/~goeddeke" target="_blank">Dominik Göddeke</a> and David Michéa: <em>&#8220;High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster&#8221;</em>, accepted for publication in: Journal of Computational Physics, Jun. 2010. <a href="http://web.univ-pau.fr/~dkomati1/published_papers/JCP_multiGPUs_2010.pdf" target="_blank">PDF preprint</a>. <a href="http://dx.doi.org/10.1016/j.jcp.2010.06.024" target="_blank">DOI link</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/06/23/high-order-finite-element-seismic-wave/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!&#8221;</title>
		<link>http://gpgpu.org/2010/05/30/ibm-rc24982</link>
		<comments>http://gpgpu.org/2010/05/30/ibm-rc24982#comments</comments>
		<pubDate>Sun, 30 May 2010 21:40:08 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Image Processing]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2339</guid>
		<description><![CDATA[Abstract: In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.</p></blockquote>
<p>(Rajesh Bordawekar, Uday Bondhugula and Ravi Rao: <em>&#8220;Believe It or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!&#8221;</em>. <a href="http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/9192e6536facfcef85257720005a0265!OpenDocument&#038;Highlight=0,Bordawekar" target="_blank">Technical Report RC24982</a>, IBM Thomas J. Watson Research Center, Apr. 2010.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/05/30/ibm-rc24982/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>GPU Random Numbers via the Tiny Encryption Algorithm</title>
		<link>http://gpgpu.org/2010/05/20/gpu-random-numbers-tiny-encryption</link>
		<comments>http://gpgpu.org/2010/05/20/gpu-random-numbers-tiny-encryption#comments</comments>
		<pubDate>Thu, 20 May 2010 21:35:02 +0000</pubDate>
		<dc:creator>dom</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[High-Performance Graphics]]></category>
		<category><![CDATA[NVIDIA CUDA]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Random Number Generation]]></category>

		<guid isPermaLink="false">http://gpgpu.org/?p=2296</guid>
		<description><![CDATA[Abstract: Random numbers are extensively used on the GPU. As more computation is ported to the GPU, it can no longer be treated as rendering hardware alone. Random number generators (RNG) are expected to cater general purpose and graphics applications alike. Such diversity adds to expected requirements of a RNG. A good GPU RNG should [...]]]></description>
			<content:encoded><![CDATA[<p>Abstract:</p>
<blockquote><p>Random numbers are extensively used on the GPU. As more computation is ported to the GPU, it can no longer be treated as rendering hardware alone. Random number generators (RNG) are expected to cater general purpose and graphics applications alike. Such diversity adds to expected requirements of a RNG. A good GPU RNG should be able to provide repeatability, random access, multiple independent streams, speed, and random numbers free from detectable statistical bias. A specific application may require some if not all of the above characteristics at one time. In particular, we hypothesize that not all algorithms need the highest-quality random numbers, so a good GPU RNG should provide a speed quality tradeoff that can be tuned for fast low quality or slower high quality random numbers.</p>
<p>We propose that the Tiny Encryption Algorithm satisfies all of the requirements of a good GPU Pseudo Random Number Generator. We compare our technique against previous approaches, and present an evaluation using standard randomness test suites as well as Perlin noise and a Monte-Carlo shadow algorithm. We show that the quality of random number generation directly affects the quality of the noise produced, however, good quality noise can still be produced with a lower quality random number generator.</p></blockquote>
<p>(Fahad Zafar, Aaron Curtis and Marc Olano, <em>&#8220;GPU Random Numbers via the Tiny Encryption Algorithm&#8221;</em>, HPG 2010: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on High Performance Graphics, Saarbrücken, Germany, June 2010. <a href="http://www.csee.umbc.edu/~olano/papers/#GPUTEA" target="_blank">Link to preprint</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gpgpu.org/2010/05/20/gpu-random-numbers-tiny-encryption/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
