MAGMA 1.0 RC1 is now available, including the MAGMA sources. MAGMA 1.0 RC1 targets a single CUDA-enabled NVIDIA GPU. It extends version 0.2 by adding support for Fermi GPUs (see the sample performance results for LU, QR, and Cholesky).
Included are routines for the following algorithms:
- LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double);
- Linear solvers based on LU, QR, and Cholesky in both real and complex arithmetic (single and double);
- Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in both real and complex arithmetic;
- MAGMA BLAS in real arithmetic (single and double), including gemm, gemv, symv, and trsm.
See the MAGMA homepage for a download link.
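The mixed-precision solvers mentioned above follow a well-known pattern: do the expensive factorization/solve in fast single precision, then recover double-precision accuracy through cheap residual corrections. The sketch below illustrates that idea with NumPy; it is a conceptual illustration, not MAGMA's API, and the function names are ours.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Illustrative mixed-precision iterative refinement:
    solve in float32, refine the residual in float64."""
    A32 = A.astype(np.float32)
    # Low-precision solve (a real implementation would reuse an LU/QR/
    # Cholesky factorization of A32 here instead of re-solving).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) + 100 * np.eye(100)  # well conditioned
b = rng.standard_normal(100)
x = mixed_precision_solve(A, b)
```

For well-conditioned systems, a handful of refinement steps recovers full double-precision accuracy while the dominant cost stays in the (much faster) single-precision arithmetic.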
The ASIM (Arbeitsgruppe Simulation) and the TUM are jointly organizing the ASIM Workshop 2011 at Technische Universität München (TUM) and the Leibniz Supercomputing Centre, Germany. The workshop theme is “Trends in Computational Science and Engineering: Foundations of Modeling and Simulation” and will take place March 14 to March 16, 2011. The conference program consists of two building blocks: contributed talks and an extensive poster session for new and upcoming Ph.D. students. Poster submissions are cordially invited; registration closes February 12, 2011. More information is available at http://www5.in.tum.de/asim2011.html.
From a press release:
Nexiwave.com, the speech indexing company, announced a partnership with UbiCast, a leading webcast equipment and hosting provider. Through the partnership, UbiCast will become the first company to offer deep audio search as a standard, cost-effective feature to customers.
Florent Thiery, CTO of UbiCast, said: “UbiCast customers produce large amounts of high-value content, but finding and retrieving archived information has been a challenge. Until now, rich spoken content has not been searchable on a broad scale because it was simply too expensive to process. The new Nexiwave.com technology, which is accelerated by GPUs, is making ubiquitous processing cost-justifiable for the first time ever.”
The OpenFOAM SpeedIT plugin version 1.1 has been released under the GPL license. The most important new features are:
- Multi-GPU support
- Tested on Fermi architecture (GTX460 and Tesla C2050)
- Automated submission of the domain to the GPU cards (using decomposePar from OpenFOAM)
- Optimized distribution of computational tasks to the best available GPU cards in the system, for any number of computational threads
- Automatic selection of the most powerful GPU card for single-thread cases
The OpenFOAM SpeedIT plugin is available at http://speedit.vratis.com.
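Selecting "the most powerful GPU card" typically means ranking devices by a throughput proxy derived from their properties. The sketch below illustrates one such heuristic; the device records and the scoring rule are our own illustration (modeled loosely on CUDA device properties), not SpeedIT's actual code.

```python
# Hypothetical device records, modeled on CUDA device properties;
# the ranking heuristic below is an illustration, not SpeedIT's code.
devices = [
    {"name": "GTX 460",     "multiprocessors": 7,  "clock_mhz": 1350},
    {"name": "Tesla C2050", "multiprocessors": 14, "clock_mhz": 1150},
]

def gpu_score(dev):
    # Rough throughput proxy: multiprocessor count times clock rate.
    return dev["multiprocessors"] * dev["clock_mhz"]

def pick_best_gpu(devs):
    return max(devs, key=gpu_score)

best = pick_best_gpu(devices)
```

With these (illustrative) figures, the Tesla C2050 wins, matching the intuition that a Fermi compute card outranks a consumer GTX 460 for single-thread cases.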
Next year’s spring meeting of the German Physical Society (DPG) in Dresden, Germany, includes a focus session “GPU Computing”. This session is jointly organized by the divisions “Physics of Socio-Economic Systems (SOE)” and “Statistical Physics and Dynamics (DY)”, so a large audience is expected. Although this is the annual meeting of the German Physical Society, it has become an international meeting where almost all of the talks are presented in English. It is a large and diverse meeting with about 7000 participants altogether. The meeting takes place March 14-18, 2011. Abstracts for contributions are cordially invited and should be submitted online by Wednesday, December 1.
A new major release of rCUDA™ (Remote CUDA), the open-source package that allows CUDA calls to be performed on remote GPUs, is now available. The major improvements in the new version are:
- Updated API to 3.1
- Server now uses Runtime API when possible (CUDA >= 3.1 required)
- Introduced support for the most common CUBLAS routines
- Fixed some bugs
- Added AF_UNIX sockets support to enhance performance on local executions
- Added some load balancing capabilities to the server
- General performance improvements
- Officially added Fermi support
Further information is available from the rCUDA™ webpages http://www.gap.upv.es/rCUDA and http://www.hpca.uji.es/rCUDA.
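The core idea behind rCUDA is to intercept CUDA API calls and forward them over a socket to a server that owns the GPU; the new AF_UNIX transport avoids TCP overhead when client and server share a machine. The sketch below illustrates that forwarding pattern over an AF_UNIX socket pair. The wire format, call names, and dispatch are entirely our own invention for illustration, not rCUDA's protocol.

```python
import json
import socket
import threading

# Minimal sketch of forwarding an API call over an AF_UNIX socket,
# as rCUDA does for local executions. The JSON wire format is invented.
server_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

def server():
    req = json.loads(server_sock.recv(4096).decode())
    # A real server would dispatch to the CUDA runtime on its local GPU;
    # here we just fabricate a success status for the sketch.
    result = {"call": req["call"], "status": 0}
    server_sock.sendall(json.dumps(result).encode())

t = threading.Thread(target=server)
t.start()
client_sock.sendall(json.dumps({"call": "cudaMalloc", "args": [1024]}).encode())
reply = json.loads(client_sock.recv(4096).decode())
t.join()
```

The same client code works unchanged whether the transport is a local AF_UNIX socket or a TCP connection to a remote node, which is what lets rCUDA make remote GPUs look local.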
For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs. Key insights of this article:
- Throughput-oriented processors tackle problems where parallelism is abundant, yielding design decisions different from those of more traditional latency-oriented processors.
- Because of this design, programming throughput-oriented processors requires much more emphasis on parallelism and scalability than programming sequential processors.
- GPUs are the leading exemplars of modern throughput-oriented architecture, providing a ubiquitous commodity platform for exploring throughput-oriented programming.
(Michael Garland and David B. Kirk, “Understanding throughput-oriented architectures”, Communications of the ACM 53(11), 58-66, Nov. 2010. [DOI])
Papers are solicited for the 2011 Symposium on Application Accelerators in High-Performance Computing. Presentations from technology developers and the academic user community are invited on the following topics:
- novel accelerator processors, systems, and architectures
- integration of accelerators with high-performance computing systems
- programming models for accelerator-based computing
- languages and compilers for accelerator-based computing
- run-time environments, profiling and debugging tools for accelerator-based computing
- scientific and engineering applications that use application accelerators
In addition to the general session, submissions are invited for the following domain-specific topics:
- Computational chemistry on accelerators (Chair: TBD)
- Lattice QCD (Chair: Steven Gottlieb, Indiana University, Bloomington)
- Weather and climate modeling (Chair: John Michalakes, National Renewable Energy Laboratory)
- Bioinformatics (Chair: TBD)
Submissions are due May 6, 2011, and more information can be found at the symposium website www.saahpc.org.
CUDA 3.2 has been released and can be downloaded from http://developer.nvidia.com/object/cuda_3_2_downloads.html. New features include:
New and Improved CUDA Libraries
- CUBLAS performance improved 50% to 300% on Fermi architecture GPUs, for matrix multiplication of all datatypes and transpose variations
- CUFFT performance tuned for radix-3, -5, and -7 transform sizes on Fermi architecture GPUs, now 2x to 10x faster than MKL
- New CUSPARSE library of GPU-accelerated sparse matrix routines for sparse/sparse and dense/sparse operations delivers 5x to 30x faster performance than MKL
- New CURAND library of GPU-accelerated random number generation (RNG) routines, supporting Sobol quasi-random and XORWOW pseudo-random routines at 10x to 20x faster than similar routines in MKL
- H.264 encode/decode libraries now included in the CUDA Toolkit
CUDA Driver & CUDA C Runtime
- Support for new 6GB Quadro and Tesla products
- New support for enabling high-performance Tesla Compute Cluster (TCC) mode on Tesla GPUs in Windows desktop workstations
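CURAND's default pseudo-random generator is based on Marsaglia's xorwow recurrence: five 32-bit xorshift state words combined with a Weyl-sequence counter. The sketch below implements that published recurrence in Python to show how it works; the simple seeding scheme is our own, not CURAND's, and a real GPU implementation runs many such generators in parallel, one per thread.

```python
MASK = 0xFFFFFFFF  # keep arithmetic in 32 bits, as on the GPU

class XorwowSketch:
    """Marsaglia's xorwow recurrence (the basis of CURAND's XORWOW
    generator). The seeding below is a simple placeholder, not CURAND's."""

    def __init__(self, seed=12345):
        # Five nonzero xorshift state words plus a Weyl counter d.
        self.x, self.y, self.z, self.w, self.v = (
            ((seed + i * 0x9E3779B9) & MASK) or 1 for i in range(5))
        self.d = 0

    def next32(self):
        t = (self.x ^ (self.x >> 2)) & MASK
        self.x, self.y, self.z, self.w = self.y, self.z, self.w, self.v
        self.v = ((self.v ^ ((self.v << 4) & MASK))
                  ^ (t ^ ((t << 1) & MASK))) & MASK
        self.d = (self.d + 362437) & MASK      # Weyl sequence step
        return (self.d + self.v) & MASK

gen = XorwowSketch(42)
sample = [gen.next32() for _ in range(5)]
```

Each generator is tiny (six words of state), which is why thousands of independent instances fit comfortably in GPU thread-local storage.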
From a recent announcement:
We are excited to announce the immediate availability of Cluster GPU Instances for Amazon EC2, a new instance type designed to deliver the power of GPU processing in the cloud. GPUs are increasingly being used to accelerate the performance of many general purpose computing problems. However, for many organizations, GPU processing has been out of reach due to the unique infrastructural challenges and high cost of the technology. Amazon Cluster GPU Instances remove this barrier by providing developers and businesses immediate access to the highly tuned compute performance of GPUs with no upfront investment or long-term commitment.
Learn more about the new Cluster GPU instances for Amazon EC2 and their use in running HPC applications.
Also, community support is becoming available; see for instance this blog post about SCG-Ruby on EC2 instances.