"throw the sucker at a video card, and watch it finish thousands of times faster on cheaper hardware"
This is nonsense even for dense operations. But this matrix is sparse, in which case GPUs are within a modest factor (2-5 or so depending on the matrix, whether you use multiple cores on the CPU, and whether you ask NVidia or Intel). And if the algorithm does not expose a lot of concurrency (as with most multiplicative algorithms), the GPU can be much slower than a CPU.
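For context, the kernel being argued about is just a sparse matrix-vector product. A rough CPU-side sketch in CSR format (the names spmv_csr, row_ptr, col_idx, vals are mine, not from any particular library) shows why: each stored nonzero costs about two flops but also reads of vals, col_idx, and x, so the kernel is limited by memory bandwidth rather than arithmetic, and the GPU advantage tracks the bandwidth ratio, not the flop ratio.

    // Sketch of y = A*x for a sparse matrix A stored in CSR format.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    void spmv_csr(const std::vector<std::size_t>& row_ptr,
                  const std::vector<int>&         col_idx,
                  const std::vector<double>&      vals,
                  const std::vector<double>&      x,
                  std::vector<double>&            y)
    {
        const std::size_t nrows = row_ptr.size() - 1;
        for (std::size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            // Accumulate the nonzeros of row i.
            for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += vals[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }

    int main()
    {
        // 2x2 example: [[4, 1], [0, 3]] * [1, 2] = [6, 6]
        std::vector<std::size_t> row_ptr = {0, 2, 3};
        std::vector<int>         col_idx = {0, 1, 1};
        std::vector<double>      vals    = {4.0, 1.0, 3.0};
        std::vector<double>      x       = {1.0, 2.0}, y(2);
        spmv_csr(row_ptr, col_idx, vals, x, y);
        std::printf("%g %g\n", y[0], y[1]);  // prints: 6 6
    }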
The right answer is somewhere in the middle. Of course the speedup depends on what you are comparing, but if you benchmark a GPU against a decent four-core CPU, the speedup is about an order of magnitude.
jedbrown, please provide a source for your estimates. Of course, I'm interested in highly optimized libraries like BLAS; hand-written code would be several times slower on both systems.
"Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU" (from Intel, but using the best published GPU implementations): http://doi.acm.org/10.1145/1815961.1816021
BTW, you may as well cite CUSP (http://code.google.com/p/cusp-library/) for the sparse implementation; it's not part of CUDA, despite being developed by Nathan Bell and Michael Garland (NVidia employees).
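For anyone who wants to try it, a GPU SpMV with CUSP looks roughly like the sketch below. It mirrors CUSP's documented usage (csr_matrix, array1d, gallery::poisson5pt, multiply), but I haven't compiled this exact snippet and the grid size is arbitrary, so treat it as an illustration and check the CUSP docs.

    // Sketch of y = A*x on the GPU using CUSP (compile with nvcc).
    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/multiply.h>
    #include <cusp/gallery/poisson.h>

    int main(void)
    {
        // Assemble a 5-point Laplacian test matrix directly in device memory.
        cusp::csr_matrix<int, float, cusp::device_memory> A;
        cusp::gallery::poisson5pt(A, 1024, 1024);

        // Input vector of ones, output vector of zeros, both on the device.
        cusp::array1d<float, cusp::device_memory> x(A.num_cols, 1.0f);
        cusp::array1d<float, cusp::device_memory> y(A.num_rows, 0.0f);

        // y = A * x
        cusp::multiply(A, x, y);

        return 0;
    }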