
> if you're implementing some algorithm directly at the gate level for cryptography or signal processing or whatever, then being able to arrange inputs and outputs into dataflows is a big win, with no round trips to general purpose registers or bypass networks

This is true, but keep in mind that that sort of algorithm also runs insanely well on any CPU or GPU, because they, too, avoid touching main memory. You would be blown away by how much work a CPU can do if you keep the working set within L1 cache.

Re. ASICs, it's a continuum:

- "flexible, low performance, cheap in small quantities" (CPUs)

- "reasonably flexible, better performance, cheap-ish in small quantities" (GPUs)

- "inflexible, best performance, expensive in small quantities" (ASICs)

FPGAs fit somewhere between GPUs and ASICs -- limited flexibility, potentially great performance, moderate small-quantity price.

If your problem is too big for GPUs, then, as you say, sometimes it's easiest to jump straight to an ASIC. But that's such a narrow window in the HPC landscape. The vast majority of customers, even those with large problems, are just buying a lot of GPUs. They're using off-the-shelf frameworks even when a custom CUDA kernel would give them 10x the performance at 10% of the cost. The cost of moving to an FPGA is too great, and the performance gain simply isn't there.
