As an XLA:GPU person, I'm curious how the performance of Julia natively compiling to CUDA compares to using XLA:GPU.
In particular, is this a promising approach, or do you see it as a dead end compared to generating GPU code natively? If it's promising, are there things we need to do in XLA:GPU to make it less awful for you?
(Reasons you might want to use XLA:GPU include: you don't have to reinvent all our performance and correctness hacks for cuDNN, and maybe our kernels run faster since we're targeting such a limited domain?)
We've been meaning to run this comparison, but haven't gotten around to it yet. I expect it to work and am hoping to see some performance benefits. It should be fairly straightforward to get it working; the only reason we haven't tried so far is that we only have XRT hooked up and the TF infeed ops are not open source, so the existing code doesn't just work. It should be straightforward to hook up the XLA service instead, but that's a bit of additional code that we haven't gotten to writing.
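For context, the "Julia natively compiling to CUDA" side of the comparison looks roughly like this: Julia's compiler lowers an ordinary Julia function to PTX and launches it as a kernel. A minimal sketch using the CUDA.jl API (the kernel name and launch configuration here are illustrative, not from either codebase):

```julia
using CUDA

# Elementwise add, written as plain Julia and compiled to PTX
# by Julia's native GPU codegen rather than going through XLA.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

# Launch with an illustrative configuration: 4 blocks of 256 threads.
@cuda threads=256 blocks=4 vadd!(c, a, b)
```

The XLA:GPU path instead traces the computation into an HLO graph and hands it to XLA for fusion and codegen, which is where the cuDNN-backed kernels mentioned above come in.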