Matrix multiplication on GPUs is non-deterministic, as are things like cumsum() ...

zadikian · 2026-05-04T22:32:26 1777933946

Even if it were reproducible, realistically most people are using some service like Claude that makes no guarantee that the model or hardware hasn't changed. That's fine; it doesn't need reproducibility.

This is interesting, though: I didn't know PyTorch had a debug mode for reproducibility.

jmalicki · 2026-05-04T23:29:16 1777937356

Even with this debug mode, a different batch size can give different results for the same input. For example, your tensor multiplies might use different blocking, which changes the order in which partial sums are accumulated, and floating-point addition is not associative.
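
To make the blocking point concrete, here is a minimal pure-Python sketch (no GPU or PyTorch involved). The helper `blocked_sum` and the input values are made up for illustration; real GPU kernels tile their reductions in far more elaborate ways, but the rounding effect is the same in kind.

```python
def blocked_sum(values, block_size):
    """Sum each block left to right, then add up the per-block partial sums.

    Hypothetical helper: it only mimics how a tiled reduction groups
    additions; it is not how any particular GPU library does it.
    """
    total = 0.0
    for start in range(0, len(values), block_size):
        partial = 0.0
        for v in values[start:start + block_size]:
            partial += v
        total += partial
    return total


# Values chosen so that rounding depends on how the additions are grouped.
values = [1e16, 1.0, -1e16, 1.0]

one_block = blocked_sum(values, 4)   # ((1e16 + 1.0) + -1e16) + 1.0
two_blocks = blocked_sum(values, 2)  # (1e16 + 1.0) + (-1e16 + 1.0)

print(one_block, two_blocks)  # 1.0 0.0
```

Same inputs, same arithmetic, different grouping: one blocking returns 1.0 and the other 0.0 (the exact sum is 2.0). In real inference the discrepancies are usually in the last few bits rather than this dramatic.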

I posted that to show that, at a bare minimum, nondeterminism is pervasive (though probably mild in effect) in even the most pedestrian GPU inference, unless you go to the extreme of using the debug mode and accepting the potential performance hit.

vrighter · 2026-05-04T10:57:54 1777892274

well just run all inference on the cpu, single threaded /s