https://openreview.net/pdf?id=mdA5lVvNcU
And the review is pretty damning regarding the statistical validity of LLM benchmarks.