Hacker News | frtime3d's comments

> If they specifically tried to cheat at this benchmark it would be obvious and they would be called out

I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”

Most people trust a benchmark if it seems to be a reasonable test of something they assume may be relevant to them. They may not actually want a pelican on a bicycle, but they do want an LLM that could produce one.

