
The goal of this eval is a bit unclear to me, especially given the example questions. They're mostly trivia and minutiae, like asking about goals scored in sports matches, which does fit their stated aim of testing factual knowledge. But will this ever be possible by an LLM without web browsing, which they deliberately removed while evaluating?





I think the interesting thing here is the difference between Not Attempted and Incorrect: the goal seems to be to reduce hallucination.

From that perspective, o1-mini seems to perform the best, but only as long as enabling web browsing makes up for its lack of base factuality.
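To make the distinction between declining to answer and answering wrongly concrete, here's a rough sketch. The function name, the metric names, and the counts are all made up by me for illustration; none of it comes from the eval's published numbers or its actual scoring code.

```python
def summarize(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Split raw grade counts into rates that separate abstaining from being wrong."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    return {
        # Fraction of all questions answered correctly.
        "overall_correct": correct / total,
        # Fraction of all questions the model tried to answer at all.
        "attempt_rate": attempted / total,
        # Accuracy restricted to the questions it chose to answer.
        "correct_given_attempted": correct / attempted if attempted else 0.0,
        # Fraction of all questions answered wrongly (a rough hallucination proxy).
        "incorrect_rate": incorrect / total,
    }


# Hypothetical counts per 100 questions, NOT real results for any model.
cautious = summarize(correct=10, incorrect=10, not_attempted=80)
eager = summarize(correct=30, incorrect=65, not_attempted=5)

print("cautious:", cautious)
print("eager:   ", eager)
```

With these invented counts, the "eager" model gets more answers right overall, but the "cautious" one is wrong far less often and is more accurate on the questions it does attempt. That's the sense in which a model can look good on hallucination while still having weak base factuality.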

>But will this ever be possible by an LLM?

Why not? Just train an unbelievably gigantic LLM that encodes all human knowledge. A hundred trillion parameters ought to do it.



