This eval’s goal is a bit unclear to me, especially given the example questions....

petesergeant · 2024-10-30T21:13:29 1730322809

I think the interesting thing here is the difference between Not Attempted and Incorrect: the goal seems to be to reduce hallucination.

yunohn · 2024-10-31T09:48:14 1730368094

From that perspective, o1-mini seems to perform the best, but only as long as enabling web browsing makes up for its lack of base factuality.

sbierwagen · 2024-10-30T22:32:46 1730327566

>But will this ever be possible by an LLM?

Why not? Just train an unbelievably gigantic LLM that encodes all human knowledge. A hundred trillion parameters ought to do it.