I know it's hard to objectively rank LLMs, but those are really ridiculous ways to keep track of performance.
If my reference for performance is (like the vast majority of users) ChatGPT-3.5, I first have to know how Llama 3 compares to that in order to understand how these new models compare to what I'm using at the moment.
Now, if I look for the performance of Llama 3 compared to ChatGPT-3.5, I don't find it on the official launch page https://ai.meta.com/blog/meta-llama-3/, where it is compared to Gemma 7B-It, Mistral 7B Instruct, Gemini Pro 1.5, and Claude 3 Sonnet.
How does Gemma 7B perform? You can only find out how it compares to Llama 2 on its official launch page https://blog.google/technology/developers/gemma-open-models/.
Let's look at the Llama 2 performance on its launch announcement: https://llama.meta.com/llama2/ No GPT-3.5 Turbo there either.
I get that there are multiple aspects and that there's probably not one overall "performance" metric across all tasks, and I get that you can probably find a comparison between two specific models relatively easily, but there absolutely needs to be a standard by which these results are communicated. The number of hoops to jump through is ridiculous.
Llama 3 8B significantly outperforms ChatGPT-3.5, and Llama 3 70B is significantly better than that. These are Elo ratings, so it would not be accurate to say X is 10% better than Y just because its score is 10% higher.
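To make the Elo point concrete: rating differences map to expected win rates in head-to-head comparisons, not to linear quality differences. A minimal sketch of the standard Elo formula (the ratings in the example are made up for illustration, not taken from any actual leaderboard):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred over model B,
    given their Elo ratings (standard Elo formula, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings, for illustration only.
print(elo_win_probability(1150, 1100))  # ~0.57: a 50-point gap means ~57% preference rate
print(elo_win_probability(1300, 1100))  # ~0.76: a 200-point gap means ~76% preference rate
```

So a higher rating tells you how often one model wins a pairwise preference vote, which is not the same thing as being "X% better" on any task.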
Obviously Falcon 2 is too new to be on the leaderboard yet.
Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Human preference does not always favor the model that is best at reasoning, code, accuracy, or whatever else. In particular, there's a recent article suggesting that Llama 3's friendly and direct chattiness contributes to its good standing on the leaderboard.
Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.
If you know of better benchmark-based leaderboards where the benchmark data hasn't leaked into the training sets, I'd love to see them, but just giving up on everything isn't a good option.
The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.
Oh, I didn't mean that. I think it's the best benchmark we have, just that it's not necessarily representative of the ordering in any domain other than generic human preference. So while Llama 3 is high up there, we should not conclude, for example, that it is better at reasoning than all the models below it (especially true for the 8B model).
I find that kind of surprising; the lack of "customer service voice" is one of the main reasons I prefer the Mistral models over OpenAI's, even if the latter are somewhat better at complex/specific tasks.
> Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Yet I guarantee you that ChatGPT-3.5 has 95% of the "direct to consumer" market share.
Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
Now, onto the ranking: I fully acknowledged in my original comment that those comparisons exist, just that they're not highlighted properly in the launch announcement of any new model.
I haven't used Llama, only ChatGPT and the multiple versions of Claude 2 and 3. How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
> Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
> How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
You're not. These press releases are for the "technical users" that have heard of and used all of these alternatives.
They are not offering a Falcon 2 chat service you can use today. They aren't even offering a chat-tuned Falcon 2 model. The Falcon 2 model in question is a base model, not a chat model.
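To illustrate the difference for the technical audience: a base model just continues text; it hasn't been tuned to follow instructions or hold a conversation. A rough sketch using Hugging Face transformers (the model id tiiuae/falcon-11B is my assumption of where the Falcon 2 base model is published; check the actual release):

```python
from transformers import pipeline

# Assumed model id for the Falcon 2 11B base model; verify against the official release.
generator = pipeline("text-generation", model="tiiuae/falcon-11B")

# A base model simply continues the prompt: there is no chat template,
# no system prompt, and no instruction-following behavior to rely on.
print(generator("The capital of France is", max_new_tokens=10)[0]["generated_text"])
```

That's usable if you're building something on top of it, but it's not a chatbot anyone can just open in a browser and talk to.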
Unless someone is very technical, Falcon 2 is not relevant to them in any way at this point. This is a forum of technical people, which is why it's getting some attention, but I suspect it's still not going to be relevant to most people here.