I know it's hard to objectively rank LLMs, but those are really ridiculous ways to keep track of performance.
If my reference for performance is (like the vast majority of users) ChatGPT-3.5, I first have to know how Llama 3 compares to that in order to understand how these new models compare to what I'm using at the moment.
Now, if I look for the performance of Llama 3 compared to ChatGPT-3.5, I don't find it on the official launch page https://ai.meta.com/blog/meta-llama-3/, where it is compared to Gemma 7B-It, Mistral 7B Instruct, Gemini Pro 1.5, and Claude 3 Sonnet.
How does Gemma 7B perform? You can only find out how it compares to Llama 2 on its official launch page https://blog.google/technology/developers/gemma-open-models/.
Let's look at the Llama 2 performance on its launch announcement: https://llama.meta.com/llama2/ No GPT-3.5 Turbo there either.
I get that there are multiple aspects and that there's probably not one overall "performance" metric across all tasks, and I get that you can probably find a comparison between two specific models relatively easily, but there absolutely needs to be a standard by which these results are communicated. The number of hoops to jump through is ridiculous.
Llama 3 8B significantly outperforms ChatGPT-3.5, and Llama 3 70B is significantly better than that. These are Elo ratings, so it would not be accurate to say X is 10% better than Y just because its score is 10% higher.
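To make the Elo point concrete: rating differences map to expected win rates in head-to-head comparisons, not to linear quality differences. A minimal sketch of the standard Elo formula (the ratings in the example are made up for illustration, not taken from any actual leaderboard):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred over model B,
    given their Elo ratings (standard Elo formula, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings, for illustration only.
print(elo_win_probability(1150, 1100))  # ~0.57: a 50-point gap means ~57% preference rate
print(elo_win_probability(1300, 1100))  # ~0.76: a 200-point gap means ~76% preference rate
```

So a higher rating tells you how often one model wins a pairwise preference vote, which is not the same thing as being "X% better" on any task.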
Obviously Falcon 2 is too new to be on the leaderboard yet.
Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Human preference does not always favor the model that is best at reasoning, code, accuracy, or whatever else. In particular, there's a recent article suggesting that Llama 3's friendly and direct chattiness contributes to its good standing on the leaderboard.
Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.
If you know of better benchmark-based leaderboards where the benchmark data hasn't leaked into the training sets, I'd love to see them, but just giving up on everything isn't a good option.
The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.
Oh, I didn't mean that. I think it's the best benchmark we have, just that it's not necessarily representative of the ordering in any domain other than generic human preference. So while Llama 3 is high up there, we should not conclude, for example, that it is better at reasoning than all the models below it (especially true for the 8B model).
I find that kind of surprising; the lack of "customer service voice" is one of the main reasons I prefer the Mistral models over OpenAI's, even if the latter are somewhat better at complex/specific tasks.
> Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Yet I guarantee you that ChatGPT-3.5 has 95% of the "direct to consumer" market share.
Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
Now, onto the ranking: I fully acknowledged in my original comment that those comparisons exist, just that they're not highlighted properly in the launch announcement of any new model.
I haven't used Llama, only ChatGPT and the multiple versions of Claude 2 and 3. How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
> Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
> How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
You're not. These press releases are for the "technical users" that have heard of and used all of these alternatives.
They are not offering a Falcon 2 chat service you can use today. They aren't even offering a chat-tuned Falcon 2 model. The Falcon 2 model in question is a base model, not a chat model.
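To illustrate the difference for the technical audience: a base model just continues text; it hasn't been tuned to follow instructions or hold a conversation. A rough sketch using Hugging Face transformers (the model id tiiuae/falcon-11B is my assumption of where the Falcon 2 base model is published; check the actual release):

```python
from transformers import pipeline

# Assumed model id for the Falcon 2 11B base model; verify against the official release.
generator = pipeline("text-generation", model="tiiuae/falcon-11B")

# A base model simply continues the prompt: there is no chat template,
# no system prompt, and no instruction-following behavior to rely on.
print(generator("The capital of France is", max_new_tokens=10)[0]["generated_text"])
```

That's usable if you're building something on top of it, but it's not a chatbot anyone can just open in a browser and talk to.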
Unless someone is very technical, Falcon 2 is not relevant to them in any way at this point. This is a forum of technical people, which is why it's getting some attention, but I suspect it's still not going to be relevant to most people here.