LLM benchmarks are largely irrelevant when it comes to "state of the art". They tell you if the model does poorly, but they are not at all a reliable signal of whether it does well.
Open-weights models are still lagging quite a bit behind SOTA. E.g. there's still no open model that can match GPT-5 Pro or Gemini 2.5 Pro, and the latter is almost a year old by now.