Hacker News: XCSme's comments

So, to give an example of my idea, since some people asked why there is a "reasoning model" when I said non-reasoning: the "reasoning model" is just an instant model trained to output text that looks like reasoning.

user: "how many Rs are in strawberry" (A)

↓ reasoning model(A): "The user is asking to count the Rs in 'strawberry'. s t r a w b e r r y, I see 3 Rs. Let me double check stRawbeRRy. Yes, 3 Rs." (B)

↓ summarization model(B): "The answer is 3 Rs in 'strawberry'" (C)

↓ answer model(A,C): "There are 3 Rs in 'strawberry'."
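A minimal sketch of this three-stage chain in Python. The `call_model` helper and its canned outputs are stand-ins for real LLM API calls (no actual API is being used here); only the wiring between the stages matters:

```python
def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM API call; responses are canned for illustration."""
    canned = {
        "reasoning": ("The user is asking to count the Rs in 'strawberry'. "
                      "s t r a w b e r r y, I see 3 Rs. Let me double check "
                      "stRawbeRRy. Yes, 3 Rs."),
        "summarizer": "The answer is 3 Rs in 'strawberry'",
        "answer": "There are 3 Rs in 'strawberry'.",
    }
    return canned[model]

def pipeline(user_prompt: str) -> str:
    # (B) instant model trained to emit reasoning-looking text
    reasoning = call_model("reasoning", user_prompt)
    # (C) compress the reasoning trace down to its conclusion
    summary = call_model("summarizer", reasoning)
    # final answer conditioned on the original prompt (A) plus the summary (C)
    return call_model("answer", f"{user_prompt}\n{summary}")

print(pipeline("how many Rs are in strawberry"))
```

The point is that each stage is itself a cheap instant model; the "reasoning" is just another text-generation pass, not a different inference mode.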



Does this LLM benchmark have any actual credibility? I get why they chose not to publish the actual tests, but I find it highly dubious that there are only 15 tests and that Gemini 3 Flash performs best.

I actually made it, so I'm not sure if it has credibility, but the tests are simply various (quite simple) questions, and the models are just run against them. I am also surprised Gemini 3 Flash does so well (note that only MEDIUM reasoning does exceptionally well).

When I look at the results, it does make sense, though. Higher models (like Gemini 3 Pro) tend to overthink, doubt themselves, and go with the wrong solution.

Claude usually fails in subtle ways, sometimes due to formatting or not respecting certain instructions.

Among the Chinese models, Qwen 3.5 Plus (Qwen3.5-397B-A17B) does extremely well. I actually started using it in an AI system for one of my clients, and today they sent me an email saying they were impressed with one response the AI gave to a customer, so it does translate to real-world usage.

I am not testing any specific thing; the categories there are just a hint as to what the tests are about.

I just added this page to maybe provide a bit more transparency, without divulging the tests: https://aibenchy.com/methodology/


What I'm most confused by is why they call it both GPT-5.3 Instant and gpt-5.3-chat.

Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...

Surprised they have a lot of text/numbers but no charts.

It also seems to do the same or slightly worse than gpt-5.2-chat on my benchmarks. No wonder they didn't show any benchmarks in their blog post (?)

https://aibenchy.com/compare/openai-gpt-5-2-chat-none/openai...


Gemini 3.1 Lite with no reasoning does better than GPT-5.3 with no reasoning?

https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...


Gemini is also far cheaper: total cost $0.256 (GPT-5.3) vs. $0.011 (Gemini).

Unless you set Gemini 3.1 Flash-Lite to HIGH, in which case it uses a crazy amount of tokens for reasoning (and also produces a worse result; maybe it's bugged on high?).

Seems like Gemini 3 Flash is still cheaper, unless I'm reading that website wrong.

From the benchmarks, I'm not sure exactly what use cases 3.1 Lite will be for.


Google kept promoting the "speed" of the model, so I guess it will be useful for some close-to-real-time use cases, maybe live chat/support (?)

> 3.1 Flash-Lite (reasoning)

(reasoning) doesn't say much. Is it low/med/high reasoning? I ran my own benchmarks, and 3.1 Flash-Lite on high costs A LOT: https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...

Do not use 3.1 Flash-Lite with HIGH reasoning: it reasons for almost the maximum output size, and you can quickly get to millions of reasoning tokens in just a few requests.
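To make the blowup concrete, here is a back-of-the-envelope sketch. The ~60k thinking tokens per request and the $0.50 per million-token price are illustrative assumptions, not measured figures for any real model:

```python
def reasoning_spend(requests: int, reasoning_tokens_each: int,
                    price_per_mtok: float) -> float:
    """Estimated reasoning-only cost in dollars for a batch of requests."""
    return requests * reasoning_tokens_each * price_per_mtok / 1_000_000

# Illustrative: if HIGH reasoning burns ~60k thinking tokens per request,
# then 20 requests already exceeds a million reasoning tokens.
tokens_total = 20 * 60_000
print(tokens_total)                       # 1200000 reasoning tokens
print(reasoning_spend(20, 60_000, 0.50))  # dollars, at an assumed $0.50/Mtok
```

Even at a low per-token price, reasoning that runs to the output-size limit dominates the bill long before the answers themselves do.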


Wow, that’s very interesting. I wish more benchmarks were reported along with the total cost of running that benchmark. Dollars per token is kind of useless for the reasons you mentioned.

Yup, MiniMax M-2.5 is a standout in that respect. Its $/token is very low because it reasons forever (fun fact: that's also the reason it's #1 on OpenRouter; it simply burns through tokens, and OpenRouter's ranking is based on token usage)...
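The $/token trap is easy to show with made-up numbers (none of these are MiniMax's actual prices or token counts; they just illustrate the shape of the problem):

```python
def total_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollars for one request, given token count and price per million tokens."""
    return tokens * price_per_mtok / 1_000_000

# A "cheap" model that reasons forever vs. a pricier model that answers tersely.
cheap_verbose = total_cost(40_000, 0.30)  # low $/token, but many tokens per task
pricey_terse = total_cost(1_500, 3.00)    # 10x the $/token, but few tokens
print(cheap_verbose, pricey_terse)        # the "cheap" model costs more per task
```

This is why reporting the total cost of running a benchmark is more informative than the per-token price alone.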


They said it's available in the API too, in the blog post.

EDIT:

> GPT‑5.3 Instant is available starting today to all users in ChatGPT, as well as to developers in the API as ‘gpt-5.3-chat-latest.’ Updates to Thinking and Pro will follow soon. GPT‑5.2 Instant will remain available for three months for paid users in the model picker under the Legacy Models section, after which it will be retired on June 3, 2026.


For coding, I agree, Codex-5.3 is the best out there.

But for chat, I feel like ChatGPT has gotten worse and worse.


Something very weird is going on; I just tried a free trial of Codex-5.3, and a significant fraction of what it gives me doesn't even compile (or, in the case of Python, run without crashing).

Unless I specifically say "use git", it won't bother using git; apparently saying "configure AGENTS.md to use best practices" isn't enough for it to (at least in this case) use git. If this were isolated, I might put it down to bad luck, given the nature of LLMs, but I have been finding Codex uses the wrong approaches all over the place. It also stops in the middle of tasks and skips some tasks entirely (sometimes while marking them as done, other times it just doesn't get around to them).

I'd rank the output of Claude as similar to a junior with 1-3 years of experience. It's not great, but it's certainly serviceable, and with a bit of tweaking even shippable. Codex… what I see is more like a student project, or perhaps someone in the first month of their first job. Even the absolute worst human developers I've worked with after university weren't as bad as Codex, though several of them I'd rank worse than Claude.


Do you have instructions.md and docs.md files?

Also, I noticed it skips instructions if I steer it with prompts while it is working, instead of queueing my instructions.


Are you using it on xhigh?

I have not observed meaningful quality differences between the default (medium) and extra high. What does make a difference is to turn the metaphorical lights back on and, instead of vibe-coding (as in, not even looking at the code), actually examine what it did at each step (either at the code level or via QA) before allowing it to proceed to the next step.

OpenAI's 5.3 Codex model on xhigh still makes a huge number of mistakes, somewhere between 25% and 50% of commits, and it's still terrible at making its own plan, estimating how long tasks will take to complete, and recognising which tasks need to be subdivided*. Claude's model last November was better on both counts; even though it still wasn't, IMO, ready for true lights-off-no-code-check-needed vibe-coding, it was making mistakes far less often and was scoping task complexity appropriately.

That said, given that xhigh seems to be going through my token allowance far, far slower than medium, I wouldn't be surprised if it turns out the Codex app itself is vibe-coded and has mis-mapped that setting in some weird way. Either that, or they've suddenly got a lot more spare capacity because of the boycotts.

* Given the METR study, in the planning phase I ask all these models (Codex and Claude) to break tasks down into things that would take a junior developer 1-2 hours, but Codex will estimate 60 minutes for everything from "write 19 lines including comments to stub 3 empty methods in a new class" to tasks I'd expect to take a senior 2 days.

