Hacker News | nullbyte's comments

82.7% on Terminal Bench is crazy

Is it? There are 5 other models near ~80% and it was achieved in March... which in AI-world seems like a century ago.

https://www.tbench.ai/leaderboard/terminal-bench/2.0


Those are not verified. I've tried forgecode, and I cannot believe they didn't do something to influence the benchmarks.

Yup, they were found to be sneaking the answer key using agents.md

https://debugml.github.io/cheating-agents/#sneaking-the-answ...


I guess it depends if you are working on something important to national security. Especially corporate codebases, etc.

Does that include OpenCode? That's what I care about most and it's the primary reason I've been sticking with OAI the past few months.

Anthropic doesn't, but Google and OAI both release open source models. Just not 1T parameter ones.

Exactly, they release cool consumer stuff, but they aren't releasing anything close to the performance of the best open-weight Chinese models. They basically compete in the "fun running at home doing basic stuff" scene. (Except gpt-oss-120b by OpenAI, but it's been ages since then.)

That sentence is giving OpenAI way more credit than they are due.

They released a single open model after being goaded by the community, because everyone except "Open"AI was multiple generations into open releases.

We haven't heard a word since. I wouldn't be surprised if it takes them another 6 years to release their next one.


Last paragraph made me chuckle


npm security team has removed the offending package: https://github.com/axios/axios/issues/10604#issuecomment-415...

new installs should be safe now


What a brilliant idea! Is this all done locally? That's incredible.


While the vector store is local, it is sending the data to Gemini's API for embedding. (Which if using a paid API key is probably fine for most use cases, no long term retention/training etc.)


works completely locally with a decent model: https://github.com/jakejimenez/sentinelsearch


Made a proof of concept; honestly, it worked fairly well: https://github.com/jakejimenez/sentinelsearch


I am curious how the TPS compares vs default OS virtual memory paging


I always enjoy reading Anthropic's blogposts, they often have great articles


They did something to Settings after macOS Monterey that made it very slow. I miss the snappiness of the old app!


I don't know for a fact, but I'd bet a few digits of cold hard cash it's a SwiftUI rewrite that is to blame. (Anyone in the know want to chime in?)

And yeah, it's terrible. Apple doesn't make good apps anymore.

(This is part of why I think Electron does so well -- it's not as good as a really good native app [e.g. Sublime Text], but it's way better than the sort of default whatever you'll get doing native. You get a lot of niceness that's built into the web stack.)


Well, perhaps it has something to do with the fact that they started using webviews for stuff like system UI: https://blog.jim-nielsen.com/2022/inspecting-web-views-in-ma...

