
For years the family joke has been that when we take a family picture we need to take 5 because I will inevitably have my eyes closed in the first 4.

So I made an iOS app just for us that does face detection and won't take the actual photo until all of the detected faces have their eyes open.
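The gating logic is simple in concept. Here's a minimal sketch (in TypeScript for illustration; the actual app is iOS, where the platform face detector exposes per-eye closed flags, and detectFaces/capturePhoto below are hypothetical stand-ins for the camera/vision APIs):

    // Hypothetical stand-ins for the platform camera/vision APIs.
    interface Face { leftEyeOpen: boolean; rightEyeOpen: boolean }
    declare function detectFaces(frame: ImageBitmap): Promise<Face[]>;
    declare function capturePhoto(): Promise<Blob>;

    // Watch the preview stream and only trigger the real capture on a frame
    // where at least one face is found and every face has both eyes open.
    async function captureWhenAllEyesOpen(
      frames: AsyncIterable<ImageBitmap>,
    ): Promise<Blob> {
      for await (const frame of frames) {
        const faces = await detectFaces(frame);
        if (faces.length > 0 && faces.every(f => f.leftEyeOpen && f.rightEyeOpen)) {
          return capturePhoto();
        }
      }
      throw new Error("stream ended before a clean frame");
    }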


I know a few people who are more or less prototyping little agents with it that monitor stuff for them and make some kind of discernment/decision about whether to alert them.

Nothing mission-critical in any sense.


This was failing miserably for me on a new install. I guess the reason may have had less to do with my skill and more to do with being on a broken version (or some mix of the two).

To be clear, I tried to install it and it was a shitshow.

I ran a fairly large production test of this, and on _every_ measure except for privacy it was worse than a free-tier server-hosted LLM.

Not happy about that, as I would like to see more local models, but that's the current state of things.

https://sendcheckit.com/blog/ai-powered-subject-line-alterna...


> on _every_ measure except for privacy it was worse than a free-tier server-hosted LLM

Would you be able to compare this to other local models in its class and above that would fit on consumer-grade hardware?


I did a quick eval comparing Grok 4.3, Opus 4.7, and GPT 4.1, and they actually seem pretty similar:

https://ofw640g9re.evvl.io/

They all did pretty well at a more "formal" tone, but GPT 4.1 was the only one that didn't make me cringe with a "casual" tone.

[edit] FWIW, Grok was also the fastest and cheapest model; Claude was the slowest and priciest.
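For anyone who wants to reproduce this kind of quick eval, here's a minimal sketch (not my exact harness; it assumes an OpenAI-compatible endpoint such as OpenRouter's, and the model slugs are illustrative stand-ins, not verified identifiers):

    // Quick-and-dirty tone eval: same message to each model, eyeball the outputs.
    // Runs on Node 18+ (global fetch); assumes OPENROUTER_API_KEY is set.
    const MODELS = ["x-ai/grok-4.3", "anthropic/claude-opus-4.7", "openai/gpt-4.1"];

    async function rewrite(model: string, tone: string, message: string) {
      const started = performance.now();
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        },
        body: JSON.stringify({
          model,
          messages: [
            { role: "system", content: `Rewrite the user's message in a ${tone} tone.` },
            { role: "user", content: message },
          ],
        }),
      });
      const data = await res.json();
      return {
        model,
        ms: Math.round(performance.now() - started), // crude single-sample latency
        text: data.choices[0].message.content as string,
      };
    }

    async function main() {
      const input = "heads up, the API integration is running behind, need another week";
      for (const m of MODELS) console.log(await rewrite(m, "casual", input));
    }
    main();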


This is the most basic level of eval: whether they can produce output that someone somewhere (usually a young urban US American) would consider informal in tone. Real human communication is far more nuanced than this. Different groups are used to different linguistic registers, and things outside them sound odd even if people can't articulate why. You could also want to be informal but not over-familiar with the other person (for eg. in a Discord chat with a new acquaintance) - actually, looking at the outputs here, the Claude output seems a better fit for that (in my subjective view anyway) than for the one you gave it - or want many other little variations.

What makes one output feel cringeworthy and another feel familiar and comfortable is also pretty subtle and hard to define. These things need nuanced descriptions and examples to actually get right, and it's in understanding those nuances and figuring out the register of the examples that Grok outshines the others.


you said that English is not your first language, so heads up - you don't need "for" when you use "e.g."; it already means "for example".

You presumably do have English as a first language, so you should know that sentences begin with capital letters.

Was that a helpful and interesting conversation?


That's Grok 4.2, not 4.3, right?

And why are you comparing to gpt-4.1? (As opposed to one of the ~6 model releases since then - I would have expected GPT 5.5.)


Good catch - there was an issue with the second-hardest thing in programming (caching).

Here's an updated eval with the proper models https://a3bmfqfom3.evvl.io/


Claude 4.7 is the clear winner to me for manager and formal report updates.

As an ex-senior exec (hundreds of staff), I see the bolded timeline impact as a particular nuance that I would expect a Lead/Director to format for a VP+ audience. Interesting that none of the other models did that. My eyes immediately went to the impact statement, then worked back to the context to grasp the whole situation.


Thanks. From where I'm looking, Grok 4.3 and Claude 4.7 do a better job on the informal close-friend/coworker vibe.

ChatGPT's phrasing sounds fake/formal (for the specific close-friend context), and it has em-dashes and uses capitalization. Hence, ChatGPT does not, imo, grok the assignment ;)


Is it just me, or did GPT get noticeably more natural in word choice recently? You can see it between 4.1 and 5.5 here, but I'm not sure when that happened. (My guess would be one of the recent 5.x releases.)

Edit: I meant specifically the absence of bizarre phrasing. That seems to have improved.


Wow, I'm surprised. Grok 4.3 actually is noticeably better than the other two for the close-friend variant. Surprisingly, I found Claude the cringiest of the three!

I know it's just an evaluation, but seeing an informal message and a prompt asking to rewrite that informal message in the tone of an "informal message", when the original already sounds just fine, makes me sad... Not because of this evaluation, but because it reminds me that this is how some people use LLMs: basically asking them to remove your own voice from texts that are generally fine already.

My sister-in-law is a pharmacist and the heaviest non-dev ChatGPT user I know. Her main use case is writing professionally polite messages to doctors on how the drugs they prescribed to a patient would have killed them had she not caught a particular interaction or common side effect.

There's a lot of "tone" in it: she's not trying to anger these folks, but it's also quite serious, and then there's everything else happening in medicine.

Feels like a great use.


Pretty neat. This kind of tone self-moderation comes naturally to good communicators, but I know people (on and off the spectrum) who really, really need help with this, and it's cool to see LLMs are able to do this. There are a surprising number of people in the business world who are just totally unable to tone-police themselves. In the medical field I'd be worried about hallucinations, of course, but presumably your SIL fact-checks the output.

She does herself a disservice by outsourcing that skill. One day she might have to actually talk to one of these people.

She's 50 years old, has a doctorate in pharmacy, and has worked as a hospital pharmacist for two decades.

I don't say this as a "gotcha", but more to note that even with all that experience she still finds it beneficial and helpful.


That makes it more sad, to me. Someone with those credentials should be able to communicate with their colleagues effectively. I wonder if she used to be able to.

It appears Hacker News disagrees that social skills are valuable skills. Mea culpa, I should have guessed.


There's something ironic about complaining about other people's social skills while you couldn't be bothered to make a point without sounding dismissive and condescending.

Navigating tough conversations takes time, attention, and mental energy. I’d rather a pharmacist spend that time on catching another dangerous contraindicated combo of drugs for a different patient. Actually, AI should soon be checking for that, too.

All three did well, and while I'm a Claude user, I found the Opus reply here added some unnecessary detail, like "Impact: Minimal; no downstream dependencies are currently at risk". Downstream dependencies weren't mentioned in the original message; for all we know, downstream could be relying on a poorly performing API and is impacted by waiting another week for the replacement.

Seeing this makes me wonder if Grok uses Claude conversations for training.

It's otherwise kind of surprising that they both converge on very similar phrases (e.g. "API integration is kicking my ass") that aren't anywhere in the prompt.


Elon testified this week that SpaceTwitter is indeed distilling from OpenAI and others.

All of these were frankly terrible. I guess Grok’s “informal” version sounded the most like a real human, but only because it reads exactly like an Elon tweet (including his favorite emoji!). It’s obvious what they’ve been training on.

GPT 4.1? Why not a 5-class model?

This feels like a restatement of the idea that for any given endeavor AI raises the floor of quality but doesn't push the ceiling.

My reading of the article is that it claims the ceiling is lowered, especially in the longer term.

FWIW - I did a fairly large comparison of Gemini Nano (the in-browser AI model) vs a comparable free hosted Gemma model (from OpenRouter), and the hosted model absolutely trashed the local model on every aspect: speed, reliability, availability, etc. [1]

I'm not particularly happy about that outcome as I wish we had more locally run AI models for reasons of privacy and efficiency, so this is more just a warning that at present there are some severe tradeoffs.

1 - https://sendcheckit.com/blog/ai-powered-subject-line-alterna...


Hey, Chrome PM for built-in AI here.

Thanks for the write-up and the comparison, but more importantly for using the API in production!

You’re highlighting the "state of the art" gap we’re working to close. Cloud models will always have the advantage of massive parameter counts, but our bet is that for a huge class of simpler or high-volume tasks, the upsides of on-device (e.g. zero-cost, permission-less start with no quotas/infra, network-resilience, privacy) make it a compelling trade-off.

The models have been getting better at a rapid clip, and the team is heads-down on optimizing performance and reliability. To that end, we're always grateful for feedback. If you hit specific bugs, crashes, or quality regressions, filing a report with repro steps is the best way to help us improve. You can file those on crbug.com under the "Chromium > Blink > AI" component.
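If you want to kick the tires on that trade-off yourself, here's a minimal sketch of gating on the on-device model with a hosted fallback. Caveat: the Prompt API surface has shifted across Chrome releases and the types aren't in lib.dom.d.ts yet, so treat the shapes below as illustrative and check the current docs:

    // Illustrative typings; the real API surface may differ by Chrome version.
    declare const LanguageModel: {
      availability(): Promise<"unavailable" | "downloadable" | "downloading" | "available">;
      create(): Promise<{ prompt(input: string): Promise<string> }>;
    };
    // Hypothetical fallback to a server-hosted model.
    declare function hostedSubjectLine(email: string): Promise<string>;

    async function suggestSubjectLine(email: string): Promise<string> {
      if ((await LanguageModel.availability()) === "available") {
        // On-device: zero marginal cost, no quotas, and the text never leaves the browser.
        const session = await LanguageModel.create();
        return session.prompt(`Suggest one email subject line for:\n\n${email}`);
      }
      return hostedSubjectLine(email);
    }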


Maybe check out TRMNL, they've got a Home Assistant plugin.


From the article: "Polar bears are still sadly expected to go extinct this century, with two-thirds of the population gone by 2050,"


There's a push and pull here: TypeScript + React + Vercel are also very amenable to LLM-driven development due to a mix of the popularity of examples in LLM training data, how cheap the deployment is, and how quick the ecosystem is to get going.


Memory makers make capital investments (building new factories, converting physical production lines, etc.) to meet orders that have been placed for the next ~5 years.

OpenAI (or whoever) crashes and can't pay for the order, leaving the memory makers in a tough spot.


> leaving the memory makers in a tough spot

Oh noes! Think of the poor memory makers!

The amount of money flowing in, both from the AI bubble and from quite literally scalping both the server and consumer markets... They gambled on the opportunity, and if they fail, it's their problem.


Exactly, that's why they are not building more capacity and that's why RAM prices will stay up for years.


And how is that a problem, and more importantly, how is that a problem for the average Joe?

Capitalists made their gamble. If they fail at it, what forbids them from selling the regular RAM they made for the AI bubbleists to regular consumers? Besides HBM, it's just regular chips, which are exactly the same for the consumer and server markets, so why would it be any different?

