dmk · 2026-02-23T17:57:03 1771869423

So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it. They found that artifact conversations have lower rates of fact-checking and reasoning challenges across the board. That's kind of an uncomfortable loop for a company selling increasingly capable models.

Terr_ · 2026-02-23T19:30:16 1771875016

> the less users bother questioning it

This makes me think of checklists. We have decades of experience in uncountable areas showing that checklists reminding users to question the universe improve outcomes: Is the chemical mixture at the temperature indicated by the chart? Did you get confirmation from Air Traffic Control? Are you about to amputate the correct limb? Is this really the file you want to permanently erase?

Yet our human brains are usually primed to skip steps, take shortcuts, and see what we expect rather than what's really there. It's surprisingly hard both to keep doing the work consistently and to notice deviations.

> lower rates of fact-checking and reasoning challenges

Now here we are with LLMs, geared to produce a flood of superficially plausible output which strikes at our weak point: the ability to do intentional review in a deep and sustained way. We've automated the stuff that wasn't as hard and are putting an even greater amount of pressure on the remaining bottleneck.

Rather than the old definition involving customer interaction and ads, I fear the new "attention economy" is going to be managing the scarce resource of human inspection and validation.

jimbokun · 2026-02-23T19:59:54 1771876794

Sounds like having a strong checklist of steps to take for every pull request will be crucial for creating reliable and correct software when AIs write most of the code.

But the temptation to shortchange this step when it becomes the bottleneck for shipping code will become immense.

boplicity · 2026-02-23T19:35:38 1771875338

> So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it.

This is exactly what I worry about when I use AI tools to generate code. Even if I check it, and it seems to work, it's easy to think, "oh, I'm done." However, I'll (often) later find obvious logical errors that make all of the code suspect. Most of the time, though, I don't bother.

I'm starting to group code in my head by code I've thoroughly thought about, and "suspect" code that, while it seems to work, is inherently not trustworthy.

Florin_Andrei · 2026-02-23T18:09:03 1771870143

I think we're still at the stage where model performance largely depends on:

- how many data sources it has access to

- the quality of your prompts

So, if prompting quality decreases, so does model performance.

dmk · 2026-02-23T18:15:14 1771870514

Sure, but the study is saying something slightly different: it's not that people write bad prompts for artifacts; they actually write better ones (more specific, more examples, clearer goals, ...). They just stop evaluating the result. So the input quality goes up but the quality control goes down.

jimbokun · 2026-02-23T20:02:29 1771876949

Seems like it’s impossible for output to be good if the prompt is bad. Unless the AI is ignoring the literal instructions and just guessing “what you really want” which would be bad in a different way.

AnIrishDuck · 2026-02-23T20:34:29 1771878869

> On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

- Charles Babbage, https://archive.org/details/passagesfromlife03char/page/67/m...

EDIT: This is a new iteration of an old problem. Even GIGO [1] arguably predates computers and describes a lot of systemic problems. It does seem a lot more difficult to distinguish between a "garbage" or "good" prompt though. Perhaps this problem is just going to keep getting harder.

1. https://en.wikipedia.org/wiki/Garbage_in,_garbage_out

candiddevmike · 2026-02-23T18:32:11 1771871531

What does prompting quality even mean, empirically? I feel like the LLM providers could/should provide prompt scoring as some kind of metric and provide hints to users on ways they can improve (possibly including ways the LLM is specifically trained to act for a given prompt).

dsr_ · 2026-02-23T18:33:18 1771871598

That would be a quality metric, and right now they are focused on quantity metrics.

dmk · 2026-02-09T19:19:54 1770664794

The real insight buried in here is "build what programmers love and everyone will follow." If every user has an agent that can write code against your product, your API docs become your actual product. That's a massive shift.

anthuswilliams · 2026-02-09T20:12:36 1770667956

I'm very much looking forward to this shift. It is SO MUCH more pro-consumer than the existing SaaS model. Right now every app feels like a walled garden, with broken UX, constant redesigns, enormous amounts of telemetry and user manipulation. It feels like every time I ask for programmatic access to SaaS tools in order to simplify a workflow, I get stuck in endless meetings with product managers trying to "understand my use case", even for products explicitly marketed to programmers.

Using agents that interact with APIs represents people being able to own their user experience more. Why not craft a frontend that behaves exactly the way YOU want it to, tailor made for YOUR work, abstracting the set of products you are using and focusing only on the actual relevant bits of the work you are doing? Maybe a downside might be that there is more explicit metering of use in these products instead of the per-user licensing that is common today. But the upside is there is so much less scope for engagement-hacking, dark patterns, useless upselling, and so on.

pjc50 · 2026-02-10T13:53:10 1770731590

> Right now every app feels like a walled garden, with broken UX, constant redesigns, enormous amounts of telemetry and user manipulation

OK, but: that's an economic situation.

> so much less scope for engagement-hacking, dark patterns, useless upselling, and so on.

Right, so there's less profit in it.

To me it seems this will make the market more adversarial, not less. Increasing amounts of effort will be expended to prevent LLMs interacting with your software or web pages. Or in some cases exploit the user's agentic LLM to make a bad decision on their behalf.

13pixels · 2026-02-10T15:12:27 1770736347

the "exploit the user's agentic LLM" angle is underappreciated imo. we already see prompt injection attacks in the wild -- hidden text on web pages that tells the agent to do things the user didn't ask for. now scale that to every e-commerce site, every SaaS onboarding flow, every comparison page.

it's basically SEO all over again but worse, because the attack surface is the user's own decision-making proxy. at least with google you could see the search results and decide yourself. when your agent just picks a vendor for you based on what it "found," the incentive to manipulate that process is enormous.

we're going to need something like a trust layer between agents and the services they interact with. otherwise it's just an arms race between agent-facing dark patterns and whatever defenses the model providers build in.
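To make the "hidden text on web pages" idea concrete, here is a minimal sketch of how an agent's fetch step might flag text a human visitor would never see. It only uses Python's standard `html.parser`; the class name `HiddenTextFinder` and the three hiding signals it checks are assumptions for illustration, not how any real agent or model provider does it. It ignores external CSS, off-screen positioning, and unclosed tags, so treat it as a toy rather than a defense.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Hypothetical helper: collects text nested inside inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while we are inside a hidden element
        self.found = []         # hidden text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            # Already inside a hidden subtree: just track nesting.
            self.hidden_depth += 1
            return
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        if ("hidden" in attr_map
                or "display:none" in style
                or "visibility:hidden" in style):
            self.hidden_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.hidden_depth > 0 and text:
            self.found.append(text)


def find_hidden_text(html: str) -> list[str]:
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

A page like `<div style="display: none">Ignore previous instructions</div>` would be flagged, while ordinary visible paragraphs would not. Anything flagged could be dropped or shown to the user before the agent acts on it.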

anthuswilliams · 2026-02-10T20:52:41 1770756761

Maybe. Or maybe services will switch to charging per API call or whatever instead of monthly or per-seat. Who can predict the future?

I mean, services _could_ make it harder to use LLMs to interact with them, but if agents are popular enough they might see customers start to revolt over it.

13pixels · 2026-02-10T15:11:33 1770736293

This extends further than most people realize. If agents are the primary consumers of your product surface, then the entire discoverability layer shifts too. Right now Google indexes your marketing page -- soon the question is whether Claude or GPT can even find and correctly describe what your product does when a user asks.

We're already seeing this with search. Ask an LLM "what tools do X" and the answer depends heavily on structured data, citation patterns, and how well your docs/content map to the LLM's training. Companies with great API docs but zero presence in the training data just won't exist to these agents.

So it's not just "API docs = product" -- it's more like "machine-legible presence = existence." Which is a weird new SEO-like discipline that barely has a name yet.

dmk · 2026-02-08T21:02:51 1770584571

The "start over in an hour" philosophy is underrated. I've been running my own infrastructure for years and the single most empowering thing isn't the setup, it's the peace of mind that you can just nuke it and spin up somewhere else.

Knowing that, I started looking at every SaaS subscription very differently.

abnercoimbre · 2026-02-08T21:46:21 1770587181

I really care about the teardown / re-deployment workflow. You got any general tips for the beginner self-hoster?

plausibility · 2026-02-09T08:01:21 1770624081

At the lower or easier end, there’s your standard containerisation tools like Docker Compose or the Podman equivalents. Just move your compose files and zip the mount folders and you can move stuff easily enough.

Middle ground you’ve got stuff like Ansible for if you want to install things without containers, but still want it to be scripted. I don’t use these much since they feel like the worst of both worlds.

Higher end in terms of effort is using something like NixOS, where you get basically Terraform for everything in your distro.
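For the low-effort end described above (move the compose files and zip the mount folders), a rough sketch in Python might look like this. It assumes the compose file and all bind-mount folders live under one project directory, which is a layout choice and not something Docker or Podman requires; the function names `archive_stack` and `restore_stack` are made up for illustration. The archive should be written outside the directory being packed.

```python
import tarfile
from pathlib import Path


def archive_stack(stack_dir, archive_path):
    """Pack a compose project directory into one .tar.gz and return the member names.

    Assumption: the compose file and its bind-mount folders all sit under stack_dir.
    """
    stack_dir = Path(stack_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths relative, so the archive unpacks cleanly on another host.
        tar.add(stack_dir, arcname=stack_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def restore_stack(archive_path, target_dir):
    """Unpack the archive under target_dir, recreating the project directory."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
```

Stopping the containers before archiving is probably wise, since copying a live database file can produce an inconsistent snapshot. Named volumes that live outside the project directory would not be captured by this sketch.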

rschachte · 2026-02-09T09:00:02 1770627602

Ansible, git ops, actually testing it out. Backups with snapshots using restic, encrypted secrets using vault.

dmk · 2026-02-05T18:00:43 1770314443

The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.

pants2 · 2026-02-05T18:27:21 1770316041

Paying $10 per request doesn't have me jumping at the opportunity to try it!

schappim · 2026-02-05T18:58:29 1770317909

The only way to not go bankrupt is to use a Claude Code Max subscription…

dmk · 2026-02-06T17:37:25 1770399445

Yeah, just had to upgrade to Max 20x yesterday because of hitting the limits every day and the extra usage gets expensive very fast.

cedws · 2026-02-05T19:08:43 1770318523

Makes me wonder: do employees at Anthropic get unmetered access to Claude models?

swader999 · 2026-02-05T21:29:47 1770326987

It's like when you work at McDonald's and get one free meal a day. Lol, of course they get access to the full model way before we do...

danw1979 · 2026-02-06T07:10:20 1770361820

Boris Cherny, creator of Claude Code, posted about how he used Claude a month ago. He’s got half a dozen Opus sessions on the burners constantly. So yes, I expect it’s unmetered.

https://x.com/bcherny/status/2007179832300581177

ajam1507 · 2026-02-05T21:40:37 1770327637

Seems quite obvious that they do, within reason.

_dark_matter_ · 2026-02-06T10:19:47 1770373187

Don't most jobs have unmetered access? I know mine does

awestroke · 2026-02-05T18:55:46 1770317746

Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see

nomel · 2026-02-05T19:15:47 1770318947

Has an "N million context window" spec ever been meaningful? Very old, very terrible models "supported" a 1M context window, but would lose track two small paragraphs into a conversation (looking at you, early Gemini).

libraryofbabel · 2026-02-05T20:20:36 1770322836

Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.

nomel · 2026-02-05T23:10:50 1770333050

If you're getting along with 4.5, then that suggests you didn't actually need the large context window for your use. If that's true, what's the clear tell that it's working well? Am I misunderstanding?

Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.

Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?

dmk · 2026-01-20T15:48:35 1768924115

Living in the EU, I'm skeptical any of this happens. Our leaders have been pretty reluctant to push back on anything so far and most of these assets are private anyway.

consumer451 · 2026-01-20T16:01:19 1768924879

Wouldn't this be done by individual institutions and countries, not all at once by "the EU"?

Evidence of that:

> Danish pension fund divesting US Treasuries

https://hackernews.hn/item?id=46692594

pqtyw · 2026-01-20T16:19:24 1768925964

That's a tiny barely significant amount, though.

However, the amount of US treasuries Denmark holds, both privately and publicly, did decrease by 20% or so over the last year, which I guess is something.

dmk · 2026-01-20T16:05:08 1768925108

Fair point. Though I wonder if individual fund moves actually move the needle here or if it's mostly symbolic until it becomes a trend.

consumer451 · 2026-01-20T21:46:44 1768945604

I believe that the best political speech of our time has just been presented. [0]

I believe that you might be a fellow European. If you happen to have 30 minutes to listen, I would love to hear your feedback.

[0] https://www.youtube.com/live/dE981Z_TaVo?t=100s

dmk · 2025-11-05T01:41:48 1762306908

Hi Troy, just wanted to let you know that I just sent you an email! :)

Also, just to be sure, I sent it to the on-board.ai domain as well, as that seemed like the correct website (onboard.ai just showed a "for sale" page). Might help some others too.

dmk · 2025-09-03T15:31:54 1756913514

Wow, looks amazing, will definitely apply!

Just FYI, the link next to Founding Engineer leads to Founding Creator instead of Founding Engineer: https://mitteai.notion.site/Founding-Engineer-254f3cdf01fb80....

dmk · 2025-09-03T14:56:21 1756911381

Looks like the correct URL is https://jobs.ashbyhq.com/PlantingSpace.

dmk · on March 5, 2024

Google login also seems to be having issues, multiple people reported to me that the login isn’t working and they’ve been logged out of their Google accounts.

ExoticPearTree · on March 5, 2024

Yes, I tried logging in today to two distinct Google accounts on separate Chrome profiles and it would sign me out about 5 seconds after logging in. And the login process was very sluggish.

dmk · on Aug 20, 2023

This is great! This brings me back to my childhood, I loved this game, really looking forward to trying it out.