Humorist2290's comments

It's incredible that on the same device we can write pithy little comments, we can also play games, listen to music, read countless books, basically access the whole of the internet, and also empty out bank accounts. I'd wager most of those can be done on many people's devices in under a minute, with little more safeguard than a fingerprint.

To me this isn't some security flaw in Android that allows users to do something. It's a fundamental flaw in having most of the world's population forced into using a device whose software, firmware, and hardware are gatekept by a handful of monopolistic companies. They want all your eggs in one basket, and they'll hold the basket for you.

For many people these devices mediate nearly every interaction with the world. That's not some super fantastic responsibility resting on Google's shoulders, but a humanitarian catastrophe caused in part by (and of course handsomely profitable for) Google.


America's culture of individual liberty receded into national mythos over the last century, replaced by a culture of consumption and commerce above all. People don't have the freedom to build whatever they want because pockets need to be greased, permits need to be reviewed, HOAs need to collect their fees, etc.

At some point in the not so distant future, it seems entirely likely for the US to bail out OpenAI / Nvidia / etc using national security as justification. Democrats and Republicans really can get along as long as their donors get what they want. No matter how the regime changes in the coming years, the DoD will keep getting funding, and that funding will increasingly go to vendors who don't mind killing people.

Eisenhower warned of the military-industrial complex, and 60 years later it's eating everyone's lunch.


It needs to be said that your opinion on this is well understood and respected by the community, but also far from impartial. You have a clear vested interest in the success of _these_ tools.

There's a learning curve to any toolset, and it may be that using coding agents effectively takes more than a few weeks of upskilling. It may be, and likely will be, that people make whole careers out of being experts on this topic.

But it's still a statistical text prediction model, wrapped in fancy gimmicks, sold at a loss by mostly bad faith actors, and very far from its final form. People waiting to get on the bandwagon could well be waiting to pick up the pieces once it collapses.


I have a lot of respect for Simon and read a lot of his articles.

But I'm still seeing clear evidence it IS a statistical text prediction model. You ask it the right niche thing and it can only pump out a few variations of the same code, and it's clearly someone else's code stolen almost verbatim.

And I just use it 2 or 3 times a day.

How are SimonW and AntiRez not seeing the same thing?

How are they not seeing the propensity for both Claude + ChatGPT to spit out tons of completely pointless error handling code, making what should be a 5 line function a 50 line one?

How are they not seeing that you constantly have to nag it to use modern syntax? TypeScript, C#, Python, doesn't matter what you're writing in, it will regularly spit out code patterns that are 10 years out of date. And woe betide you if you're using a library that got updated in the last 2 years. It will constantly revert to old syntax over and over and over again.

I've also had to deal with a few of my colleagues using AI code on codebases they don't really understand. Wrong sort, id instead of timestamp. Wrong limit. Wrong JSON encoding, missing key converters. Wrong timezone on dates. A ton of subtle bugs that aren't obvious unless you intimately know the code, but that you'd have looked up yourself if you were writing it by hand.
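To make one of those concrete, here's a minimal sketch of the "id instead of timestamp" failure mode (all names hypothetical):

  // Rows are usually inserted in time order, so sorting by id looks right
  // in casual testing, but imported or backfilled rows break the assumption.
  interface LogEvent {
    id: number;       // auto-increment insert order
    createdAt: Date;  // actual event time
  }

  // What the AI wrote: passes a quick review, subtly wrong.
  const latestWrong = (events: LogEvent[]): LogEvent[] =>
    [...events].sort((a, b) => b.id - a.id).slice(0, 10);

  // What was intended: order by the timestamp, not the surrogate key.
  const latestRight = (events: LogEvent[]): LogEvent[] =>
    [...events]
      .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())
      .slice(0, 10);

Tests seeded in insertion order pass either version, which is exactly why it slips through review.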

And that's not even including the bit where the AI obviously decided to edit the wrong search function in a totally different part of the codebase that had nothing to do with what my colleague was doing. But didn't break anything or trigger any tests because it was wrapped in an impossible-to-hit if clause. And it created a bunch of extra classes to support this phantom code: hundreds of new lines just lurking there, doing nothing, but if I hadn't caught it everyone would have assumed they did something.


It's mostly a statistical text model, although the RL "reasoning" stuff added in the past 12 months makes that slightly less true - it now has extra tricks that bias it toward statistically predicting bits of code that are more likely to work.

The real unlock though is the coding agent harnesses. It doesn't matter any more if it statistically predicts junk code that doesn't compile, because it will see the compiler error and fix it. If you tell it "use red/green TDD" it will write the tests first, then spot when the code fails to pass them and fix that too.

> How are they not seeing the propensity for both Claude + ChatGPT to spit out tons of completely pointless error handling code, making what should be a 5 line function a 50 line one?

TDD helps there a lot - it makes it less likely the model will spit out lines of code that are never executed.
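As a minimal sketch of that loop (hypothetical example, assuming a vitest setup): the test is written first and fails (red), then only as much implementation as the test demands is written (green), so speculative error handling for inputs no test produces has nowhere to hide.

  import { test, expect } from "vitest";

  // Red: written first, fails until slugify exists and behaves.
  test("slugify lowercases and hyphenates", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });

  // Green: just enough code to pass, no defensive padding.
  function slugify(s: string): string {
    return s.trim().toLowerCase().replace(/\s+/g, "-");
  }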

> How are they not seeing that you constantly have to nag it to use modern syntax? TypeScript, C#, Python, doesn't matter what you're writing in, it will regularly spit out code patterns that are 10 years out of date.

I find that if I use it in a codebase with modern syntax it will stick to that syntax. A prompting trick I use a lot is "git clone org/repo into /tmp and look at that for inspiration" - that way even a fresh codebase will be able to follow some good conventions from the start.

Plus the moment I see it write code in a style I don't like I tell it what I like instead.

> And that's not even including the bit where the AI obviously decided to edit the wrong search function in a totally different part of the codebase that had nothing to do with what my colleague was doing.

I usually tell it which part of the codebase to work in - or if it decides itself I spot that and tell it that it did the wrong thing - or discard the session entirely and start again with a better prompt.


Ok, but given the level of detail you're supplying, at that point isn't it quicker to write the code yourself than it is to prompt?

Since you have to explain much of this in natural language, you end up writing far more words than the code itself, and less precisely, so it actually takes longer to type and leaves more ambiguity. And obviously, at the moment, ChatGPT tends to make assumptions without asking you; Claude is a little better at asking for clarification.

I find it so much faster to just ask Claude/ChatGPT for an example of what I'm trying to do and then cut/paste/modify it myself. So I just use them as SO on steroids: no agents, no automated coding. Give me the example, and I'll integrate it.

And the end code looks nothing like the supplied example.

I tried using AquaVoice (which is very good) to dictate to it, and that helped slightly, but I often found myself going so slowly fully prompting the AI that I would have already finished the new code myself by that point.

I was thinking about this last night, and I do wonder if this is another example of the difference between deep/narrow specialist/library code and shallow/wide enterprise/business code.

If you're writing specialist code (like AntiRez), it's dealing with one tight problem. If you're writing enterprise code, it has to take into account so many things, explaining it all to the AI takes forever. Things like use the correct settings from IUserContext, add to the audit in the right place, use the existing utility functions from folder X, add json converters for this data structure, always use this different date encoding because someone made a mistake 10 years ago, etc.

I get that some of these would end up in agents.md/claude.md, but as many people have complained, AI agents often rapidly forget those as the context grows, so you have to go through any generated code with a fine-tooth comb, or get it to generate a disproportionate number of tests, each of which you again have to explain.

I guess that will be fixed eventually. But from my perspective, as they're still changing so rapidly and much advice from even 6/9 months ago is now utterly wrong, why not just wait?

I, like many others in this thread, also believe that it's going to take about a week to get up to speed when they're finally ready. It's not that I can't use them now; it's that they're slow, unreliable, prone to acting like a junior on steroids, and they actually create more work in reviewing the code than if I'd just written it myself in the first place. And the code is much, much, much worse than MY code. Not necessarily worse than everyone else's I've worked with, but MY code is usually 50-90% more concise.


Enterprise code writer here.

> If you're writing enterprise code, it has to take into account so many things, explaining it all to the AI takes forever. Things like use the correct settings from IUserContext, add to the audit in the right place, use the existing utility functions from folder X, add json converters for this data structure, always use this different date encoding because someone made a mistake 10 years ago, etc.

The fix for this is... documentation. All of these need to be documented in a place that's accessible to the agent. That's it.
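For example, a few lines in the repo's claude.md/agents.md covering exactly the conventions listed above (contents hypothetical):

  # CLAUDE.md (hypothetical excerpt)
  - Read per-request settings from IUserContext, never from global config.
  - Every mutation must write an audit entry via the existing audit helpers.
  - Reuse the utility functions in src/common/ before writing new ones.
  - Register JSON key converters for any new serialized data structure.
  - Dates use the legacy encoding (a 10-year-old mistake we live with);
    do not switch to ISO 8601 without a migration.

Once it's written down, it pays off on every prompt instead of being re-explained each time.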

I've all but one-shotted UI features with Claude just by giving it a screenshot of the Figma design (couldn't be bothered with the MCP) and the ticket describing the feature.

It used our very custom front-end components correctly, used the correct testing library, wrote playwright tests and everything. Took me maybe 30 minutes from first prompt to PR.

If I (a backend programmer) had to do it, it would've taken me about a day of trying different things to see which one of the 42 different ways of doing it worked.


I talk about why that doesn't work in the line after the one you quoted. Everyone's having problems with context windows and CC/etc. rapidly forgetting instructions.

I'm fullstack; I use AI for FE too. They've been able to do the screenshot trick for over a year now. I know it's pretty good at making a page, but the code is usually rubbish, and you'll have a bunch of totally unnecessary useEffect, useMemo and styling in that page that it's picked up from its training data. Do you have any idea what all the useEffect() and useMemo() calls it's littered all over your new page actually do? I can guarantee almost all of them are wrong or unnecessary.

I'd use that page you one-shotted as a starting point; it's not production-grade code. The final thing will look nothing like it. It's good for solving the blank-page problem for me, though.


> Everyone's having problems with context windows and CC/etc. rapidly forgetting instructions.

I'm not having those problems at all... because I've developed a robust intuition for how to avoid them!


React is hard even for humans to understand :) In my case the LLM can actually make something that works, even if it's ugly and inefficient. I can't do even that; my brain just doesn't speak React, and all the overlapping effects and memos and whatever other magic just fry it.


That matches my experience with LLM-aided PRs - if you see a useEffect() with an obvious LLM line-comment above it, it's 95% going to be either unnecessary or buggy (e.g. too-broad dependencies which cause lots of unwanted recomputes).
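A minimal sketch of that too-broad-dependency pattern (component and endpoint hypothetical):

  import { useEffect, useState } from "react";

  function Results({ query }: { query: string }) {
    const [items, setItems] = useState<string[]>([]);
    const filters = { query, limit: 10 }; // new object identity every render

    useEffect(() => {
      fetch(`/api/search?q=${encodeURIComponent(filters.query)}`)
        .then((r) => r.json())
        .then(setItems);
    }, [filters]); // too broad: re-fires on every render; [query] is enough

    return <ul>{items.map((i) => <li key={i}>{i}</li>)}</ul>;
  }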


You can literally go look at some of antirez's PRs described in this article. They're not seeing it because it's not there.

Honestly, what you're describing sounds like the older models. If you are getting these sorts of results with Opus 4.5 or 5.2-codex on high I would be very curious to see your prompts/workflow.


People have been saying "Oh use glorp 3.835 and those problems don't happen anymore" for about 3 years at this point. It's always the fact you're not using the latest model that's the problem.


I agree. I've seen people insist that moving to a newer model or fine-tuning will make the output more clever, "trust me", sometimes without providing any evidence of before and after for the specific use case. One LLM project I saw released was pretty much useless, but it wasn't the use case or the architectural limitations that were the problem; no, the next thing on the roadmap was "fixing" it by plugging in a better LLM.


> You ask it the right niche thing and it can only pump out a few variations of the same code, and it's clearly someone else's code stolen almost verbatim.

There are only so many ways to express the same idea. Even clean room engineers write incidentally identical code to the source sometimes.


There was an example on here recently where an AI PR to an open source project literally had someone else's name in the code comments, and included their license.

That's the level of tell-tale sign that it's just stealing code and modifying a couple of variable names.

For me personally, the code I've seen might be written in a slightly weird style, or have strange additions that aren't applicable to the question.

They're so obviously not "clean room" code or incredibly generic; they're the opposite: incredibly specific.


How does he have a vested interest in the success of these tools? He doesn't work for an AI company. Why must he have some shady ulterior motive rather than just honestly believing the things he has stated? Yes, he blogs a lot about AI, but don't you have the cart profoundly before the horse if you assert that's a "vested interest"? He is free to blog about whatever he wants. Why would he fervently start blogging about AI if he didn't earnestly believe it was an interesting topic to blog about?

> But it's still a statistical text prediction model

This is reductive to the point of absurdity. What other statistical text prediction model can make tool calls to CLI apps and web searches? It's like saying "a computer is nothing special -- it's just a bunch of wires stuck together"


> Why must he have some shady ulterior motive rather than just honestly believing the things he has stated?

I wouldn't say it's shady or even untoward. Simon writes prolifically and he seems quite genuinely interested in this. That he has attached his public persona, and what seems like basically all of his time from the last few years, to LLMs and their derivatives is still a vested interest. I wouldn't even say that's bad. Passion about technology is what drives many of us. But it still needs saying.

> This is reductive to the point of absurdity. What other statistical text prediction model can make tool calls to CLI apps and web searches?

It's just a fact that these things are statistical text prediction models. Sure, they're marvels, but they're not deterministic, nor are they reliable. They are like a slot machine with surprisingly good odds: pull the lever and you're almost guaranteed to get something, maybe a jackpot, maybe you'll lose those tokens. For many people it's cheap enough to just keep pulling the lever until they get what they want, or go bankrupt.


Fun. I don't agree that Claude Code is the real unlock, but mostly because I'm comfortable doing this myself. That said, the spirit of the article is spot on. Running _good_ web services has never been more accessible. If you have a modest budget and an interest, that's enough -- the skill gap is closing. That's good news, I think.

But Tailscale is the real unlock in my opinion. Having a slot machine cosplaying as sysadmin is cool, but being able to access services securely from anywhere makes them legitimately usable for daily life. It means your services can be used by friends/family if they can get past an app install and login.

I also take minor issue with running Vaultwarden in this setup. Password managers are maximally sensitive and hosting that data is not as banal as hosting Plex. Personally, I would want Vaultwarden on something properly isolated and locked down.


I believe Vaultwarden keeps data encrypted at rest with your master key, so some of the problems inherent to hosting such data can be mitigated.


I can believe this, and it's a good point. I believe Bitwarden does the same. I'm not against Vaultwarden in particular but against colocation of highly sensitive (especially orthogonally sensitive) data in general. It's part of a self-hoster's journey I think: backups, isolation, security, redundancy, energy optimization, etc. are all topics which can easily occupy your free time. When your partner asks whether your photos are more secure in Immich than Google, it can lead to an interesting discussion of nuances.

That said, I'm not sure if Bitwarden is the answer either. There is certainly some value in obscurity, but I think they have a better infosec budget than I do.


So the AI Village folks put together a bunch of LLMs and a basically unrestricted computer environment, told them to "raise money" and "do random acts of kindness", and let them cook. It's a technological marvel, it's a moral dilemma, and it's an example of the "altruistic" applications for this technology. Many of us can imagine the far less noble applications.

But Rob Pike's reaction is personal, and many readers here get why. The AI Village folks burned who knows how much cash to essentially generate well-wishing spam. For much less, and with higher efficacy, they could've just written the emails themselves.


I'd bet many of the founders would've been amazed at the technology and insisted on wide-scale adoption. It could've further cemented the power of slaveholders over their slaves. It could've helped track the movements of native groups. It could've helped root out loyalists still dangerous to American independence.


One thing that especially interests me about these prompt-injection-based attacks is their reproducibility. With a specific version of some firmware it is possible to give reproducible steps to identify the vulnerability, and by extension to demonstrate that it's actually fixed when those same steps fail to reproduce. But with these statistical models, a system card that injects 32 random bits at the beginning is enough to ruin any guarantee of reproducibility. With self-hosted models, sure, you can hash the weights or something, but with Gemini (etc.) Google (et al.) has a vested interest in preventing security researchers from reproducing their findings.
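For the self-hosted case, pinning the exact artifact is at least mechanically easy; a minimal sketch using Node's built-in crypto (file name hypothetical):

  import { createHash } from "node:crypto";
  import { createReadStream } from "node:fs";

  // Fingerprint a local weights file so a prompt-injection repro can name
  // the exact model artifact it ran against. Hosted APIs give researchers
  // no equivalent handle to pin.
  function hashWeights(path: string): Promise<string> {
    return new Promise((resolve, reject) => {
      const sha = createHash("sha256");
      createReadStream(path)
        .on("data", (chunk) => sha.update(chunk))
        .on("end", () => resolve(sha.digest("hex")))
        .on("error", reject);
    });
  }

  // hashWeights("./model.safetensors").then(console.log);

(Sampling temperature and server-side changes still break bit-for-bit reproducibility, but at least the model itself is pinned.)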

Also rereading the article, I cannot put down the irony that it seems to use a very similar style sheet to Google Cloud Platform's documentation.


> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.

From my experience we just get both. The constant risk of some catastrophic hallucination buried in the output, in addition to more subtle, and pervasive, concerns. I haven't tried with Gemini 3 but when I prompted Claude to write a 20 page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14 page research paper would stand up to scrutiny.


I feel like hallucinations have changed over time, from factual errors randomly shoehorned into the middle of sentences to the LLMs confidently telling you they are right, even providing their own reasoning to back up their claims with references that, most of the time, don't exist.


I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using the framework. It spit out some great-looking code but sadly it completely made up an entire stack of functionality that the framework doesn't support.

The conventions even matched the rest of the framework, so it looked kosher, and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.

When I pointed that out, Claude quickly went down a rabbit-hole of writing some very bad code and trying to do some very unconventional things (modifying configuration code in a different part of the project that was not needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and trying to rush toward an acceptable answer.


I've noticed the new OpenAI models contradict themselves a lot more than I've ever seen before! Things like:

- Aha, the error clearly lies in X, because ... so X is fine, the real error is in Y ... so Y is working perfectly. The smoking gun: Z ...

- While you can do A, in practice it is almost never a good idea because ... which is why it's always best to do A


I've seen it do this too. I had it keeping a running tally over many turns, and occasionally it would say something like: "... bringing the total to 304.. 306, no 303. Haha, just kidding, I know it's really 310." With the last number being the right one. I'm curious whether it's an organic behavior or a taught one. It could be self-learned through reinforcement learning, a way to correct itself since it doesn't have access to a backspace key.


Yeah.

I worked with Grok 4.1 and it was awesome until it wasn't.

It told me to build something, only to tell me in the end that I could have done it smaller and cheaper.

And that happened multiple times.

The best reply was the one that ended with something along the lines of "I've built dozens of them!"


I like when they tell you they’ve personally confirmed a fact in a conversation or something.


I got a 3000 word story. Kind of bland, but good enough for cheating in high school.

See prompt, and my follow-up prompts instructing it to check for continuity errors and fix them:

https://pastebin.com/qqb7Fxff

It took me longer to read and verify the story (10 minutes) than to write the prompts.

I got illustrations too. Not great, but serviceable. Image generation costs more compute to iterate and correct errors.


Disappointingly, that is an exceedingly good story for a high school assignment. The use of an appositive phrase alone would raise alarm bells though.

It's nitpicking for flaws, but why not -- what lens on an old DSLR, older than a car, will let you take a macro shot, a wide shot, and a zoom shot of a bird?

In any case I'm not surprised. It's a short story, and it is indeed _serviceable_, but literature is more than just service to an assignment.


It is probably a reference to the report mentioned in this article from September https://reclaimthenet.org/germany-chat-control-false-reports...

  According to the Federal Criminal Police Office (BKA), 99,375 of the 205,728 reports forwarded by the US-based National Center for Missing and Exploited Children (NCMEC) were not criminally relevant, an error rate of 48.3%. This is a rise from 2023, when the number of false positives already stood at 90,950.
Indeed, a 50% false positive rate sounds surprisingly good, but this is under the "voluntary scheme" where Meta/Google/MS etc. are not obligated to report. Notably missing from the article is the total number of scanned messages it took to get down to 200k reports. To my knowledge, since it's voluntary, they can also report only the very highest-confidence detections. If the Danish regime were to impose reporting quotas, the total number of reports would rise. And of course -- these are reports, not actual convictions.

Presumably the actual number of criminals caught by this would remain constant, so the FP rate would increase. Unless of course, the definition of criminal expands to keep the FP rate low...
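A quick back-of-the-envelope of that effect, using the BKA numbers quoted above (the doubled-quota scenario is purely hypothetical):

  const reports = 205_728;        // NCMEC reports forwarded, per the BKA
  const falsePositives = 99_375;  // "not criminally relevant"
  const truePositives = reports - falsePositives;           // 106,353
  console.log(falsePositives / reports);                    // ~0.483

  // Hypothetical quota: reports double, real cases stay constant.
  const quotaReports = reports * 2;                         // 411,456
  const quotaFalsePositives = quotaReports - truePositives; // 305,103
  console.log(quotaFalsePositives / quotaReports);          // ~0.742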


I feel this is a good place to add something...

I recall, half a decade back, discussion of the quit rate of employees (at Facebook, maybe?) due to literal mental trauma from having to look at and validate images flagged as pedophilic.

Understand: there is pedophilia, and then there's horribly violent, next-level abusive pedophilia.

I used to work in a department where, adjacently, the RCMP were doing the same. They couldn't handle it, and were constantly resigning. The violence associated with some of the videos and images is what really got them.

The worst part is, the more empathetic you are, the more it hurts to work in this area.

It seems to me that without this sad and damaging problem fixed, monitoring chats won't help much.

How many good people will we leave laden with trauma, literally waking up screaming at night? It's why the RCMP officers were resigning.

I can't imagine being a jury member at such a case.


Because of this issue, many departments put in much stricter protocols for dealing with this kind of material. Only certain people would be exposed to classify/tag it, and those people would only hold that post for a limited period of time. The burden on those people doesn't change, but it can be diluted to mitigate it somewhat.

It's a real and sad problem, but not one that I think can be fixed with technology. Too much is on the line to allow a false positive from a hallucinating robot to destroy a person's life.


I read about that here: https://erinkissane.com/meta-in-myanmar-part-i-the-setup

This remains one of the best things I've found on HN.


OK, 50% "not criminally relevant".

How many of the other 50% were guilty and how many innocent after an investigation?

