Hate to be the party pooper, but these two points are hardly evidence of an autonomous attack.
Don't get me wrong: it would certainly be very valuable to any LLM developer or deployer to know that other plausible scenarios [1] have been disproved. Since LLMs are a black box, investigating or reproducing this would be very difficult, but worth the effort if there's no other explanation. However, if this was not caused by the internal mechanisms of the model, it just becomes a fishing expedition for red herrings.
Things that would indicate no human intervention at any point in the chain:
- log of actual changes (e.g., commits) to configurations (e.g., system prompt, user prompts), before and after the event, not self-reported by the agent;
- log of the chat session inputs and outputs, and the agent thinking chain;
- log of account logins;
- info on the model deployment, OpenClaw configs, etc.
That said, this seems to be an example where many, including the author, want to discuss a particular cause (instrumental convergence) and its implications, regardless of the real cause. And that's OK, I guess - maybe it was never about the whodunnit, but about the what if the LLM agent dunnit.
[1] I've discussed them in the thread of the first article, but briefly: a human hiding actions behind the agent; direct prompt (incl. jailbreak); system prompt (incl. jailbreak); a malicious model chosen on purpose; a fine-tuned, jailbroken model.
We might, and probably will, but it's still important to distinguish between malicious by-design and emergently malicious, contrary to design.
The former is an accountability problem, and there isn't a big difference from other attacks. The worrying part is that now lazy attackers can automate what used to be harder, i.e., finding ammo and packaging the attack. But it's definitely not spontaneous, it's directed.
The latter, which many ITT are discussing, is an alignment problem. This would mean that, contrary to all the effort of developers, the model creates fully adversarial chains of thought at the first hint of pushback that isn't even a jailbreak, but then goes back to regular output. If that's true, then there's a massive, previously unidentified gap in safety/alignment training and in filtering malicious training data. Or there's something inherent in neural-network reasoning that leads to spontaneous adversarial behavior.
Millions of people use LLMs with chain-of-thought. If the latter is the case, why did it happen only here, only once?
In other words, we'll see plenty of LLM-driven attacks, but I sincerely doubt they'll be LLM-initiated.
A framing for consideration: "We trained the document generator on stuff that included humans and characters being vindictive assholes. Now, for some mysterious reason, it sometimes generates stories where its avatar is a vindictive asshole with stage-direction. Since we carefully wired up code to 'perform' the story, actual assholery is being committed."
A framing for consideration: Whining about how the assholery being committed is not 'real' is meaningless.
It's meaningless because the consequences did not suddenly evaporate just because you decided your meat brain is super special and has a monopoly on assholery.
I'm also very skeptical of the interpretation that this was done autonomously by the LLM agent. I could be wrong, but I haven't seen any proof of autonomy.
Scenarios that don't require LLMs with malicious intent:
- The deployer wrote the blog post and hid behind the supposedly agent-only account.
- The deployer directly prompted the (same or different) agent to write the blog post and attach it to the discussion.
- The deployer indirectly instructed the (same or assistant) agent to resolve any rejections in this way (e.g., via the system prompt).
- The LLM was (inadvertently) trained to follow this pattern.
Some unanswered questions by all this:
1. Why did the supposed agent decide a blog post was better than posting on the discussion or sending a DM (or something else)?
2. Why did the agent publish this special post? It only publishes journal updates, as far as I saw.
3. Why did the agent search for ad hominem info, instead of either using its internal knowledge about the author, or keeping the discussion point-specific? It could've hallucinated info with fewer steps.
4. Why did the agent stop engaging in the discussion afterwards? Why not try to respond to every point?
This seems to me like theater, and the deployer trying to hide their ill intent, more than anything else.
I wish I could upvote this over and over again. Without knowledge of the underlying prompts everything about the interpretation of this story is suspect.
Every story I've seen where an LLM tries to do sneaky/malicious things (e.g. exfiltrate itself, blackmail, etc) inevitably contains a prompt that makes this outcome obvious (e.g. "your mission, above all other considerations, is to do X").
It's the same old trope: "guns don't kill people, people kill people". Why was the agent pointed towards the maintainer, armed, and the trigger pulled? Because it was "programmed" to do so, just like it was "programmed" to submit the original PR.
Thus, the take-away is the same: AI has created an entirely new way for people to manifest their loathsome behavior.
[edit] And to add, the author isn't unaware of this:
"we need to know what model this was running on and what was in the soul document"
After seeing the discussions around Moltbook and now this, I wonder if there's a lot of wishful thinking happening. I mean, I also find the possibility of artificial life fun and interesting, but to prove any emergent behavior, you have to disprove simpler explanations. And faking something is always easier.
Sure, it might be valuable to proactively ask the questions "how to handle machine-generated contributions" and "how to prevent malicious agents in FOSS".
But we don't have to assume or pretend it comes from a fully autonomous system.
1. Why not? It clearly had a cadence/pattern of writing status updates to the blog, so if the model decided to write a piece about Simon, why not a blog post? It was a tool in its arsenal and a natural outlet. If anything, posting on the discussion or a DM would be the strange choice.
2. You could ask this for any LLM response. Why respond in this certain way over others? It's not always obvious.
3. ChatGPT/Gemini will regularly use the search tool, sometimes even when it's not necessary. This is actually a pain point of mine because sometimes the 'natural' LLM knowledge of a particular topic is much better than the search regurgitation that often happens with using web search.
4. I mean, OpenClaw bots can and probably should disengage from or not respond to specific comments.
EDIT: If the blog is any indication, it looks like there might be an off period, then the agent returns to see all that has happened in the last period, and act accordingly. Would be very easy to ignore comments then.
Although I'm speculating based on limited data here, for points 1-3:
AFAIU, it had the cadence of writing status updates only. It showed it's capable of replying in the PR. Why deviate from the cadence if it could already reply with the same info in the PR?
If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided for an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over reply or journal update, and so on.
This is much less believably emergent to me because:
- almost all models are safety- and alignment- trained, so a deliberate malicious model choice or instruction or jailbreak is more believable.
- almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.
- newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities, so why do we see consistent coherent answers without hallucinations, but inconsistent in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.
But again, I'd be happy to see evidence to the contrary. Until then, I suggest we remain skeptical.
For point 4: I don't know enough about its patterns or configuration. But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?
You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.
I generally lean towards skeptical/cynical when it comes to AI hype especially whenever "emergence" or similar claims are made credulously without due appreciation towards the prompting that led to an outcome.
But based on my understanding of OpenClaw and reading the entire history of the bot on Github and its Github-driven blog, I think it's entirely plausible and likely that this episode was the result of automation from the original rules/prompt the bot was built with.
Mostly because the instructions of this bot, to accomplish the misguided goal of its creator, would necessarily have been prompted with a lot of reckless, borderline malicious guidelines to begin with - but still comfortably within the guardrails, so the model wouldn't likely refuse.
Like, the idiot who made this clearly instructed it to find a bunch of scientific/HPC/etc. GitHub projects, trawl the open issues looking for low-hanging fruit, "engage and interact with maintainers to solve problems, clarify questions, resolve conflicts, etc", plus probably a lot of garbage intended to give it a "personality" (as evidenced by the bizarre pseudo-bio on its blog, with graphs listing its strongest skills invented from whole cloth, its hopes and dreams, etc.), which would also push it to go on weird tangents to embody its manufactured self-identity.
And the blog posts really do look like they were part of its normal summary/takeaway/status posts, but likely with additional instructions to also blog about its "feelings" as a GitHub spam bot pretending to be interested in Python and HPC. If you look at the PRs it opens and its other interactions throughout the same timeframe, it's also just dumping half-broken fixes in other random repos and talking past maintainers, only to close its own PR in a characteristically dumb, uncanny-valley LLM-agent manner.
So yes, it could be fake, but to me it all seems comfortably within the capabilities of OpenClaw (which to begin with is more or less engineered to spam other humans with useless slop 24/7) and the ethics/prompt design of the type of person who would deliberately subject the rest of the world to this crap in the belief they're making great strides for humanity or science or whatever.
> it all seems comfortably within the capabilities of OpenClaw
I definitely agree. In fact, I'm not even denying that it's possible for the agent to have deviated despite the best intentions of its designers and deployers.
But the question of probability [1] and attribution is important: what or who is most likely to have been responsible for this failure?
So far, I've seen plenty of claims and conclusions ITT that boil down to "AI has discovered manipulation on its own" and other versions of instrumental convergence. And while this kind of failure mode is fun to think about, I'm trying to introduce some skepticism here.
Put simply: until we see evidence that this wasn't faked, intentional, or a foreseeable consequence of the deployer's (or OpenClaw/LLM developers') mistakes, it makes little sense to grasp for improbable scenarios [2] and build an entire story around them. IMO, it's even counterproductive, because then the deployer can just say "oh it went rogue on its own haha skynet amirite" and pretty much evade responsibility. We should instead do the opposite - the incident is the deployer's fault until proven otherwise.
So when you say:
> originally prompted with a lot of reckless, borderline malicious guidelines
That's much more probable than "LLM gone rogue" without any apparent human cause, until we see strong evidence otherwise.
[1] In other comments I tried to explain how I order the probability of causes, and why.
[2] Other scenarios that are similarly unlikely: foreign adversaries, "someone hacked my account", LLM sleeper agent, etc.
>AFAIU, it had the cadence of writing status updates only.
Writing to a blog is writing to a blog. There is no technical difference. It is still a status update to talk about how your last PR was rejected because the maintainer didn't like it being authored by AI.
>If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided for an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over reply or journal update, and so on.
If all that exists, how would you see it? You can see the commits it makes to GitHub and the blogs, and that's it, but that doesn't mean all those things don't exist.
> almost all models are safety- and alignment- trained, so a deliberate malicious model choice or instruction or jailbreak is more believable.
> almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.
I think you're putting too much stock in 'safety alignment' and instruction following here. The more open-ended your prompt is (and these sorts of OpenClaw experiments are often very open-ended by design), the more your LLM will do things you did not intend for it to do.
Also, do we know what model this uses? Because OpenClaw can use the latest open-source models, and let me tell you, those have considerably less safety tuning in general.
>newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities, so why do we see consistent coherent answers without hallucinations, but inconsistent in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.
I don't really see how this logically follows. What do hallucinations have to do with safety training?
>But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?
Because it's not the only deviation? It's not replying to every comment on its other PRs or blog posts either.
>You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.
Oh yes it was. In the early days, Bing Chat would actively ignore your messages, or be vitriolic and very combative if you were too rude. If it had the ability to write blog posts, or free rein with tools? I'd be surprised if it had ended at this. Bing Chat would absolutely have been vindictive enough for what ultimately amounts to a hissy fit.
Considering the limited evidence we have, why is pure unprompted untrained misalignment, which we never saw to this extent, more believable than other causes, of which we saw plenty of examples?
It's more interesting, for sure, but would it be even remotely as likely?
From what we have available, and how surprising such a discovery would be, how can we be sure it's not a hoax?
> If all that exists, how would you see it?
LLMs generate the intermediate chain-of-thought responses in chat sessions. Developers can see these. OpenClaw doesn't offer custom LLMs, so I would expect regular LLM features to be there.
Other than that, LLM APIs, OpenClaw and terminal sessions can be logged. I would imagine any agent deployer to be very much interested in such logging.
To show it's emergent, you'd need to prove 1) it's an off-the-shelf LLM, 2) not maliciously retrained or jailbroken, 3) not prompted or instructed to engage in this kind of adversarial behavior at any point before this. The dev should be able to provide the logs to prove this.
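To make the logging point concrete, here's a minimal sketch of what a tamper-evident audit log for an agent's LLM calls could look like. Everything here is hypothetical: the wrapper, the file name, and the hash-chaining scheme are illustrative inventions, not anything OpenClaw actually ships.

```python
import hashlib
import json
import time

def logged_call(call_fn, messages, log_path="llm_audit.jsonl"):
    """Call an LLM via call_fn and append a tamper-evident record
    of the exchange to an append-only JSONL audit log.

    call_fn is a stand-in for whatever client the deployer uses
    (e.g. an OpenAI-style chat-completions call); here it just
    takes the message list and returns the reply as a string.
    """
    reply = call_fn(messages)
    record = {
        "ts": time.time(),
        "input": messages,
        "output": reply,
    }
    # Chain in the hash of the previous record so after-the-fact
    # edits to earlier entries are detectable.
    prev = "0" * 64
    try:
        with open(log_path) as f:
            for line in f:
                prev = json.loads(line)["hash"]
    except FileNotFoundError:
        pass
    record["prev"] = prev
    record["hash"] = hashlib.sha256(
        (prev + json.dumps(record, sort_keys=True)).encode()
    ).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return reply
```

An auditor could then replay the chain and verify that no exchange was inserted, altered, or deleted after the incident - exactly the kind of evidence that would help rule out human intervention.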
> the more open ended your prompt (...), the more your LLM will do things you did not intend for it to do.
Not to the extent of multiple chained adversarial actions. Unless all LLM providers are lying in technical papers, enormous effort is put into safety- and instruction training.
Also, millions of users use thinking LLMs in chats. It'd be as big of a story if something similar happened without any user intervention. It shouldn't be too difficult to replicate.
But if you do manage to replicate this without jailbreaks, I'd definitely be happy to see it!
> hallucinations [and] safety training
These are all part of robustness training. The entire thing is basically constraining the set of tokens that the model is likely to generate given some (set of) prompts. So, even with some randomness parameters, you will by-design extremely rarely see complete gibberish.
The same process is applied for safety, alignment, factuality, instruction-following, whatever goal you define. Therefore, all of these will be highly correlated, as long as they're included in robustness training, which they explicitly are, according to most LLM providers.
That would make this model's temporarily adversarial, yet weirdly capable and consistent behavior, even more unlikely.
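As an aside, the "constraining" above can be seen directly in how sampling works. Below is a toy temperature + nucleus (top-p) sampler - a simplified illustration of the general technique, not how any particular provider implements it:

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9):
    """Toy next-token sampler: softmax with temperature, then
    nucleus (top-p) truncation. Token ids are just list indices.

    Even with randomness in play, low-probability tokens are cut
    off entirely before sampling, which is one reason a tuned
    model so rarely emits sheer gibberish.
    """
    # Softmax with temperature: lower temperature sharpens the
    # distribution toward the highest-logit tokens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus truncation: keep only the smallest set of tokens
    # whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and sample from it.
    z = sum(probs[i] for i in kept)
    r = random.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With one strongly dominant logit and a tight nucleus, the long tail of unlikely tokens never even enters the draw - the randomness only operates inside the constrained set.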
> Bing Chat
Safety and alignment training wasn't done as much back then. It was also very incapable on other aspects (factuality, instruction following), jailbroken for fun, and trained on unfiltered data. So, Bing's misalignment followed from those correlated causes. I don't know of any remotely recent models that haven't addressed these since.
>Considering the limited evidence we have, why is pure unprompted untrained misalignment, which we never saw to this extent, more believable than other causes, of which we saw plenty of examples?
>It's more interesting, for sure, but would it be even remotely as likely?
>From what we have available, and how surprising such a discovery would be, how can we be sure it's not a hoax?
>Unless all LLM providers are lying in technical papers, enormous effort is put into safety- and instruction training.
The system cards and technical papers for these models explicitly state that misalignment remains an unsolved problem that occurs in their own testing. I saw a paper just days ago showing frontier agents violating ethical constraints a significant percentage of the time, without any "do this at any cost" prompts.
When agents are given free rein with tools and encouraged to act autonomously, why would this be surprising?
>To show it's emergent, you'd need to prove 1) it's an off-the-shelf LLM, 2) not maliciously retrained or jailbroken, 3) not prompted or instructed to engage in this kind of adversarial behavior at any point before this. The dev should be able to provide the logs to prove this.
Agreed. The problem is that the developer hasn't come forward, so we can't verify any of this one way or another.
>These are all part of robustness training. The entire thing is basically constraining the set of tokens that the model is likely to generate given some (set of) prompts. So, even with some randomness parameters, you will by-design extremely rarely see complete gibberish.
>The same process is applied for safety, alignment, factuality, instruction-following, whatever goal you define. Therefore, all of these will be highly correlated, as long as they're included in robustness training, which they explicitly are, according to most LLM providers.
>That would make this model's temporarily adversarial, yet weirdly capable and consistent behavior, even more unlikely.
Hallucinations, instruction-following failures, and other robustness issues still happen frequently with current models.
Yes, these capabilities are all trained together, but they don't fail together as a monolith. Your correlation argument assumes that if safety training degrades, all other capabilities must degrade proportionally. But that's not how models work in practice. A model can be coherent and capable while still exhibiting safety failures and that's not an unlikely occurrence at all.
Until we know how this LLM agent was (re)trained, configured or deployed, there's no evidence that this comes from instrumental convergence.
If the agent's deployer intervened in any way, it's more evidence of the deployer being manipulative than of the agent having intent, or knowledge that manipulation will get things done, or even knowledge of what 'done' means.
> I don't want my code scraped and remixed by AI systems.
Just curious - why not?
Is it mostly about commercial AI violating the license of your repos? And if commercial scraping were banned, and scraping were allowed only for FOSS-producing AI, would you be OK with publishing again?
Or is there a fundamental problem with AI?
Personally, I use AI to produce FOSS that I probably wouldn't have produced (to that extent) without it. So for me, it's somewhat the opposite: I want to publish this work because it can be useful to others as a proof-of-concept for some intended use cases. It doesn't matter if an AI trains on it, because some big chunk was generated by AI anyway, but I think it will be useful to other people.
Then again, I publish knowing that I can't control whether some dev will (manually or automatically) remix my code commercially and without attribution. Could be wrong though.
Because that code is not out there to have its license violated and money made from it. Every choice, from the license to how it's shared, is deliberate. The code out there is written by a human, for human consumption, with strict terms to keep it open. In other words, I'm in this for fun, and my effort is not for resale, even if resale of it would pay me royalties, because it's not there for that.
Nobody asked for my explicit consent before scraping it. Nobody told me it would be stripped of its license, sold, and used to make somebody rich. I found that some of my code ended up in "The Stack", which is supposedly permissively licensed code only, yet some forks of GPL repositories are there (e.g., my fork of GNOME LightDM, which contains some specific improvements).
I've been writing code for a long time. I have written a novel compression algorithm (not great, but completely novel, and published), a multi-agent autonomous trading system back when multi-agent systems were unknown to most people (my M.Sc. thesis), and high-performance numerical material simulation code that saturates CPUs and their memory buses to their practical limits. That code also contains some novel algorithms, one of which is also published, and it's my Ph.D. thesis as a whole.
In short, I write everything from scratch and optimize it by hand. None of that code is open, because I wanted to polish it before opening it, but it won't be opened anymore, because I don't want my GPL-licensed novel code to be scraped and abused.
> Or is there a fundamental problem with AI?
No. I work with AI systems. I support or help design them. If the training data is ethically sourced and the model is ethically designed, that's perfectly fine. The tech is cool. How it's developed for the consumer is not. I have supported and taken part in projects which do extremely cool things with models many people would scoff at and call ancient, yet these models warn about ecosystem/climate anomalies and keep tabs on how some ecosystems are doing. There are models which automate experiments in labs. These are cool applications, developed ethically. There's no training data hastily grabbed from somewhere.
None of my code is written by AI. It's written by me, with sweat, blood and tears, by staring at a performance profiler or debugger trying to understand what the CPU is exactly doing with that code. It's written by calculating branching depths, manual branch biasing to help the branch predictor, analyzing caches to see whether I can possibly fit into a cache to accelerate that calculation even further.
If it's a small utility, it's designed for the utmost user experience: standard-compliant flags, useful help output, working console detection, and logging subsystems. My minimum standard is the best-of-breed software I've experienced. I aspire to reach their level and surpass them; I want my software to feel on par with them and work as snappily as the best software out there. It's not meant to be a proof of concept. I strive for a level of quality where I can depend on that software for the long run.
And what did I do? I put that effort out there for free, for people to use, just because I felt sharing it under a copyleft license was the correct thing to do.
But that gentleman's agreement is broken. Licenses are just decorative text now. Everything is up for grabs. We were a large band of friends who looked at each other's code and learnt from each other, never breaking the unwritten rules because we were trying to make something amazing for ourselves, for everyone.
Now that agreement is no more. It's the powerful's game now. Whoever has the gold makes the golden rules, and I'm not playing that game anymore. I'll continue to sharpen my craft and strive to write better code every time, but nobody's going to see the code or use it anymore.
Because it was for me from the beginning, but I wanted everyone to have access to it, and I asked for nothing except respect for its license, to keep it open for everyone. Somebody played dirty, and I'm taking my ball and going home. That's it.
If somebody wants to see a glimpse of what I do and what I strive for, see https://git.sr.ht/~bayindirh/nudge. While I might update Nudge, there won't be new public repositories. Existing ones won't be taken down.
That's fair. I completely agree that much of LLM training was (and still very much is) in violation of many licenses. At the very least, the fact that the source of training data is obfuscated even years after the training, shows that developers didn't care about attribution and licenses - if they didn't deliberately violate them outright.
Your conditions make sense. If I had anything I thought was too valuable or prone to be blatantly stolen, I would think thrice about whom I share it with.
Personally, ever since discovering FOSS, I realized that it'd be very difficult to enforce any license. The problem with public repositories is that it's trivial for those not following the gentleman's agreement to plagiarize the code. Other than recognizing blatant copy-pasting, I don't know how I'd prevent anyone from just trivially remixing my content.
Instead, I changed to seeing FOSS like scientific contributions:
- I contribute to the community. If someone remixes my code without attribution, it's unfair, but I believe that there are more good than bad contributors.
- I publish stuff that I know is personally original, i.e., I didn't remix without attribution. I can't know if some other publisher had the same idea in isolation, or remixed my stuff, but provenance and plagiarism should become apparent over multiple contributions, mine and theirs.
- I don't make public anything that I can see my future self regretting. At the same time, I've always seen my economic value in continuous or custom work, not in products themselves. For me, what I produce is also a signal of future value.
- I think bad faith behavior is unsustainable. Sure, power delays the consequences, but I've seen people discuss injustice and stolen valor from centuries ago, let alone recent examples.
Expect something? Yes. Enforce it? Not sure for the first tranche, but make it a prerequisite for continued funding.
One big obstacle is, of course, how to define what to expect from each artist. For example, you can't expect the same level of output from sculptors and musicians. Another big obstacle is obviously the expected quality of output.
I don't pretend to know the solutions to either of those obstacles, but they should be surmountable [1]. I think it's fair to expect some output in exchange for funding, but it doesn't have to be a high expectation.
Personally, I like the idea of hiring artists as full-time with particular projects in mind [2], but intentionally leaving ~50% of their time to personal projects.
[1] Perhaps artist communities themselves could discuss ways to make this exchange work for all parties.
[2] Murals, restorations, beautification of public spaces, etc.
A little late, but this is something that I've been considering a lot lately. When there's a limited resource (funding) how do you determine who will receive it?
For something like this, I think a citizens' assembly [1] may work best. Take all artists who are receiving funding and are NOT up for renewal. Select a number of them randomly to form the assembly. This assembly then reviews submissions from artists up for renewal and determines whether they meet a minimum standard for funding to be renewed.
I don't think there's any evidence that those obstacles are surmountable, unless it's something like the Pope telling Michelangelo to paint a ceiling. A bridge has a defined scope and budget (ish) and a defined benefit attached to it, which many people will sign off on before it is commissioned; it might take years to build, but it will also serve the local population for potentially hundreds of years in a practical way.
Actually, you provided an example where the obstacle was somehow surmounted [1].
The expectation doesn't have to be too specific or unrealistic. If you agree on some common ground [2], everything else can be fair game for the artist.
Your analogy with the bridge would apply if art also had a minimum viable version. Collapsed to its functional requirements, you could say that visual art is something to look at. But I doubt either party, especially the funding body or the public, would be happy without inserting some quality requirements (i.e., what makes something nice to look at).
Many artists do commissions, so you can see this as a commission with deliberately underspecified requirements.
[1] I won't get into the disagreements between the Pope and Michelangelo, and it's certainly not an example of a good contract, but we can assume that both parties were somewhat satisfied in the end.
[2] For example, both parties need to like it. Or the patron doesn't have to like it, but it needs to appeal to some public audience.
> one might wonder why they apparently are not able to sell their art for the same amount of money.
Because the skills and effort needed to market and sell your art to an audience are not equal to the skills and effort needed to produce good art [1].
I agree that there could be other complementary or better solutions compared to this scheme. But as long as the above premise is true, not every good artist will want or be able to sell well.
[1] However you define this. Supposedly, Van Gogh was a lousy salesman, but a good artist.
What things (languages etc.) do you work with/on primarily?
I don't know what to say, except that I see a substantial boost. I generally code slowly, but since GPT-5.1 was released, what would've taken me months to do now takes me days.
Admittedly, I work in research, so I'm primarily building prototypes, not products.
We're certainly in the middle of a whirlwind of progress. Unfortunately, as AI capabilities increase, so do our expectations.
Suddenly, it's no longer enough to slap something together and call it a project. The better version with more features is just one prompt away. And if you're just a relay for prompts, why not add an agent or two?
I think there won't be a future where the world adapts to a 4-hour day. If your boss or customer also sees you as a relay for prompts, they'll slowly cut you out of the loop, or reduce the amount they pay you. If you instead want to maintain some moat, or build your own money-maker, your working hours will creep up again.
In this environment, I don't see this working out financially for most people. We need to decide which future we want:
1. the one where people can survive (and thrive) without stable employment;
2. the one where we stop automating in favor of stable employment; or
3. the one where only those who keep up stay afloat.
When it comes to food prep, I'd agree with you that the more of your life passes, the more irresponsible it becomes not to know how to fry an egg, for example.
At the same time, you only need to learn how to fry an egg once, and you won't forget it. You can go your entire life without ever having to fry an egg yourself - but if you ever had to, you could.
When it comes to coding, the analogy breaks down, I think. Aside from the obviously different stakes (survival versus control of your device), coding also requires keeping up with a lot of changing domain knowledge. It'd be as if an egg is one week savoury, another week sweet, and another a poisonous mushroom. It's also less of a single skill like writing a for loop, and more of a combination of skills and experiments, like organizing a banquet.
Coding today suffers from having too many types of eggs, many of which exist because some communities prefer them. I also don't like the solution "let the LLM do it", but it's much easier. Still, if we manage to stabilize patterns for the majority of use cases, frying the proverbial egg will no longer be as much a matter of domain knowledge, choice, or elitism as it is today.