Hacker News | past | comments | ask | show | jobs | submit | achierius's comments

One of those rare papers where the code speaks for itself. They do a bunch of comparisons, but the most salient is comparing Karpathy's autoresearch (verbatim, as best I can tell) vs. some HPO algorithms, and as of yet the Tree-structured Parzen estimator still wins out -- but just barely!

More interesting though is that the best results come from 'centaur' approaches, where an LLM is hooked up with a standard HPO. Somewhere around 1:3 LLM:HPO control seems to work best, with more LLM control degrading performance. But either way this method far outperforms either the naive autoresearch loop or the bare HPO approach.


> Centaur outperformed all methods including CMA-ES alone by using the LLM on only 30% of trials. The LLM receives CMA-ES's full internal state (mean vector, step-size, covariance matrix), the top-5 configurations, and the last 20 trials. A 0.8B LLM already suffices to outperform all classical and pure LLM methods. Scaling from 0.8B (0.9766) to 27B (0.9763) to Gemini Pro (0.9767) yields no improvement, suggesting a capability plateau [which Claude slightly beats]

> We ablate the LLM ratio: higher ratios degrade performance, confirming that CMA-ES should retain majority control.
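For intuition, here's a toy sketch of that centaur split. Everything here is illustrative: the 1-D objective, the (1+1)-ES standing in for CMA-ES, and `llm_propose`, which only mocks the LLM call (in the paper's setup the LLM sees the optimizer's internal state, top-5 configs, and last 20 trials).

```python
import random

def objective(x):
    # Toy 1-D objective standing in for validation accuracy; maximized at x = 2.
    return -(x - 2.0) ** 2

def llm_propose(history):
    # Stand-in for an LLM call: here it just nudges around the best trial so far.
    best_x, _ = max(history, key=lambda t: t[1])
    return best_x + random.gauss(0, 0.1)

def centaur(n_trials=100, llm_ratio=0.3, seed=0):
    random.seed(seed)
    mean, sigma = 0.0, 1.0  # minimal (1+1)-ES state standing in for CMA-ES
    history = [(mean, objective(mean))]
    for _ in range(n_trials):
        if random.random() < llm_ratio:
            x = llm_propose(history)       # LLM controls ~30% of trials
        else:
            x = random.gauss(mean, sigma)  # the ES controls the rest
        y = objective(x)
        history.append((x, y))
        if y >= max(h[1] for h in history[:-1]):
            mean = x          # success: move toward the improvement
            sigma *= 1.1
        else:
            sigma *= 0.95     # failure: shrink the step size
    return max(history, key=lambda t: t[1])

best_x, best_y = centaur()
print(best_x, best_y)
```

The key design point from the quote survives even in this toy: the classical optimizer retains majority control, and the LLM only occasionally overrides its proposals.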


I think a lot of non-vibe-coding types also hold similar opinions -- in fact they might dislike Anthropic products even more, given that they (however few they might be) choose not to use them.

You honestly think “Anthropic employees are script kiddies with inflated egos that are high on their own supply” is a reasonable stance?

This seems like such an immature take to me, and hard to take seriously. Is Anthropic just a bunch of script kiddies? Really?


Claude Code is a vibe-coded product that doesn't seem to be undergoing regression tests.

It looks like they're running it in loops and then shipping whatever looks the coolest.

How is this not "high on own supply"?


Why the insults/hostility? Why call them script-kiddies? Why the inflated egos?

How do you know what testing procedures they use? Do you honestly think they're running some kind of Ralph loop without any testing and just ship whatever looks the coolest? Really?


> How do you know what testing procedures they use?

We don’t, but we can see the end result, so we know whatever they do isn’t adequate and it suggests they value shipping fast over quality or even listening to customer feedback.

> Do you honestly think they're running some kind of Ralph loop without any testing and just ship whatever looks the coolest? Really?

No, but given how sharply the quality has been dropping over the past few months and how it suspiciously coincided with the time they admitted that Claude code is now 100% vibe coded, it certainly doesn’t feel too far off.

I’ve personally found the code that the AI writes, even this week (i.e. not some old models from months ago), to be shockingly shoddy. I’ve rewritten some AI code (created via spec-driven development and a workflow that includes planning and refactoring) by hand, and I’ve been very conscious of the number of micro-design-changes I as a human make where the AI just blows forward, shoehorning a solution into the design. My implementation has adjusted and shifted many times to ensure clear and performant logic, while the AI commits to an approach early and applies whatever brute force is necessary to make it work. I’ve also asked it to write various tests for me or to make isolated changes, and quite frankly the code was just not very good. Working, but convoluted. Even with guidance and iteration, it’s still not on a human level.

So it’s not hard to see that if you have an application as large and complex as Claude code and you let the AI do it all, that it’s going to be a mess.

I’m not against using AI for development, but you have to be realistic about its capabilities. I feel like this is where they “got high on their own supply” and are blinded to the AI’s shortcomings and failures.


They’ve said themselves that Claude code is 100% vibe coded now. That certainly meets the criteria of “script kiddies” and “high on their own supply”. The negative connotations are there on purpose because of the bugs and issues that these products have, something which presumably they wouldn’t have if there was human oversight and acknowledgement that the AI isn’t infallible.

> They’ve said themselves that Claude code is 100% vibe coded now. That certainly meets the criteria of “script kiddies”

That's not what script kiddies are at all.

> The negative connotations are there on purpose because of the bugs and issues that these products have, something which presumably they wouldn’t have if there was human oversight and acknowledgement that the AI isn’t infallible.

That's a big assumption, given that Anthropic is also currently growing by more than 3x per quarter. Maybe the problem is more complicated and we don't know everything, and they're also just simply suffering from growth pains?


> That's not what script kiddies are at all.

Sure it is. The new age of script kiddies: they don’t know how to do it for themselves, but they can run a script (or tell the AI to) to do it for them.

> That's a big assumption

We can only see the results, which are more and more bugs, problems, regressions, etc. That’s not normal behavior. Yes all we can do is speculate, we don’t know the real reasons for the issues, but it’s clear there are issues and they appear to be getting worse.


> You honestly think “Anthropic employees are script kiddies with inflated egos that are high on their own supply” is a reasonable stance?

Maybe not the script kiddies part, but "high on their own supply" is certainly not unreasonable.


I don’t understand the hostility and insulting tones being reasonable now.

The comment is not at all just saying “their usage of their own AI is causing these issues”; it’s just a lot of hostility. I don’t see the value of these kinds of insults.


> I don’t understand the hostility and insulting tones being reasonable now.

Maybe it's just interpretation: "high on their own supply" is no different from "poisoned by their own dogfood" or similar.

It means that they have completely committed to a thing that the person proffering the quote thinks is "wrong" in some way.


lol "hostility" - they sell a very high profile product and the issues seem to reflect bad engineering culture. therefore, I say their culture smells bad.

I just want you to know that I read over this thread and you are obviously completely right. This sort of incurious, immature stance is something I've seen become the norm on HN over the last few years, particularly when it comes to AI.

I am neither immature nor incurious.

The fact that this was their "malware checker" is proof they don't realistically use their LLM and that they aren't actually using engineering rigor.


I didn't say anything like that! Like I said I just don't think that this opinion is somehow associated with "vibe coders"; if anything I'd expect the opposite.

Seems reasonable to me

Not if the big labs have anything to say about it! They're working to fix the 'problem', and with Mythos we no longer have any guarantees that the frontier will even be available to distill.

Do you not think people here work at big companies with big products? I do, and we have a much higher bar for shipping.

>> My overall feel is that people underestimate the complexity of the systems at Anthropic and the chaos of the growth.

> Do you not think people here work at big companies with big products? I do, and we have a much higher bar for shipping.

This form of comment (The "Do you not think {X}?") comes across as a swipe (discouraged by the HN guidelines). It doesn't respond to the strongest plausible interpretation of my comment (also in the guidelines).


That's fair. I'll adjust and say that I think there's a mix: some people certainly are bashing without understanding, but there are also a lot of engineers here whose day to day work is held to a higher standard than I think we see coming out of Anthropic, at least w.r.t. the product side of things (obviously the models are great).

Thanks. Along those lines, here's a sort of thought experiment. Of said engineers who know a higher standard, say we teleported them into Anthropic, what are some likely scenarios?

- How much time would they need to import their standards into Anthropic? ... things like tooling, process, culture, hiring, etc? Maybe externally-sourced discipline and rigor are the missing catalysts. [1]

- OTOH, it seems possible these engineers (many of whom are used to certain levels of stability, sanity, internal tooling, etc.) would be destabilized by Anthropic's problems, the scale, the rate of hiring, the rate of customer growth.

- Perhaps Anthropic needs new instrumentation to cover end-to-end customer metrics? More internal tool-building teams? A new ops team? A new org structure? I don't know.

The growth and the environment have put Anthropic into a position where these kinds of mistakes are just statistically inevitable ... unless they choose to grow more slowly.

So my overall hunch (very few people really grok the constellation of factors at Anthropic) is fuzzy. That's why I'm trying to lay out some of the questions that underlie it, without resorting to simplistic notions of blame (which paper over the deeper causes).

Lastly, can you think of comparable scenarios with this kind of growth where companies don't have major hiccups? This is driving towards thinking about the outside view [2]. Roughly speaking: don't expect to "beat the market" for long. Entropy wins.

[1]: I recently watched a video where Steve Jobs described a time in early Macintosh history where Apple tried to "professionalize" its management. Hiring proven managers didn't work, so they shifted towards hiring for cultural fit and letting them grow the management skills.

[2]: https://www.lesswrong.com/w/inside-outside-view


What people dislike is the boom-bust cycle inherent to all levels of a market economy. During some years, these companies suck people up like a vacuum -- that can be bad if you're on the inside and all of a sudden the culture goes out the window, or if you're expected to onboard 3-4 people at the same time, or you end up with a reorg every quarter. Then, on the other end of the spectrum, companies shut down (non-backfill) hiring entirely and layoff huge percentages of the company, with no guarantee that you'll be safe just because you're doing a good job.

Human lives do not work like this. If you're getting married, if you have an unexpected hospital expense, if you want to buy a house -- these are not things that "market cycles" will plan around, but you have to.

Being quick to hire or fire is not the problem. Massive overhiring and massive layoffs are.


Maybe not 1/10, but definitely on-the-order-of 1/4th or 1/6th as many.

We aren't building dozens of new datacenters to host more webapps.

> Assembly to Python creates a lot of Intent & Cognitive debt by his definition, because you didn't think through how to manipulate the bits on the hardware, you just allowed the interpereter to do it

I agree! You often see this realized when projects slowly migrate to using more and more ctypes code to try and back out of that pit.

In a previous job, a project was spun up using Python because it was easier and the performance requirements weren't understood at the time. A year or two later it had become a bottleneck for tapeout, and when it was rewritten most of the abstract architecture was thrown out with it, since it was all Pythonic in a way that required a different approach in C++.
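As an illustration of that "backing out" pattern, here is a minimal ctypes sketch. The C math library stands in for a project-specific compiled hotspot; a real project would wrap its own shared library the same way.

```python
import ctypes
import ctypes.util

# Load a native library. libm is just a stand-in for the kind of
# compiled hotspot a Python project migrates to when it hits a wall.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# ctypes assumes int arguments/returns by default; declare the real
# signature explicitly or the double will be silently mangled.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

The catch, as the comment notes, is that this only moves individual calls across the boundary; the surrounding Pythonic architecture usually has to be rethought entirely for a native rewrite.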


GP meant moving the driver into userspace, which is much less painful due to the stable userspace APIs.

I’m not sure the GP did mean that, but I agree it’s a much better solution than maintaining an out-of-tree kernel module, which is generally a really bad idea

> And the fact that having outline calls to methods of value objects is so expensive

Is this tied to unions? Or otherwise, when does this happen? I don't see the connection w/ invisicaps or &c


In Fil-C, currently, all stack allocations that “escape” need to be allocated in the heap.

“Escape” is defined very loosely; it currently means: some function other than the one that owns the stack allocation needs a pointer to that allocation.

For example, even if you could prove that `bar(Value* p)` never stashes p anywhere, the Fil-C compiler will currently heap-allocate that value any time bar is called. The one exception is if bar had already been inlined, and so from the FilPizlonator’s perspective there isn’t even a call.

This is clearly dumb and fixable. It’s dumb because lots of functions aren’t worth inlining but their body is analyzable. Slow paths are like that. It’s fixable because those slow paths - and lots of code like them - take ptrs as arguments and then obviously just use them for loads and stores but don’t escape them any further.

You’ll sometimes hear me say that Fil-C is nowhere near as optimal as it could be. This is just one example of that.

