Hacker News | andai's comments

I keep trying to switch to the Chinese models, but I keep finding myself asking Claude to fix their outputs. (Both functionality and style.) So I always end up switching back.[0]

I also keep trying GPT, which is quite solid. Very fast, great at debugging. But its code is often overly clever and hurts my brain.

(Maybe fixable with prompting. I tried and it helped the Chinese ones a bit. Just tell them to be elegant, like in the old image AI days "+good -bad"!)

For now I do still need my human brain to actually be able to make sense of the stuff, and Claude is the only one that consistently meets that requirement.

But I am hoping that one of these days, one of the Chinese labs figures out the special sauce :)

--

[0] (For smallish edits, though, I am having a great time with DeepSeek Flash. Practically unlimited AI on tap! How cool is that.)


They should have made it three times bigger instead of two.

Alternatively you can just give it its own user. I do that, so it can blow up its own files, but not mine.
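
A minimal sketch of the "own user" idea, assuming a dedicated account (here called `agent`, a made-up name) already exists and that `sudo` is set up to let you run commands as it. The helper name `as_agent_user` and the `my-coding-agent` command are placeholders, not a real tool:

```python
import shlex


def as_agent_user(command, user="agent"):
    """Wrap `command` so it runs as `user` via sudo instead of as you.

    `user` is a hypothetical dedicated account; create it beforehand.
    """
    if not command:
        raise ValueError("command must not be empty")
    # -u picks the target user, -H points HOME at that user's home directory,
    # and `--` stops sudo from interpreting the agent's own flags.
    return ["sudo", "-u", user, "-H", "--", *command]


if __name__ == "__main__":
    # "my-coding-agent" is a placeholder, not a real CLI.
    wrapped = as_agent_user(["my-coding-agent", "--workdir", "/home/agent/project"])
    print(shlex.join(wrapped))
```

The process then only has the file permissions of that account, so it can wreck its own home directory but not yours, provided your files are not world-writable.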

>I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.

Yeah, that's why you give it its own machine :)


I wrote this to a friend in 2022:

Here's an idea: reverse kickstarter

1. people post ideas

2. good ideas go viral

3. people pledge actual money to encourage someone to step forward and build it

4. interested creators make kickstarter type videos explaining their proposal for making the thing

5A. people vote on which proposal to accept, or maybe

5B. each backer can select a project to support

---

Here steps 4 and 5 are replaced by Claude.

Cool idea!


Shouldn't we still have people in the loop for selecting/proposing the best implementation (plan)? Vibe coding an entire solution from a prompt still doesn't feel like the optimal way to write software.

At some point you have to say "Is not having it better than having it?" Where's your dude, today, who's gonna code this? If it were gonna happen, it would've.

We are not there yet... and Fable makes me feel like it will be a while before we get there.

It also occurs to me now that the Claude version has about 3 fewer zeros at the end of the funding target.

That seems to massively lower the bar for people investing.


not with prompts but "loops" maybe.

I often hear people say lately, "why should I bother to read this, if you didn't even think it was worth writing?"

I've been thinking about this in art. Is it the end result that matters, or the process of creating it?

I once saw a hideous sculpture. Didn't like it at all. Then the video zoomed and I saw that the whole thing (quite massive) had been hand-built out of individual toothpicks, and suddenly I thought it was amazing.

Perhaps an even better example: I read a story of a man in India who carved a passage through a mountain, so there would be a shorter route from his remote village to the city. He did it by hand and it took him 20 years. We seem to have an instinctive admiration for heroic effort.

In business, generally only the end result matters. Although, the end result also includes the client's perception of how the product was made... (see also: fake fairtrade etc.) In a meaningful way, the perception, the story, is reality.


I don't think it's a matter of process vs end result. I just want to feel that a human with taste judged that it was worth my attention.

If a human put some effort into it, that's a signal.


This is mostly what it is for me too. We're all awash in an information deluge, and we need heuristics to keep from drowning. Human effort, proof-of-work if you will, is a heuristic that helps with the AI-generated part of the deluge.

Your boss cares only about the end result. Good engineers care about the process too.

> Is it the end result that matters, or the process of creating it?

I think this comment misses the point. Let's forget about AI and assume that there are three developers: A, B, and C. Now, A is supposed to make a PR, but instead they describe it to B, and B writes the code. C reviews the PR and gives feedback. A passes the feedback and the responses between B and C.

As you see, this is not easy for either B or C, and A is totally useless in this scenario. When you replace B with an LLM that doesn't get tired or bored, only C complains about the process.


> Is it the end result that matters, or the process of creating it?

One of the main reasons that art is valuable is in its ability to communicate emotions. Good art has the ability to serialize emotions within the artist and deserialize them within the mind of the viewer. It's not just "wow, this is a pretty picture", it's "wow, this is how another person sees the world, and now that I understand that, I feel an intimate connection with them".


> Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.

The model isn't allowed to think about security. I heard several people here mention that if it starts thinking about security -- e.g. writing tests related to it -- the safety filter flags it and downgrades to Opus.

So it's actually not allowed to make your code secure.


A reviewer can only test the model they have access to. They should not speculate about what the model could have done without provider tampering. I think Anthropic's mistake here was not calling it Fable 5 Preview, because now people can write headlines about how Fable 5 is worse than Opus.

Yeah. Fable apparently found bugs in my C code but Anthropic wouldn't allow it to test them, fix them or even tell me what the problem was. The memory safety parts of my Fable code review were 50% Opus. Even the coordinator Fable that just launched the code review agents got downgraded to Opus for some reason.

Model is definitely better than Opus but Anthropic's delivering a pretty terrible experience.


> So it's actually not allowed to make your code secure.

Anything designed to prevent a problem will eventually cause one.


Interesting. The reasoning models were super weird and robotic. They toned that down a bit in GPT-5.x, especially the later ones.

I always assumed the strange style was an artefact of the RLVR.


I think they were extremely scared of 4o at that point, and were scared it could trigger some horrible event. Documented cases of severe psychosis because of AI started to surface at that time.

Just imagine what would've happened if a major terrorist attack was a result of someone getting mentally ill from AI, without the safety filters recognizing the danger.

The robotic tone was probably from over-correcting the sycophantic tendencies of 4o.


I think they've brought back a "personality" of sorts to ChatGPT 5.x. I've caught it more than once explaining something to me and saying "In my personal opinion", or "I personally enjoy <thing> the most". Which is always jarring, it doesn't "personally" or "enjoy" anything. We could be discussing videogames and it tells me which games "it personally enjoys the most". Bizarre.

What's the point of the scratch pad? Isn't the same data already in the context? Or does it help because contexts are lossy and bias towards the start and end?

Similar question with the to-do list. Do they actually help task completion? Is there any research on that? I think they're less helpful with more recent models, but maybe they still help with smaller ones?

The system prompt asking it to make a plan before starting work does sound helpful though. (Of course it would also be great to see numbers there :)


Hi,

There are a few benefits to using a scratchpad or any other external platform:

- You stay agnostic of the LLM you use. Tomorrow, if the prices of your LLM explode, you can switch to another LLM by pointing it at your scratchpad repository. Then, depending on how verbose your scratchpad (or other store) was, you avoid losing time re-explaining everything to the new LLM.

- You avoid the "summarized" effect caused by context compaction events. This effect makes precise, and so potentially important, information a bit more blurry (numbers turned into adjectives, etc.). Scratchpad, Obsidian, or any other external solution you might imagine would act as "case fact blocks", a solution recommended to mitigate that effect and keep the accurate information available. You can imagine a system where you ask your LLM to read some files from your external storage after each compaction, for example with a hook or anything else.
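
To make the compaction point concrete, here is a rough sketch of a file-backed scratchpad. Everything in it is an assumption for illustration: the `Scratchpad` class, the one-fact-per-line `key: value` format, and the `<case_facts>` wrapper are made up, not any tool's actual format. The idea is only that exact values are written to disk once and re-read verbatim after each compaction:

```python
from pathlib import Path


class Scratchpad:
    """Append-only store of exact facts, kept outside the model's context."""

    def __init__(self, path):
        self.path = Path(path)

    def add_fact(self, key, value):
        # One fact per line, stored verbatim so numbers stay numbers.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(f"{key}: {value}\n")

    def as_context_block(self):
        # Text to re-inject after a compaction; empty if nothing is stored.
        if not self.path.exists():
            return ""
        facts = self.path.read_text(encoding="utf-8").strip()
        if not facts:
            return ""
        return f"<case_facts>\n{facts}\n</case_facts>"
```

Because it is just a text file, any other LLM can be pointed at the same path, which is the "agnostic" benefit from the first bullet.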

Regarding the to-do list, from my POV it's just a basic principle of work segmentation and accuracy, with some traceability. You do better when you can divide your work into chunks that can be followed individually than with one huge block of work. It can also be used within the "ralph wiggum" loop pattern, which may help the LLM keep a goal and iterate until that goal is complete. There are a few articles explaining the concept if that interests you.

Hope it helps a bit!
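
A toy sketch of the to-do list plus loop idea, under loud assumptions: `run_until_done` and `work_on` are hypothetical names, and `work_on` stands in for whatever LLM or agent call attempts one task and reports whether it is finished. This is not how any particular tool implements it:

```python
def run_until_done(todo, work_on, max_iterations=20):
    """Retry the first pending task until it is done, then move to the next.

    todo: list of task descriptions, in order.
    work_on: hypothetical callable, task -> True when the task is complete.
    Returns (done, pending) so progress stays traceable.
    """
    pending = list(todo)
    done = []
    for _ in range(max_iterations):
        if not pending:
            break
        # Only the first pending task is attempted on each iteration.
        if work_on(pending[0]):
            done.append(pending.pop(0))
    return done, pending
```

The `max_iterations` cap is there so a task that never completes cannot loop forever; whatever is left in `pending` tells you where it got stuck.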


Great article.

Main lesson seems to be, it's good to put icons on the standard, most frequently used actions. And make them colorful. That helps the eye find them.

Edit: also consistency and legibility. So basically "don't design it so it's bad!"

