Hacker News | highfrequency's comments

Why UK?

Too many cameras and not enough food.

> idea thief. Takes credit for others’ work.

Based on what?


GP is a serial troll whom we've banned hundreds of times, so the answer to your question is almost certainly nothing.

I think this is a hard question if you ask people to start providing proof for things like this. A lot of such opinions are usually baked into individuals' personal experience and perception. Nevertheless one has to feel very strongly to share such a take here in this manner unless they are gaining something from it.

This perspective is only relevant if we assume nobody on HN ever posts maliciously. Some of the circles here are small, incestuous, and probably have some resentment. Other times, there's clear botting - very hard to talk against Elon or his companies without a load of down votes.

Needless to say, the OP could be right but they could be right without proof. Or proof would out them. Or it's malicious posting. Don't take anything on the internet too seriously, even in such sanctimonious spaces as HN.


I always thought it was people who were married to their book, or just fanboys. Some of these accounts that jump in to defend are quite old.

> Nevertheless one has to feel very strongly to share such a take here in this manner unless they are gaining something from it.

Do they really? What does it cost them if they're wrong?


What? If I call someone a thief, I should be able to point to something they stole.

There is a big difference between how one acts in a court and in real life. The original statement could either be slander (hard to know what they stand to gain from it) or it's their bitter experience/perception that they feel strongly about and are sharing on a platform like this.

sandeepkd is an idea thief. Also, he is a short guy but walks with high shoes so people will think he is tall.

In a court of law sometimes. In the real world some facts have verifiable proof but the majority have little if anything that can be shared publicly or exists.

If someone called my friend a thief and couldn’t even point to what they stole, I’d mercilessly judge them even outside a court of law. That’s a serious accusation. Going off vibes is totally inappropriate.

I wish it were ok for companies to bluntly say: “we made these decisions for competitive reasons, but the public backlash outweighed that so we are reversing course.”

I think it’s normal and morally fine for companies to want to protect their leadership position. I find the process of creating narratives that justify these decisions as something chosen for the good of others a little tedious.


Unless I'm missing something, this argument seems to apply only to the original pretraining era (eg GPT 1-4). The post-training and reinforcement learning paradigms are clearly doing variation, evaluation and selective retention no?


RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

i.e., evaluation and retention, yes. Variation or "planning", no.

That is not to say you cannot use LLMs. AlphaEvolve does exactly that. It uses a simple external evolutionary planner, though. The overarching point he's making is that our planner is still "dumb" and we need to work on it.

When you iteratively guide an LLM in claude code, you are the external planner. That also works.


> RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

Seems clearly false. Pretraining finds the mean/mode of the data distribution. RL can easily generate many samples around that mode, evaluate them on an external source of truth (eg compile the code and run it) and then selectively train on the good samples. This clearly can go beyond the initial data distribution.
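To make the sample, evaluate, retain loop concrete, here is a minimal toy sketch. It is not GRPO or any lab's actual training code. The three-outcome `policy`, the `verifier`, and the mixing rate `lr` are all hypothetical stand-ins, and the "training" step is just a move toward the empirical distribution of verified samples.

```python
import random

def rl_round(policy, verifier, n_samples=64, lr=0.5, rng=random):
    """Sample from the policy, keep verified samples, and shift toward them."""
    outputs = list(policy)
    samples = rng.choices(outputs, weights=[policy[o] for o in outputs], k=n_samples)
    passed = [s for s in samples if verifier(s)]
    if not passed:
        return dict(policy)  # nothing verified, so there is nothing to train on
    # Empirical distribution over the samples that passed verification.
    target = {o: passed.count(o) / len(passed) for o in outputs}
    # Move the policy part of the way toward the verified samples.
    return {o: (1 - lr) * policy[o] + lr * target[o] for o in outputs}

# Hypothetical starting policy: the correct program is a 5% tail outcome.
policy = {"buggy_a": 0.60, "buggy_b": 0.35, "correct": 0.05}
verifier = lambda out: out == "correct"  # stands in for "compile the code and run it"

rng = random.Random(0)
for step in range(5):
    policy = rl_round(policy, verifier, rng=rng)
    print(step, round(policy["correct"], 3))
```

In this toy, the mass on the verified output can only rise or stay flat from round to round, and a round in which no sample passes leaves the policy unchanged.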


by base distribution, I meant the base model's output distribution


The model’s distribution will certainly change from the base model’s output distribution during reinforcement learning, shifting toward outputs that score well on an external evaluation. This is very different from mode-seeking. Am I missing something?


Mode-seeking is describing the way in which the distribution changes. RL is capable of picking out slightly lower probability trajectories and moving them toward the top of the distribution. However, exploration is fundamentally limited by the base policy itself. If a trajectory has near-zero probability under the original model, RLVR is unlikely to discover it because it must first be sampled before it can be rewarded. External search/planning methods such as MCTS or evolutionary search are useful precisely because they can explore candidate trajectories beyond what the policy would ordinarily generate. This is also not theoretical: GRPO-style methods have been shown to mostly improve `maj@k` and `pass@1` evals while not improving `pass@k` as much, especially for high k, meaning they are mostly sharpening the top of the distribution.

I'm not saying this makes it useless - it clearly helps for math and coding tasks. But the ceiling exists and that's what the original tweet was referring to. AlphaEvolve also shows what lies beyond the ceiling, although their planner was rudimentary.
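For anyone unfamiliar with the `pass@1` versus `pass@k` distinction, a rough sketch follows. `pass_at_k` is the standard unbiased estimator from the Codex paper (n samples, c correct). The per-problem success rates in `base` and `tuned` are made-up numbers chosen only to illustrate how a sharpened model can win at k=1 and lose at high k. They are not results from any paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct (needs k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def expected_pass_at_k(p, k):
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

# Made-up per-sample success rates on two problems (illustration only).
base = [0.05, 0.01]   # broad but unreliable
tuned = [0.80, 0.00]  # sharp on one problem, lost the other entirely

def mean_pass(ps, k):
    return sum(expected_pass_at_k(p, k) for p in ps) / len(ps)

for k in (1, 256):
    print(k, round(mean_pass(base, k), 3), round(mean_pass(tuned, k), 3))
```

With these assumed rates the tuned model is far ahead at k=1, while the base model is ahead at k=256 because it still has nonzero mass on the second problem.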


Sure, but I'd say that moving desirable trajectories from very low probability to high probability is characteristic of genuine human learning and discovery. Technically, quantum gravity, a bestselling novel, or a yet undiscovered proof of the Riemann Hypothesis is "in my distribution", but when we are talking about a long chain of unlikely token completions (with multiplicative probabilities), whether that trajectory lives in the tail of the distribution vs. in the mode makes all the difference.

Would you agree that it is a matter of degrees rather than a qualitative distinction? There seems to be a broad misconception in Sutton and others that output quality cannot exceed that of the base internet distribution; my point is that RL allows you to easily produce an output distribution that is better than whatever data you trained on according to some evaluation criteria. There are no clear theoretical limits on how much better it can get, rather there are many people asserting guesses that there is an upper bound and it lives below "human creativity." I just haven't seen any solid theoretical argument, and the empirical evidence has so far shown continual improvement.

Also, I would be keen to look at any sources you have of pass@k not improving much during GRPO.


I said slightly lower, I meant it. It's virtually impossible to sample a trajectory that is really really low probability (say, by smoothing the distribution before sampling) without incurring crazy amounts of noise. And only once you sample it can you reward it and do the update.

Again, no one is saying models can't improve beyond the internet, i.e. the data distribution! They clearly can. The claim is that RL without real exploration cannot exceed the base model's distribution, which by virtue of SGD _does_ generalize.

And also, it doesn't mean it's not useful. Improving sample efficiency and making something that happens 1 in 15 times happen 1 in 1.2 times is insanely useful and is what has enabled the kind of coding agents we have today.

Sutton, especially, I doubt has a misconception about this :)

> pass@k

Yeah, AFK now. But it's a well-researched thing. You can look for more, but here's one off the top of my head: https://openreview.net/forum?id=4OsgYD7em5 The original DeepSeek paper also had the result, i.e. the paper that first got famous for using GRPO as a method that works for LLMs. A side result in one of these papers, I forget which one, is that the base model converges in performance with the RL'd one at high k.


Thanks, I appreciate the discussion. The paper you sent is interesting. I agree it looks like for moderate values of K (on the order of 100-1000), RL models actually look a little worse at pass@k than their base models.

So perhaps the right framing of your/Sutton's claim is: RL can upweight low-probability (p) but correct outputs, but there is a limit to how small p can be, and it is on the order of 1 in 100 or 1 in 1000. Implicitly there must be some crossover point where you would call this discovery/creativity if it works for sufficiently small p, right? Eg if RL can upweight a correct but 1 in a trillion output to 1 in 5, that's got to count as discovery given that all possible sequences are technically "in the distribution"?

In practice, it does seem like that kind of progress is happening. For example with the recent Erdos solution [0], I would wager that GPT 4's hit rate on this would have been functionally 0 (certainly less than 1 in a thousand). Curious to hear whether you'd still say this is mode-seeking within a base distribution, or if not, what the right explanation is if not iterative RL.

I'd also highlight that the paper you linked with the pass@k equivalence doesn't technically address the question of how small p can be before RL upweighting breaks down - all of the example problems were easy enough that the base model had a decent hit rate with 128 tries.

[0] https://openai.com/index/model-disproves-discrete-geometry-c...


Pass@128 is a lot. They were not easy.

> Discovery / creativity

I'm absolutely uninterested in the semantic discussions of what is a real discovery, what is creativity, what is intelligence, etc. I simply don't care. If it's useful, great, use it. If it's not, great, don't.

> How small p can be

All that depends on your sampling procedure. If you intentionally smooth the distribution out you can sample the smallest thing, but you pay for it with noise. Taken to an extreme, this is the monkeys typing on the keyboard argument.

It's a mathematical fact that RL cannot improve things it doesn't sample. In any learned distribution you pay a heavy cost by sampling far away from the mode. Most RL algos sample rollouts maybe with some smoothing but that's it. This is why external planners are necessary in order to sample something effectively un-sampleable in the base distribution. Simple example: tool use!

Sutton and everyone are simply calling for a focus on improving these external planners in the same way, as they also enable much better "continual" learning and so on.
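A quick toy illustration of the smoothing tradeoff, under assumed numbers: the logits below (one dominant mode, one rare-but-correct continuation, 1000 junk continuations) are hypothetical, and temperature scaling stands in for whatever smoothing a real sampler uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: index 0 is the mode, index 1 is rare but correct,
# and the remaining 1000 entries are junk continuations.
LOGITS = [10.0, 2.0] + [0.0] * 1000

def rare_and_junk(temperature):
    """Return (probability of the rare correct option, total junk probability)."""
    probs = softmax(LOGITS, temperature)
    return probs[1], sum(probs[2:])

for t in (1.0, 2.0, 5.0):
    rare, junk = rare_and_junk(t)
    print(f"T={t}: rare={rare:.5f} junk={junk:.3f}")
```

In this setup, raising the temperature does make the rare option more likely to be drawn, but the junk mass grows much faster, which is the noise cost described above.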

> Erdos solution

The RL was what enabled such a huge trajectory to ever become efficiently sampleable in our lifetimes probably. You can do many useful things like this and more purely with the base model distribution.

In fact, doing RL on user chats and so on, especially from pair coding sessions, is improving these models' coding abilities by a lot, making them even more reliable for SWE. In this regard, mode-seeking is a win.

> All sequences are technically in distribution

If it was truly improving 1 in a million things systematically, then you wouldn't see the base model getting the same results given many samples. Albeit they are not Erdos problems.

Could it be that at 1T scale, and for difficult problems specifically, GRPO somehow filters through the noise and picks out the 1 in a trillion? Extremely unlikely (you have your expected rollouts required to sample that, and then you have your sparse reward signal and no credit assignment on top of that...). But of course, only 2 companies in the world can do experiments with it, so there could be some unknown effect the rest of the world has not seen. Barring that, no.


The transcript does seem to overlook post-training steps like Reinforcement Learning with Verifiable Rewards (RLVR) (but I certainly won't claim that Rich Sutton is unaware of such things; RLVR has a very narrow set of evaluation approaches).

I wonder if this is a precursor to Keen Tech leaning into David Silver's Ineffable Intelligence approach.


This was exactly what I was thinking of. RLVR is the secret sauce behind o3 and its many successors.

It's the secret sauce behind why the current models are so great at coding and soon to be unbeatable at math.

LLMs can pose many questions and, if they are easily verifiable, fine tune very heavily. A lot of the world models discussion will inevitably lean into simulations as verification.


I'll admit that I miss having access to the ChatGPT 4.5 "absolutely gigantic model" with enough tuning to make it sane and useful. The RLVR models are superb for actual tasks in those RLVR domains, but that fine tuned view of the world as a verifiable problem to solve makes them feel worse for touchy feely stuff. Even for medical consultation and diagnosis, the RLVR models' urge to reach a conclusion is often a liability.


Fable 5/Mythos 5 is the next "big chungus LLM".

It's RLVR tuned, but not to the ChatGPT level of brain damage, and it's still backed by a fuck off huge pool of model weights - which matters for what you call "touchy feely stuff".


In particular, good to marinate beyond the (typically 6 month) lockup period during which investors cannot sell stock


Makes a lot of sense that Musk should do the parts of the AI stack that look more like manufacturing/regulatory bottlenecks, and rent out the compute to research-focused AI labs. Does anyone know the full accounting of how much it cost to build Colossus (plus ongoing opex) vs. the revenue it's generating now?


> now they’ve thrown down the gauntlet directly challenging frontier labs by training their own model (“much larger” than Kimi 2.5’s 1T parameters) from scratch.

To clarify, the model Composer 2.5 announced in this post is not that; it uses Kimi 2.5 as a strong starting point. This is not to discount Cursor's work or future ambitions, but one of the most striking things about the last 6 months is that multiple open-source models/labs are now within striking distance of the frontier closed-source labs.

See eg Kimi 2.6 benchmarks: https://www.kimi.com/blog/kimi-k2-6


> LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it.


What’s wrong with the finance team (vibe) coding a janky prototype for planning?


Because the next email is going to be "we demoed the app to our CEO and he loved it and he wants it in production". I have never seen the team back down from their ideas.

And also, I have been in similar-sounding scenarios multiple times. They talk big when everything is going smoothly and nothing is on the line. The day shit hits the fan, they will furiously message me on Teams and insist that I support them in finding the issues. So far it has mostly been about shitty design choices. This is at a whole new level. They want to vibe code an app which will be used to plan and guide the company's direction for the next year.


Prototypes/MVPs often become the production version


And their spreadsheets don't?


This seems to be the default path which is encouraged/suggested lately, only happy path until you acquire customers


I think it’s fairly normal at least in my career - rush to ship something, lots of “we’ll polish this later,” two years go by, get called into vp/cto/whoever’s office when the debt comes calling like “what the fuck why is this like this???” and I have to say “that ‘later’ we decided is now I guess”


The script I have often seen play out is that the one doing the MVP gets rewarded and moves on with a promotion. The weight of completing and stabilizing the MVP falls on someone else who is not vocal enough in terms of influence. Ironically, the flashy MVP does not include monitoring, logging, security, edge cases, CI/CD, DR, or scaling, which is why vibe coding is getting so popular and everyone seems to be under the impression that engineers are not needed anymore.


...and are often still in place when the "magic guy who built it" left a long time ago...


Why do you prefer the laptop to be thicker and heavier?


Nobody said that.

MacBooks of that period made compromises for a useless gain in thinness. You can't with a straight face say that the butterfly mechanism was a good tradeoff for 0.3 mm.


I don't want to think about how long I used that MacBook where the keycaps would come off with my fingers as I typed, the switches were that broken.

It's like thinking about how much time I lost using a 2010 10" Atom netbook for development as a poor student where I'd close down all apps to watch a youtube video, and "rails server" took five minutes to boot on hello world.


That's a false dichotomy; there are plenty of keyboards that don't require recalls due to issues like the butterfly ones but also don't have the issues you're describing.


Luckily there are two lines: the Air and the Pro.

The issue people had was that from 2016-2019, the MacBook Pros sacrificed a lot of usability for thinness, when that should only happen for the Airs.


I'd be fine with a thinner and lighter laptop if it was without compromises.

But having a shitty keyboard, losing the HDMI port, wasn't worth it.


Right? What was the point of a laptop with no "ugly ports" if everyone instead needed to carry around a stupid dongle to hang off it?


I think the preference is to have a battery that can run a CPU that's compiling, AI-ing, or rendering for an entire day (16+ hours) without having to worry about where an outlet is, being tethered to a wall, or being thermally throttled. Right now that's a volume tradeoff. If there was something that ran as fast for as long and was MacBook Air (or the last Intel generation) thin, I don't think anyone would complain.


My old thinkpad was thicker but not heavier. Way more ports, didn't need dongles.

