Pardon my ignorance, but couldn't this also be an act of anthropomorphisation on our part as humans?
If an LLM generates tokens after "What do you call someone who studies the stars?", doesn't that mean the existing tokens in the prompt have already shifted the probability of the next token toward "an", because training data places it close to those earlier tokens? The token "an" then skews the probability of the following token further toward "astronomer". Rinse and repeat.
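To make the "rinse and repeat" picture concrete, here is a minimal sketch of that loop. Everything in it is an assumption for illustration: the `TOY_MODEL` table, its probabilities, and the two-token context window are made up and do not come from any real model.

```python
# Toy next-token table keyed on the last two tokens of the context.
# Hypothetical numbers, invented purely for illustration.
TOY_MODEL = {
    ("stars", "?"): {"an": 0.7, "a": 0.3},
    ("?", "an"): {"astronomer": 0.9, "astrologer": 0.1},
    ("?", "a"): {"stargazer": 0.6, "scientist": 0.4},
}


def next_token(context):
    """Greedily pick the most probable token given the last two tokens."""
    dist = TOY_MODEL.get(tuple(context[-2:]), {})
    return max(dist, key=dist.get) if dist else None


def generate(prompt_tokens, max_new=5):
    """Append one token at a time, feeding each choice back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        tok = next_token(tokens)
        if tok is None:  # no entry for this context: stop
            break
        tokens.append(tok)
    return tokens[len(prompt_tokens):]


print(generate(["studies", "the", "stars", "?"]))
```

Note that the table only records that "an" is likely at that position. It is silent on how a real network would arrive at that number.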
I think the question is: by what mechanism does it adjust up the probability of the token "an"? Of course, the reason it has learned to do this is that it saw this in training data. But it needs to learn circuits which actually perform that adjustment.
In principle, you could imagine the model trying to memorize a massive number of cases. But that becomes very hard! (And memorization makes testable predictions. For example, would it fail to predict "an" if I asked about an astronomer in a more indirect way?)
But the good news is we no longer need to speculate about things like this. We can just look at the mechanisms! We didn't publish an attribution graph for this astronomer example, but I've looked at it, and there is an astronomer feature that drives "an".
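As a caricature of the difference between memorizing cases and having an astronomer feature that drives "an", here is a small sketch. It is not the mechanism found in the model: the `MEMORIZED` table, the keyword `CUES`, the vowel rule in `article_for`, and the indirect prompt are all hypothetical stand-ins.

```python
# Strategy 1: pure memorization, keyed on the exact prompt string.
MEMORIZED = {
    "What do you call someone who studies the stars?": "an",
}


def memorizer(prompt):
    """Return the memorized next token, or None if this exact prompt is unseen."""
    return MEMORIZED.get(prompt)


# Strategy 2: decide on the upcoming word first, then pick the article for it.
# Keyword cues are a crude stand-in for a learned "astronomer" feature.
CUES = {"astronomer": ("stars", "telescope", "galaxies")}


def article_for(word):
    """Simplified English rule: 'an' before a vowel letter, otherwise 'a'."""
    return "an" if word[0].lower() in "aeiou" else "a"


def planner(prompt):
    """Pick the article implied by whichever target word the cues point to."""
    text = prompt.lower()
    for word, cues in CUES.items():
        if any(cue in text for cue in cues):
            return article_for(word)
    return None


direct = "What do you call someone who studies the stars?"
indirect = "Which scientist spends nights at a telescope?"

for prompt in (direct, indirect):
    print(prompt, "| memorizer:", memorizer(prompt), "| planner:", planner(prompt))
```

The memorizer only answers the direct phrasing, while the planner handles both, which is the kind of behavioral difference the indirect-question test is meant to expose.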
We did publish a more sophisticated "poetry planning" example in our paper, along with pretty rigorous intervention experiments validating it. The poetry planning is actually much more impressive planning than this! I'd encourage you to read the example (and even interact with the graphs to verify what we say!). https://transformer-circuits.pub/2025/attribution-graphs/bio...
One question you might ask is why the model learns this "planning" strategy rather than just trying to memorize lots of cases. I think the answer is that, at some point, a circuit that anticipates the next word, or the word at the end of the next line, actually becomes simpler and easier to learn than memorizing tens of thousands of disparate cases.