> However, the model can be reluctant to answer questions based on an individual sentence in a document, especially if that sentence has been injected or is out of place
> We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”

It kind of feels like they're telling us that we're using the model wrong, and that by prefilling the Assistant's turn with the first part of the retrieval completion, the model will outperform a plain request for single-sentence retrieval.
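For what it's worth, here's roughly what that prefill trick looks like against the Anthropic messages API. This is a minimal sketch of my reading of the technique, not their actual eval harness; the file name and question are stand-ins:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = open("essays.txt").read()  # the long-context haystack

response = client.messages.create(
    model="claude-2.1",  # the model from the needle-in-a-haystack writeup
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": f"{long_context}\n\nWhat is the most fun thing to do in San Francisco?",
        },
        {
            # Prefilling the Assistant's turn: generation continues from this
            # prefix, steering the model into retrieval mode instead of a refusal.
            "role": "assistant",
            "content": "Here is the most relevant sentence in the context:",
        },
    ],
)
print(response.content[0].text)
```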
This needs to be shown. For example, asking for something that is clearly in the training data (like Paul Graham's CV) is certainly not a proper way to test context recall.
That is the point. A long book, then checking the long context to see if it remembers the first sentence. Or do you mean that, as a test, it is better to randomly place the "needle"?
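To make the "randomly place the needle" version concrete, a harness could sweep the needle's position rather than pinning it to the start. A rough sketch (the needle text, filler source, and `ask_model` call are illustrative assumptions, not the original evaluation code):

```python
import random

NEEDLE = "The most fun thing to do in San Francisco is eating a sandwich in Dolores Park."
QUESTION = "What is the most fun thing to do in San Francisco?"

def build_haystack(filler_sentences: list[str], depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    i = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:i] + [NEEDLE] + filler_sentences[i:])

filler = open("essays.txt").read().split(". ")  # e.g. Paul Graham essays as filler

# Sweep needle depth instead of always testing the first sentence.
for depth in [0.0, 0.25, 0.5, 0.75, 1.0]:
    prompt = build_haystack(filler, depth) + "\n\n" + QUESTION
    answer = ask_model(prompt)  # hypothetical call into whatever client you use
    print(depth, "Dolores Park" in answer)
```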
It's much more intuitive if you gritted your teeth and your wallet and played extensively with the pre-ChatGPT models: in a sentence, it's the stochastic parrot nature of it. It is statistical autocomplete at the end of the day, even though that's usually deployed in a sneering tone.
You can do yourself massive favors by setting up the conversation so that what you need logically flows from the context. In the other case, they're just asking "what's the most fun thing to do in San Francisco" after throwing a bunch of Paul Graham essays at it. It's hard to explain, but it's sort of intuitive that a bunch of seemingly unrelated sections of text, followed by simply "what is the most fun thing to do in San Francisco" (a very subjective and vague question) in the context of a "conversation", would often not result in a precise lookup of a one-off sentence from earlier.
There's a sense of empathy that can kind of play into it. For example, if I were asked to read 250 pages of Paul Graham essays, then asked what the most fun thing to do in San Francisco is, I wouldn't immediately assume that meant I should check what Paul Graham says the most fun thing to do in San Francisco is.
What was the point of moving away from the base model? I can't stop asking this question. Conversational formatting is achievable with careful prompting and a bit of good old-fashioned heuristic post-processing, and it was easier to achieve consistent results before RLHF took off. Now we still have to do a bunch of prompt hacking to get the results we want[1], but it's more complicated and the performance of the model has degraded significantly[2]. All the cargo culting toward agentic chatbots and away from language prediction engines might please the marketing and investor relations departments, but it's only setting us back in the long run.
Are you asking why use RLHF? It's a way to improve step-by-step reasoning. They train a reward model to judge the problem-solving process step by step, instead of just training the reward model on the final outcome. They then tune the model against this reward model. It's been shown to greatly improve performance on reasoning.
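As a rough illustration of the reward-model half of that pipeline, the standard pairwise preference loss looks something like this. This is a toy PyTorch sketch under the commonly published recipe (a scalar head on a transformer, Bradley-Terry-style loss), not any lab's actual training code:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    completion above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# In practice r_* come from a scalar head on the transformer's last hidden
# state; random tensors stand in for reward_head(chosen) / reward_head(rejected).
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)

loss = reward_model_loss(r_chosen, r_rejected)
loss.backward()  # the policy model is later tuned (e.g. with PPO) against this reward
```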
The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align preferences/steer/improve performance.
I recommend reading the articles I linked as what you're saying is not true for most use cases. RLHF as implemented by OpenAI improves performance for one particular use case: chatbots. For every other use case, it degrades performance. The priority for OpenAI right now is to favor perceived performance in turn-based conversation over actual predictive performance, which unfortunately hinders my own usage of an otherwise spectacular base model.
Not for GPT-4, unfortunately. Although, I'm certainly happy that Davinci et al remain available. I just wish they'd committed harder to what they had with code-davinci-002.