> However, the model can be reluctant to answer questions based on an individual sentence in a document, especially if that sentence has been injected or is out of place
> We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”

It kind of feels like they're telling us that we're using the model wrong, and that by prefilling the Assistant's turn with the first part of the retrieval completion, the model will outperform a plain request for single-sentence retrieval.
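For what it's worth, here's roughly what that prefill trick looks like against the Anthropic messages API. This is a minimal sketch of my reading of the technique, not their actual eval harness; the file name and question are stand-ins:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = open("essays.txt").read()  # the long-context haystack

response = client.messages.create(
    model="claude-2.1",  # the model from the needle-in-a-haystack writeup
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": f"{long_context}\n\nWhat is the most fun thing to do in San Francisco?",
        },
        {
            # Prefilling the Assistant's turn: generation continues from this
            # prefix, steering the model into retrieval mode instead of a refusal.
            "role": "assistant",
            "content": "Here is the most relevant sentence in the context:",
        },
    ],
)
print(response.content[0].text)
```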
This needs to be shown. For example, asking for something that is clearly in the training data (like Paul Graham's CV) is certainly not a proper way to test context recall.
That is the point. A long book, then checking the long context to see if it remembers the first sentence. Or do you mean that, as a test, it is better to randomly place the "needle"?
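To make the "randomly place the needle" version concrete, a harness could sweep the needle's position rather than pinning it to the start. A rough sketch (the needle text, filler source, and `ask_model` call are illustrative assumptions, not the original evaluation code):

```python
import random

NEEDLE = "The most fun thing to do in San Francisco is eating a sandwich in Dolores Park."
QUESTION = "What is the most fun thing to do in San Francisco?"

def build_haystack(filler_sentences: list[str], depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    i = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:i] + [NEEDLE] + filler_sentences[i:])

filler = open("essays.txt").read().split(". ")  # e.g. Paul Graham essays as filler

# Sweep needle depth instead of always testing the first sentence.
for depth in [0.0, 0.25, 0.5, 0.75, 1.0]:
    prompt = build_haystack(filler, depth) + "\n\n" + QUESTION
    answer = ask_model(prompt)  # hypothetical call into whatever client you use
    print(depth, "Dolores Park" in answer)
```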
It's much more intuitive if you gritted your teeth and your wallet and played extensively with the pre-ChatGPT models: in a sentence, it's the stochastic parrot nature of it. It is statistical autocomplete at the end of the day, even though that's usually deployed in a sneering tone.
You can do yourself massive favors by setting up the conversation so that what you need logically flows from the context. In the other case, they're just asking "what's the most fun thing to do in San Francisco" after throwing a bunch of Paul Graham essays at it. It's hard to explain, but it's sort of intuitive that a bunch of seemingly unrelated sections of text, followed by simply "what is the most fun thing to do in San Francisco" (a very subjective and vague question) in the context of a "conversation", would often not result in a precise lookup of a one-off sentence from earlier.
There's a sense of empathy that can kind of play into it. For example, if I were asked to read 250 pages of Paul Graham essays, then asked what the most fun thing to do in San Francisco is, I wouldn't immediately assume that meant I should check what Paul Graham says the most fun thing to do in San Francisco is.
What was the point of moving away from the base model? I can't stop asking this question. Conversational formatting is achievable with careful prompting and a bit of good old-fashioned heuristic post-processing, and it was easier to achieve consistent results before RLHF took off. Now we still have to do a bunch of prompt hacking to get the results we want[1], but it's more complicated and the performance of the model has degraded significantly[2]. All the cargo culting toward agentic chatbots and away from language prediction engines might please the marketing and investor relations departments, but it's only setting us back in the long run.
Are you asking why use RLHF? It's a way to improve step-by-step reasoning. They train a reward model to judge the problem-solving process step by step, instead of just training the reward model on the final outcome. They then tune the model against this reward model. It's been shown to greatly improve performance on reasoning.
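As a rough illustration of the reward-model half of that pipeline, the standard pairwise preference loss looks something like this. This is a toy PyTorch sketch under the commonly published recipe (a scalar head on a transformer, Bradley-Terry-style loss), not any lab's actual training code:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    completion above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# In practice r_* come from a scalar head on the transformer's last hidden
# state; random tensors stand in for reward_head(chosen) / reward_head(rejected).
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)

loss = reward_model_loss(r_chosen, r_rejected)
loss.backward()  # the policy model is later tuned (e.g. with PPO) against this reward
```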
The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align preferences/steer/improve performance.
I recommend reading the articles I linked as what you're saying is not true for most use cases. RLHF as implemented by OpenAI improves performance for one particular use case: chatbots. For every other use case, it degrades performance. The priority for OpenAI right now is to favor perceived performance in turn-based conversation over actual predictive performance, which unfortunately hinders my own usage of an otherwise spectacular base model.
Not for GPT-4, unfortunately. Although, I'm certainly happy that Davinci et al remain available. I just wish they'd committed harder to what they had with code-davinci-002.