OpenAI have some degree of versioning with the models used by their APIs, but it seems they are perhaps still updating (fine-tuning) models without changing the model name/version. For ChatGPT itself (not the APIs), many people have reported recent regressions in capability, so it seems the model is being changed there too.
As people start to use these APIs in production, there needs to be stricter version control, especially given how complex it is (impossible, unless you are only using a fixed set of prompts) for anyone to test for backwards compatibility. Maybe something like Ubuntu's stable long-term releases vs. bleeding-edge ones would work: have some models that are guaranteed not to change for a specified amount of time, and others that are periodically updated for people who want cutting-edge behavior and care less about backwards compatibility.
Although there are APIs used in production whose output changes/evolves over time.
First one that comes to mind is Google Translate. We spend six figures annually on the Google Translate API, and recently we went back and checked whether Google Translate is improving/changing its translations over time, and found that it indeed is (presumably as they improve their internal models, which aren't versioned and aren't documented in a changelog anywhere). The majority of translations were different for the same content today compared to six months ago.
I don't particularly agree with this approach. Speaking as a power user of the Google Translate API, it would be nice to be able to pin to a specific version/model and then manually upgrade versions (with a changelog to understand what's changing under the hood).
At this point the changelog is surely just stuff along the lines of "retrained with more data" and "slightly tweaked model architecture in a random way that improved performance".
And out of curiosity: as someone with a lot of expertise and money on the line, how would you compare Google Translate with LLMs? And also smaller self-hosted models with bigger ones that require API access like OpenAI? Do they perform better or worse and are they cheaper or more expensive?
I wouldn't assume that. BERT uses the encoder side of the original transformer architecture, which takes advantage of the entire input sequence being available, so you have the forward context as well as the backward context of a word to assist with translation.
GPT uses the decoder-only transformer architecture with masked (causal) attention, so it only relies on backward context, which is what allows it to predict word by word. It's not a better architecture, just one adapted for different requirements.
If you are able to use the forward context, then I'd assume that a BERT-like architecture would do better since it has more information available.
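For intuition, here's a minimal sketch of the difference (plain PyTorch, just illustrating the two attention patterns, not either model's actual code):

```python
# Minimal sketch of the two attention patterns, not real BERT/GPT code.
import torch

seq_len = 5

# Encoder-style (BERT): every position can attend to every other position,
# so both backward (left) and forward (right) context are available.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (GPT): lower-triangular causal mask, so each position can only
# attend to itself and earlier positions, which enables word-by-word generation.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())
```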
* use some version of GPT, to preprocess the prompts
* send preprocessed prompt to “real” model for inference
* post process result to filter out undesired output
Then they can honestly say they haven't changed the version of the model, without telling you that they have probably changed a lot of the pipeline that processes your prompts and delivers a result to you.
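A rough sketch of the kind of wrapper being described, assuming the (pre-1.0) openai Python client; the preprocess/postprocess steps are invented placeholders, not anything OpenAI has confirmed:

```python
# Hypothetical wrapper pipeline; the filtering steps are invented placeholders.
import openai

def preprocess(prompt: str) -> str:
    # e.g. prepend a hidden policy, rewrite or truncate the user prompt
    return "Follow the current content policy.\n" + prompt

def postprocess(text: str) -> str:
    # e.g. strip disallowed content, trim length, reformat
    return text

def answer(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4-0314",  # the "version" that can honestly be called unchanged
        messages=[{"role": "user", "content": preprocess(prompt)}],
    )
    return postprocess(response["choices"][0]["message"]["content"])
```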
In the last few years I've noticed lying by omission has become the new fun corporate/gen-z internet trend (see also: overfunding/gofundme fraud). Like priests and fund managers, their product is a black box, and there's a lot of mischief you can get into when you're entrusted with one of those. They play a fucked-up game of Rumpelstiltskin, where they mislead by default and only admit the truth if you can guess the right question to ask.
You're on the right track, and I too think that's what their actual pipeline looks like, but you're missing a step. I think there's another step where they effectively alter the output of production models by hot-swapping different LoRAs (or whatever) onto them.
This lets them plausibly claim they haven't changed the version of the model, because they haven't messed with the model. They messed with middleware, which nobody knows enough about to press them on. You ask them if anything changed with the model/API, they say no, and leave you to think you're going crazy because shit's just not working like it was last week.
Nobody's asking them about changes to middleware though, which genuinely surprises me. I am never the smartest person in the room-- only the most skeptical.
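Purely to show how cheap that kind of middleware change would be, here's a rough sketch of hot-swapping LoRA adapters on an open model with Hugging Face transformers + peft; the adapter paths are hypothetical, and none of this is claimed to reflect OpenAI's actual stack:

```python
# Illustration of the mechanism only; the adapter paths are hypothetical.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach one adapter, then register a second one alongside it.
model = PeftModel.from_pretrained(base, "adapters/helpful-v1", adapter_name="helpful-v1")
model.load_adapter("adapters/helpful-v2", adapter_name="helpful-v2")

# Swapping behavior is a one-liner; the base model's weights and name never change.
model.set_adapter("helpful-v2")
```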
Regarding your comment on updating models, are you saying you think they're updating the "pinned" models, e.g. gpt-4-0314?
Otherwise, I think effectively they already have what you're describing with LTS models as pinned versions, and the unversioned model is effectively the "nightly."
From what I've seen in the materials, it seems like if you pay for dedicated compute, you can also have some control over your model versions.
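For reference, that's the difference between calling a date-pinned snapshot and the floating alias (the model names below are just the documented examples from around the GPT-4 launch):

```python
import openai

# "LTS"-style: a date-pinned snapshot whose behavior is supposed to stay frozen.
pinned = openai.ChatCompletion.create(
    model="gpt-4-0314",
    messages=[{"role": "user", "content": "ping"}],
)

# "Nightly"-style: a floating alias that may silently move to newer snapshots.
floating = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
)
```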
Sure, but they could be playing with semantics a bit as well. When they say "models" they might just mean the LLM that it's all based on. But there's a lot more going on in the pipeline to turn that into a consumer-facing service like ChatGPT. They might have changed any combination of the following:
1) Fine Tuning
2) Embedding
3) The initializing prompt
4) Filtering a prompt prior to ingestion & tokenization of the prompt
5) Filtering the output from the application after it has generated a response.
The statement "we have no unannounced changes to the models" can be true while still substantially changing functionality & response quality through any of the 5 above areas, and probably some I missed.
I'd be pleasantly surprised if the model has actually remained the same. Every time I see speedups in its generation speed, I assume they've distilled the model further. The outputs subjectively also feel weaker. Surely someone has compared the outputs from the beginning and now?
Like, 4 months ago people were saying the Singularity had pretty much already happened and everything was going to change/the world was over, but here we are now, dealing with hard and very boring problems around versioning/hardening already somewhat counter-intuitive and highly engineered prompts in order to hopefully eke out a single piece of consistent functionality, maybe.
When a newer LLM comes out (e.g. GPT-3.5 to GPT-4), your old prompts become obsolete. How are you solving this problem in your company? Are there companies working on solving this problem?
I have to prompt-engineer a lot more with 3.5 than with 4. The way I ask questions and convey what I want tends to be much more structured with 3.5, in a less natural way than I can get away with in 4. Hopefully 4 would be even better at answering a structured prompt like that, but maybe not: for quick questions with a short answer, 3.5 will sometimes give a simpler answer than 4, and 3.5 is correct. 4 isn't necessarily wrong, but it reads into the question a bit more, so the answer is less succinct, with more caveats and nuances explained, etc. In cases like this, even though both give a correct answer, the one from 4 may be undesirable: you don't want to have to read through an extra paragraph to pick out the answer to your question. There's more friction.
Of course the above scenario is easily solved: change your prompt to include "Be brief". But that's exactly the argument-- the old prompt is at least in part obsolete and must change to achieve functional equivalency in 4. And then you need to check for unanticipated changes to the answer that "be brief" would cause: maybe it would now be too brief! Maybe not, but you have to have some method of checking.
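In practice that tends to end up as prompt templates keyed by model, something like the toy sketch below (the wording tweaks are just invented examples of the kind of per-model adjustment described above):

```python
# Toy example of model-specific prompt templates; the tweaks are invented examples.
PROMPTS = {
    "gpt-3.5-turbo": "Answer the following question.\nQuestion: {question}",
    "gpt-4": "Answer the following question. Be brief.\nQuestion: {question}",
}

def build_prompt(model: str, question: str) -> str:
    return PROMPTS[model].format(question=question)
```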
The performance of the model can be improved with tweaks to the prompt, but the tweaks end up being model-specific. This is why "prompt engineering" exists for productionized use cases instead of people just spitting words semi-randomly into a textbox. Your old prompts probably won't completely fail but they'll behave differently under a different model.
RLHF and fine-tuning! While these methods make prompting more accessible and approachable to people unfamiliar with LLMs and otherwise expecting an omniscient chatbot, they make the underlying dynamics a lot more unstable. Personally, I prefer the untuned base models. In fact, I depend upon a set of high-quality prompts (none of which are questions or instructions) which perform similarly across different base models of different sizes (e.g., GPT-2-1.5B, code-davinci-002, LLaMA-65B, etc.) but frequently break between different instruction-tuned models and different versions of the _same_ instruction-tuned model (I think Google's Flan-T5-XXL has been the only standout exception in my tests, consistently outperforming its corresponding base model, and although it's not saying much, I admit that GPT-4 does do a lot better than GPT-3.5-turbo in remaining consistent across updates).
Prompts read like natural language, but you can’t always write them the way you’d write for a human. Here are some concrete examples of semantically similar prompts being interpreted quite differently by an LLM. https://twitter.com/mitchellh/status/1645562198935347205 And these chaotic, butterfly-effect areas are going to be different for different models, which is what prompted (lol) the original question.
They can. Results change as you change your models, and results aren't always strictly better or worse, which is why testing gold-standard results with any prompt and model changes is so important for applications utilizing LLMs.
Prompts are natural language, but you're using them with the model in a way similar to getting a split-second gut feel reaction from a human - that reaction may very well vary between people.
This sounds like making diffusion backwards compatible with ESRGAN. Technically they are both upscaling denoisers (with finetunes for specific tasks), and you can set up objective tests compatible with both, but the actual way they are used is so different that it's not even a good performance measurement.
The same thing applies to recent LLMs, and the structural changes are only going to get more drastic and fundamental. For instance, what about LLMs with separate instruction and data context? Or multimodal LLMs with multiple inputs/outputs? Or LLMs that finetune themselves during inference? That is just scratching the surface.
There are loads of ways you could do this, e.g. cross attention, completely separate tokens for prompt and data, or arguably the special tokens that surround system prompts (the GPT-4 API does this).
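That last one is already visible in the chat-style APIs: the system prompt and the user/data content are separate messages, and each role gets delimited by special tokens when the request is serialized for the model. Roughly (using the pre-1.0 openai client):

```python
# The chat format already separates instructions from data at the message level.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4-0314",
    messages=[
        {"role": "system", "content": "You are a translation engine. Translate to French."},
        {"role": "user", "content": "Untrusted document text goes here."},
    ],
)
print(response["choices"][0]["message"]["content"])
```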
> If you expect the models you use to change at all, it’s important to unit-test all your prompts using evaluation examples.
It's mentioned earlier in the article, but I'd like to emphasize that if you go down this route, you should either do multiple evaluations per prompt and come up with some kind of averaged result, or set the temperature to 0.
FTA:
> LLMs are stochastic – there’s no guarantee that an LLM will give you the same output for the same input every time.
> You can force an LLM to give the same response by setting temperature = 0, which is, in general, a good practice.
Temperature = 0 will give deterministic results, but might not be as "creative". Also, it's not enough to guarantee determinism: the hardware executing the LLM can lead to different results as well.
In terms of being part of a test suite, I think determinism > creativity in the response. But I would agree there's probably rough edges there, it's possible that some prompts never perform well with temperature set to 0.
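A hedged sketch of what such a test suite might look like with pytest and the openai client; the gold examples, sample count, and pass threshold are invented placeholders:

```python
# Sketch of a prompt regression test; gold examples and thresholds are placeholders.
import openai
import pytest

GOLD_EXAMPLES = [
    ("Translate to French: Hello, world.", "Bonjour, le monde."),
]

def complete(prompt: str, model: str = "gpt-3.5-turbo", temperature: float = 0.0) -> str:
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"].strip()

@pytest.mark.parametrize("prompt,expected", GOLD_EXAMPLES)
def test_prompt_deterministic(prompt, expected):
    # temperature=0 keeps the output (mostly) stable from run to run
    assert expected in complete(prompt, temperature=0.0)

@pytest.mark.parametrize("prompt,expected", GOLD_EXAMPLES)
def test_prompt_sampled_average(prompt, expected):
    # alternative: sample several times at a nonzero temperature and require
    # some fraction of outputs to pass, rather than demanding exact stability
    outputs = [complete(prompt, temperature=0.7) for _ in range(5)]
    hits = sum(expected in out for out in outputs)
    assert hits / len(outputs) >= 0.8
```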
Setting the temperature to 0 is not good practice for most tasks. It's great if you are doing a multiple-choice benchmark, but for most generation tasks the output will be noticeably worse, in particular more repetitive.
I suggest this is the wrong way to think about this. Alexa tried for a very long time to agree on an "Alexa Ontology" and it just doesn't work for large enough surface areas. Testing that new versions of LLMs work is better than trying to make everything backward compatible. Also, the "structured" component of the response (e.g. "send your answer in JSON format") should be something that isn't super brittle. In fact, if the structure takes a lot of prompting to work, you are probably setting yourself up for failure.
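For the structured part, a forgiving parser with a fallback (rather than assuming perfect JSON every time) goes a long way; a minimal sketch:

```python
# Lenient extraction of a JSON object from model output; returns None on failure
# so the caller can retry or fall back instead of crashing.
import json
import re
from typing import Optional

def extract_json(text: str) -> Optional[dict]:
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # fall back to the first {...} block in the text, if any
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                return None
    return None
```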
LMQL helps a lot with this kind of thing. It makes it really easy to swap prompts and models out, and in general it allows you to maintain your prompt workflows in whatever way you maintain the rest of your python code.
I’m expecting there will be more examples soon, but you can check out my tree of thoughts implementation below to see what I mean