RLHF and fine-tuning! While these methods make prompting more approachable for people unfamiliar with LLMs who otherwise expect an omniscient chatbot, they make the underlying dynamics much less stable. Personally, I prefer the untuned base models. In fact, I depend on a set of high-quality prompts (none of which are questions or instructions) that perform similarly across base models of different sizes (e.g., GPT-2-1.5B, code-davinci-002, LLaMA-65B) but frequently break between different instruction-tuned models, and even between versions of the _same_ instruction-tuned model. Google's Flan-T5-XXL has been the only standout exception in my tests, consistently outperforming its corresponding base model; and while it isn't saying much, I'll admit that GPT-4 does stay consistent across updates noticeably better than GPT-3.5-turbo.
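To make the distinction concrete, here is a minimal sketch of the kind of continuation-style prompt I mean, fed to a base model through Hugging Face transformers. The prompt text and sampling settings below are illustrative placeholders, not the actual prompts from my set:

```python
# A minimal sketch: a continuation-style prompt (not a question or an
# instruction) sent to an untuned base model. The prompt text and sampling
# parameters are illustrative assumptions, not my actual prompt set.
from transformers import pipeline

# GPT-2-1.5B corresponds to the "gpt2-xl" checkpoint on the Hugging Face Hub.
generator = pipeline("text-generation", model="gpt2-xl")

prompt = (
    "The field notes were sparse, but the third entry stood out. "
    "It read: "
)

# A base model simply continues the text, so the prompt's framing and style
# do all the work; there is no instruction for the model to follow or refuse.
result = generator(
    prompt,
    max_new_tokens=60,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(result[0]["generated_text"])
```

The same prompt can be pointed at a larger base model without any rewriting, which is what makes comparing behavior across model sizes straightforward in the first place.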