Current LLMs likely have multiple world models of varying quality. Good prompts are required to activate suitable models for each task.
1. Did you use GPT-3.5 (free version) or GPT-4 (paid one) to form the judgment? Their performance on harder tasks differs significantly, as shown in https://openai.com/research/gpt-4.
2. Have you tried adding "Please think step by step" to some harder requests? This simple phrase gets most current LLMs to perform significantly better. It's a bit like asking students to show their work, which forces them to think more clearly.
Current LLMs, without additional mechanisms, tend to perform like a drunk or sleepy human, i.e. using mostly intuition or System-1 (as defined in "Thinking, Fast and Slow"). A prompt such as "think step by step" asks them to think more in the System-2 style. (There are other techniques which get them to perform even better still.)
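To make the suggestion concrete, here is a minimal sketch of adding that phrase to a request before sending it to whatever model you use. The helper name `with_step_by_step` and the constant `STEP_BY_STEP_CUE` are made up for illustration; no real API is called, and whether the cue helps will depend on the model and the task.

```python
# Hypothetical helper: appends a "think step by step" cue to a request.
# It only builds the prompt text; sending it to a model is left to you.

STEP_BY_STEP_CUE = "Please think step by step."


def with_step_by_step(request: str, cue: str = STEP_BY_STEP_CUE) -> str:
    """Return the request with the reasoning cue appended on its own line."""
    request = request.strip()
    # Avoid adding the cue twice if the request already contains it.
    if cue.lower() in request.lower():
        return request
    return f"{request}\n\n{cue}"


if __name__ == "__main__":
    print(with_step_by_step("Which arguments does this CLI tool accept?"))
```

Keeping the cue in one place like this makes it easy to compare answers with and without it on the same request.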
I think of current LLMs as a very-well-read, but often sleepy intern who needs strict instructions, feedback, and sometimes extra training if you want them to perform well.
As stated in another response, I'm using 3, so I'm probably missing some good stuff.
I have tried adding things like "please think step by step" (quite literally that phrase, actually), and also "please make sure you check the facts before answering so you don't include non-existing arguments" (when asking about a CLI tool that takes arguments), but I didn't notice a significant improvement.
I like what you say in your last line, though I think a big challenge is that a very-well-read intern got to be very-well-read for at least two reasons: 1) they like to read and can do that a lot (LLMs can do this; I'm not saying they like it, but training them is analogous to someone reading a lot), 2) they went through a curated list of reading material. My experiences make me think part 2 is a weak part of ChatGPT.
I still think LLMs are not the way towards an interesting AI. I don't know why we're insisting so much on natural language. I mean, I can understand this for simpler tasks like a support chatbot, but I wish there was (maybe there is?) good research on building an AI that is not based on human language, since it's one of the worst mediums to communicate with rigor.