Did it? How do you know it didn't confidently miss important documents?
My experience so far has just been asking ChatGPT questions and then researching the answers myself to confirm them, so maybe I'm missing something. But it has been confidently wrong on important details often enough (right now) to make it absolutely not a fire-and-forget tool.
The worst part is the confidence: it's like having a coworker who just straight up lies to your face at random. Even if it's only 5% of the time, you basically can't trust anything they say, so you need to double-check all of it.
This doesn't make it useless, but it means it lends itself to "hard to do but easy to verify" tasks. As far as I can tell, your example is not one of those: you can verify that the documents it picked out are relevant, but not that the documents it left out were irrelevant.
On the other hand, when a human gives me an answer I can usually form my own estimate of how trustworthy it is, e.g. thanks to:
* their reputation with respect to the question domain (if I ask a basic C++ question to a C++ expert I'll trust them)
* their own communicated confidence and how good they are at seeing their own shortcomings (if they say "but don't quote me on that, better ask this other person who knows more" it's fine)
A 5% rate of bad answers doesn't matter much if, 99% of the time, I knew in advance that I should look further: the rate of errors that slip through undetected drops to 0.05%. ChatGPT and the others are missing this confidence indicator; they seem to answer just as confidently no matter what.
To be clear, I don't see a fundamental reason why LLMs couldn't compute some measure of confidence (which will itself be wrong from time to time, but with less impact), so I expect this to be solved eventually.
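For what it's worth, some raw ingredients are already exposed: sampled tokens come with log probabilities, which you can aggregate into a rough per-answer score. Below is a minimal sketch assuming the OpenAI Python client's `logprobs` option; the geometric-mean aggregation and the 0.9 threshold are illustrative assumptions on my part, not a validated calibration method.

```python
# Sketch: use per-token log probabilities as a crude confidence proxy.
# Assumes the OpenAI Python client (openai >= 1.x) and its `logprobs` option;
# the averaging scheme and the 0.9 cutoff are illustrative, not validated.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_confidence(question: str, model: str = "gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,  # return the log probability of each sampled token
    )
    choice = response.choices[0]
    logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric mean of token probabilities: a length-normalized score in (0, 1].
    confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
    return choice.message.content, confidence

answer, confidence = answer_with_confidence("Who designed the C++ STL?")
flag = "" if confidence > 0.9 else " (low confidence -- verify this)"
print(f"{answer}{flag}  [score={confidence:.2f}]")
```

Note that this scores how fluent the model found its own token sequence, not whether the claims are true, which is part of why it's a weak signal in practice.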
Base GPT-4 already did this: its confidence in an answer correlated directly with its ability to actually answer correctly. You can see the calibration plots in the GPT-4 technical report. But the hammer of alignment (RLHF) took it away.