It just seems like a bit of a clusterfuck. I have no idea who's leading which part or who to talk to. Everything seemed a bit all over the place: I looked on Discord, and some people seemed to want all the data in the world, while others thought it should be a small foundational model.
I took a look at the annotator frontend the other day and, yikes, it takes way too long to annotate, and there's not enough clarity on how to annotate. Sure, if you get 10 people to annotate each task you can average the results, but will you get that many people? And you're calling it data collection. That's not data collection, that's data annotation.
Okay, so why not use some of the existing models to create some of these samples, and train on them?
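To make the averaging point concrete, here's a rough sketch. It assumes each annotation is just a (task_id, numeric rating) pair, which is my simplification and not how the actual frontend stores things. It also returns the annotator count per task, so you can see which tasks never got anywhere near 10 people.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(annotations):
    """Average ratings per task.

    `annotations` is an iterable of (task_id, rating) pairs.
    This format is hypothetical, made up for illustration.
    Returns {task_id: (mean_rating, annotator_count)}.
    """
    by_task = defaultdict(list)
    for task_id, rating in annotations:
        by_task[task_id].append(rating)
    # Keep the count next to the mean so thinly covered tasks are visible.
    return {task_id: (mean(ratings), len(ratings))
            for task_id, ratings in by_task.items()}


def under_covered(averaged, min_annotators=10):
    """List tasks with fewer than `min_annotators` ratings."""
    return [task_id for task_id, (_, count) in averaged.items()
            if count < min_annotators]
```

If most tasks end up in `under_covered`, the average isn't buying you much, which is the worry.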
I think you need:
- an architecture plan for information retrieval, search intent
- a better annotator that's faster to annotate with
- what data do you actually want to collect? Only those 50k? And do you need to train a foundational model, or can you use an existing model?
What about taking a look at what's already been done? Like BlenderBot, LangChain, etc. I love building stuff from scratch, but... at least do some analysis of the issues and problems, and of why this method will work.
And again, I do love building stuff from the ground up.