Monkey Island taught me English. I can't tell you how confusing insult sword fighting was initially. I had to create long tables with the correct answers because I didn't get most of the puns, and then I had to start from scratch when I had to fight Carla.
Yeah, it's crazy that there is no trustworthy source for model reviews. I'd love to know how well the new Deepseek 4 actually performs, for example, but I don't want to spend the next week testing it out. Reddit used to be a somewhat useful gauge, but now there are posts on how 4 is useless right next to posts on how amazing it is. And I have no idea if this is astroturfing, or somebody using a quantized version, or different workloads, or what.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find the older model to be dramatically worse. But was that older model always that bad, or am I now being served a different model under the same version name?
One challenge is that model evaluation is typically domain/application specific. Model performance can also depend on the system prompt and the input/context.
Regarding evaluation, I've found tools like promptfoo (and in some cases custom tools built on top of it) useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Still, having some basic evaluation metrics and test cases helps, as does being able to easily do side-by-side comparisons by hand.
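For anyone who hasn't tried it: a promptfoo setup is just a YAML file listing prompts, providers, and test cases with assertions. A minimal sketch below; the model id, prompt, and assertion values are placeholders you'd swap for your own workload:

```yaml
# promptfooconfig.yaml (sketch; values are placeholders)
prompts:
  - "Summarize the following text in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Your source document goes here..."
    assert:
      # cheap sanity checks; add llm-rubric etc. for fuzzier criteria
      - type: icontains
        value: "expected keyword"
```

Running `promptfoo eval` against a config like this also gives you the side-by-side view across providers, which covers the "compare by hand" case.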
What is today right now in Australia? How about where you live? You have not thought enough about what you’re saying and are probably not aware of all the weird time issues we have in our world.
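To make the point concrete, "today" is not a single global value; the same instant falls on different calendar dates in different zones. A quick sketch with Python's zoneinfo (the instant is arbitrary):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One single instant in time, expressed in UTC.
instant = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)

# The same instant viewed from two places.
sydney = instant.astimezone(ZoneInfo("Australia/Sydney")).date()
la = instant.astimezone(ZoneInfo("America/Los_Angeles")).date()

print(sydney)  # 2024-01-01
print(la)      # 2023-12-31
```

So "what is today" depends entirely on where you ask, before you even get to DST transitions or zones on half-hour offsets.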
Also, isn't this just a huge fire hazard if they actually do what they claim? Or will they remove the batteries from these old, continually plugged in, poorly cooled laptops?
My theory is that YouTube blocks some accounts for publishing LLM-generated music, and people who wanted to earn ad money from it get burned and publish LLM-generated posts about it.
I would be on YouTube's side here, except it's possible that their motivation is simply to avoid poisoning their dataset while they train their models off creators' videos. Also, the question is how they identify what's LLM-generated without false positives.
Maybe there was also artificial-listen fraud (it's a known problem for their competitor Spotify), but we'll never know, because no one who was blocked would publish that honestly.