
Step 1. Humans write copy for humans to buy their garbage; humans counter by tuning out and switching channels.

Step 2. Humans write SEO copy for machines to rank them higher.

Step 3. LLMs write copy for machines to rank them higher.

Step 4. Humans use LLMs to try to distill the LLM-generated SEO spam for any remaining signal.

Also to your point:

> SEO listicle garbage filling the internet.

The feeling that the LLM is better than what you described is going to be very temporary; then the mountains of LLM-generated bullshit are going to overwhelm even an LLM's ability to make meaningful sense of them.



You're missing the point. If we want to know something, we won't even have to google it; we will just ask an LLM. There will be no market for websites full of it because we can just directly ask it to answer our questions.

The only "if" to all this is if we will destroy the LLMs by feeding them their own diarrhea. I expect a sort of natural selection here to play out, especially in the open source space. Ones that are trained on LLM generated blogspam will probably, I expect, get outperformed by ones that are trained on genuine information, or at the very least ones made using new techniques that adequately filter noise.


> If we want to know something, we won't even have to google it; we will just ask an LLM. There will be no market for websites full of it because we can just directly ask it to answer our questions.

How will it learn anything new?


> Ones trained on LLM-generated blogspam will, I expect, get outperformed by ones trained on genuine information, or at the very least by ones made using new techniques that adequately filter noise.

Yes, humans are notorious for only seeking out high quality, accurate data, especially when it conflicts with our priors.

To say nothing of our ability to assess the accuracy or truthiness of information in the first place (look at how many people take it on faith that ChatGPT isn't wrong as often as it is right).


But there's still no way to get an LLM to only output "fact", because that's not a property of language.


That's also true of a web search engine, but an LLM may (in principle; I'm not saying it's there yet) be able to spot inconsistencies in the source data and notice disagreement.


I’m not following. If ChatGPT gets worse, OpenAI can simply not update it. Or revert to a previous version.

For Google, they’re at the mercy of whatever the internet has.


ChatGPT is also at the mercy of whatever the internet has. Including more and more of what it was used to generate.


It isn't, though. Like I said, if the model gets worse, OpenAI can simply not release a new version.

You also have to consider the money angle. As ChatGPT and other chatbots become more popular, people will stop producing garbage internet articles, because those articles will be less popular and therefore less profitable. Bloggers who enjoy writing will continue to do so; for them it was never about the money.

Further, the internet is only one small portion of information available to train on. There’s a lot of other data out there, including real-world conversations.


> Like I said, if the model gets worse, OpenAI can simply not release a new version.

So now it's got great information about the Model T Ford but knows nothing about our new Mars colony?

I don't think "just don't update the model" is a likely option.


You don't update models to add new information. That's extremely inefficient and susceptible to catastrophic forgetting. If you want the model to have new information, you update an offline knowledge base. So yes, you can simply not update the model.
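
To make the division of labor concrete, here's a rough sketch of the retrieval-augmented pattern I mean. Every name here (add_document, search_knowledge_base, frozen_llm) is an illustrative stand-in, not any vendor's actual API, and the word-overlap scoring is a toy in place of real embedding search:

    # Sketch only: the model is frozen; "learning" is a write to the store.
    KNOWLEDGE_BASE: list[str] = []

    def add_document(text: str) -> None:
        # Teaching the system something new changes the store,
        # not the model weights.
        KNOWLEDGE_BASE.append(text)

    def search_knowledge_base(query: str, k: int = 3) -> list[str]:
        # Toy relevance score: count of shared words. A real system
        # would use embeddings and a vector index, but the split
        # between store and model is the same.
        q = set(query.lower().split())
        ranked = sorted(KNOWLEDGE_BASE,
                        key=lambda d: len(q & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def frozen_llm(prompt: str) -> str:
        # Stand-in for a model whose weights never change after release.
        return f"(answer conditioned on)\n{prompt}"

    def answer(query: str) -> str:
        context = "\n".join(search_knowledge_base(query))
        return frozen_llm(f"Context:\n{context}\n\nQuestion: {query}")

    add_document("The Mars colony opened its third habitat dome in 2031.")
    print(answer("When did the Mars colony open its third dome?"))

The point is that learning about the Mars colony is a write to KNOWLEDGE_BASE; the weights behind frozen_llm never change.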


Huh? You won't update the model, you'll just give it new information? The exact concern is that the new information will be garbage aimed at pushing the model to produce certain output, much like SEO spammers manipulate Google search results.

"Just don't update the model, only feed it new information" is exactly how to get to the outcome of concern in this thread.


Yes, updating the model is different from updating the knowledge base the model uses.


Great, so you've updated your knowledge base, it's got garbage targeted to make it attractive to the model, and now your model is outputting garbage. It's the exact same problem Google has fighting the SEO spammers. Now the model is significantly less useful, exactly as suggested.

We've already seen exactly this happen with search. There's no reason to believe that LLMs are immune.


I understand what you are saying, but to me it sounds very handwavy and (not to be disrespectful) naive.

How would LLM upstarts be able to counter the massive commercial interests? As with Google, they too will come to prefer money over usefulness, at the latest once they have a wide user base.

And distinguishing spam from signal with LLMs is an even less proven approach than it is with search.

And not updating a model means it will be stuck in the COVID-19 era forever.



