Facebook is honestly the least interesting crawler misbehaving right now. The real shift is GPTBot, ClaudeBot, PerplexityBot and a dozen other AI crawlers that don't even identify themselves half the time.
I've been monitoring server logs across ~150 sites and the pattern is striking: AI crawler traffic increased roughly 8x in the last 12 months, but most site owners have no idea because it doesn't show up in analytics. The bots read everything, respect robots.txt maybe 60% of the time, and the content they index directly shapes what ChatGPT or Perplexity recommends to users.
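For anyone who wants to check their own logs: here's a minimal sketch of tallying AI-crawler hits from raw access-log lines by matching publicly documented user-agent tokens. The bot list is non-exhaustive and the sample lines are made up for illustration.

```python
from collections import Counter

# Publicly documented AI crawler user-agent substrings (non-exhaustive).
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
           "Google-Extended", "CCBot", "Bytespider"]

def tally_ai_crawlers(log_lines):
    """Count hits per AI crawler across raw access-log lines."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                counts[bot] += 1
                break  # one bot per request line
    return counts

sample = [
    '1.2.3.4 - - [01/May/2025] "GET /docs HTTP/1.1" 200 "-" "Mozilla/5.0 GPTBot/1.0"',
    '5.6.7.8 - - [01/May/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(tally_ai_crawlers(sample))
```

Note this only catches the crawlers that declare themselves; the undeclared ones are exactly the problem.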
The irony is that robots.txt was designed for a world where crawling meant indexing for search results. Now crawling means training data and real-time retrieval for AI answers. Completely different power dynamic and most robots.txt files haven't adapted.
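If you do want to opt out of AI crawls while staying indexable for search, the vendor-documented user-agent tokens make that possible in principle. A robots.txt along these lines is the common pattern, with the caveat that compliance is entirely voluntary:

```txt
# Disallow documented AI training/retrieval crawlers, allow everything else.
# Enforcement is voluntary -- tokens per each vendor's published docs.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```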
This matches what I've been noticing. A lot of AI crawler traffic just doesn't show up clearly in typical analytics dashboards, especially when tools aggressively filter or sample.
Part of why I built UXWizz was to avoid black-box filtering and keep control over how traffic is classified. When you own the analytics stack, you get to decide what’s "valid" instead of inheriting someone else's definition.
One underappreciated angle: the dot-com boom created new distribution channels that were broadly accessible. Anyone could put up a website and reach people. The AI boom is quietly reshuffling distribution in ways most companies haven't noticed yet.
We've been tracking how LLMs recommend products across categories. The correlation between Google ranking and LLM recommendation is near zero (0.08 across ~150 brands we tested). So a company that spent a decade building SEO authority can be completely invisible to ChatGPT, while a smaller competitor with better documentation and more mentions on Reddit or GitHub gets recommended instead.
That's a massive, silent redistribution of discovery -- and unlike the dot-com era, the companies being disrupted mostly don't even know it's happening yet.
The irony is that while AI slop is gaming Google, the AI models themselves (ChatGPT, Claude, Perplexity) are surprisingly resistant to it. We tested ~150 B2B tools and the correlation between Google rank and LLM recommendation was 0.08. Basically zero.
So you end up with two diverging discovery layers: Google, increasingly polluted by cheap AI content, and LLMs, which seem to weight structured data, documentation quality, and mentions on high-trust platforms (Reddit, GitHub, Stack Overflow) way more than traditional SEO signals.
The real question is whether Google fixes this before users just default to asking ChatGPT instead of searching.
I suspect almost everyone who is savvy is already using LLMs instead of Google; I have been for a while. (If you only read the summary from Google and don't actually click the result links, you are also using an LLM.)

The explicit ads angle is only half the story. Even without paid placements, these models already have implicit recommendations baked in.
We ran queries across ChatGPT, Claude, and Perplexity asking for product recommendations in ~30 B2B categories. The overlap between what each model recommends is surprisingly low -- around 40% agreement on the top 5 picks for any given category. And the correlation with Google search rankings? About 0.08.
So we already have a world where which CRM or analytics tool gets recommended depends on which model someone happens to ask, and nobody -- not the models, not the brands, not the users -- has any transparency into why. That's arguably more dangerous than explicit ads, because at least with ads you know you're being sold to.
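A figure like "40% agreement on the top 5" can be made concrete as the mean pairwise overlap of each model's top-k set. A small sketch with hypothetical picks (the brand names below are placeholders, not our results):

```python
from itertools import combinations

def pairwise_top_k_agreement(recs_by_model):
    """Mean fraction of shared picks (|A ∩ B| / k) across all model pairs."""
    models = list(recs_by_model.values())
    k = len(models[0])
    overlaps = [len(set(a) & set(b)) / k
                for a, b in combinations(models, 2)]
    return sum(overlaps) / len(overlaps)

# Hypothetical top-5 CRM picks per model for one category.
recs = {
    "chatgpt":    ["HubSpot", "Salesforce", "Pipedrive", "Zoho", "Close"],
    "claude":     ["Salesforce", "HubSpot", "Attio", "Copper", "Zoho"],
    "perplexity": ["Pipedrive", "HubSpot", "Salesforce", "Freshsales", "Attio"],
}
print(pairwise_top_k_agreement(recs))  # 0.6
```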
The JS rendering point is critical. Even though bots like GPTBot technically have headless capabilities, they often fall back to text-only extraction for non-priority pages to save compute. We see a lot of "invisible" content because of this, especially in e-commerce.
One other signal to check: internal linking structure. AI crawlers seem to respect semantic clusters more than traditional pagerank flow. If your "about" page isn't semantically linked to your "product" page in a way the LLM understands as a relationship, it often hallucinates the connection.
Thanks for the detailed feedback. Those are the next items on my list. I'll add headless browser rendering to work around the JavaScript issues, and a semantic clustering check as well.
Seems like you're quite well versed in this space. Would you be open to sharing some interesting resources, or getting on a call with me to talk through whether you've struggled with this problem and what your workflow looks like?
One angle that's underappreciated here is discovery. Even if the SoR survives as the backend, the interface layer is shifting to AI agents and assistants.
We've been looking at how LLMs actually recommend software tools (audited ~150 B2B SaaS across ChatGPT, Claude, Perplexity, Gemini) and there's a massive bias toward products with clean, parseable documentation and structured data. The "vibe-coded" replacements often win here not because they're better products, but because they're easier for an agent to read and integrate with than a legacy SoR hiding behind a login wall or a PDF.
If your product is invisible to the agent doing the recommending, the quality of your backend stops mattering. Has anyone else noticed this shift in how enterprise software gets discovered?
Interesting that you have "GEO ready" baked into the boilerplate. Curious what that covers specifically -- are you doing structured data / schema markup for LLM retrieval, or something more like llms.txt / meta tags aimed at AI crawlers?
We've been testing how different content structures affect whether AI assistants actually recommend a product, and the difference between well-structured docs vs not is pretty dramatic.
This is a great dataset. The 'cross-domain causality leap' is something we see constantly in brand monitoring—e.g. an LLM seeing a pricing page for 'Product A' and a feature list for 'Product B' and confidently asserting 'Product A has Feature B for $X'.
One edge case you might want to add: *Temporal Merging*. We often see models take a '2024 Roadmap' and a '2023 Release Note' and hallucinate that the roadmap features were released in 2023. It's valid syntax, valid entities, but impossible chronology.
Are you planning to expand this to RAG-specific failures (where the context retrieval causes the mix-up) or focusing purely on model-internal logic gaps?
That's a great example -- the "Product A + Product B pricing merge" is exactly the kind of structurally valid but impossible composition I was trying to isolate.

I really like the "Temporal Merging" framing. You're right: roadmap + release notes = syntactically consistent, entity-valid, but chronologically impossible. I haven't explicitly modeled temporal integrity yet, but that seems like a natural extension of the cross-domain tests.

Regarding RAG: so far the focus has been on model-internal structural logic gaps. I haven't built retrieval-aware tests yet. That said, I suspect many RAG failures are just amplified cross-document merging errors, so a temporal integrity layer might actually generalize well there.

If you have examples from brand monitoring contexts, I'd love to add them as new regression cases.
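To make the temporal-integrity idea concrete, here's a rough sketch of what such a regression check could look like. The schema (feature name plus claimed release date, checked against the date the feature was first announced as planned) is entirely hypothetical, just to illustrate the chronology test:

```python
from datetime import date

def impossible_chronology(claims, announced_dates):
    """Flag claims asserting a feature shipped before the document that
    first announced it as planned. Hypothetical schema for illustration."""
    flagged = []
    for feature, claimed_release in claims:
        announced = announced_dates.get(feature)
        if announced is not None and claimed_release < announced:
            flagged.append(feature)
    return flagged

# A "2024 Roadmap" announces the feature; the model claims it shipped in 2023.
announced_dates = {"realtime-sync": date(2024, 1, 15)}
claims = [("realtime-sync", date(2023, 6, 1))]
print(impossible_chronology(claims, announced_dates))  # ['realtime-sync']
```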
Distinguishing 'AI Research' (crawling) from 'AI Referral' (user clicks) is the hardest part. Most agents (OAI-SearchBot, ClaudeBot) declare themselves in UA, but the actual click-through often strips the referrer or shows as direct/none. We've had some luck correlating 'time of crawl' with 'time of visit' to fingerprint AI traffic, but it's noisy.
Self-hosted is definitely the way to go for raw logs though. GA4 obfuscates too much of this.
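The crawl-time/visit-time correlation we use is roughly the following shape: flag any visit that lands within a short window after an AI crawler hit. A toy sketch, assuming you've already parsed both event streams into timestamps (window size is a tunable guess, and this is noisy by construction):

```python
from datetime import datetime, timedelta

def visits_following_crawl(crawl_times, visit_times, window_minutes=30):
    """Flag visits landing within `window_minutes` after an AI crawler hit.
    A noisy heuristic proxy for AI-referred traffic, not a hard attribution."""
    window = timedelta(minutes=window_minutes)
    return [v for v in visit_times
            if any(c <= v <= c + window for c in crawl_times)]

crawls = [datetime(2025, 5, 1, 12, 0)]
visits = [datetime(2025, 5, 1, 12, 10),   # within the window -> flagged
          datetime(2025, 5, 1, 14, 0)]    # outside the window
print(visits_following_crawl(crawls, visits))
```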
Very cool to see a local-first approach to agent memory. The shift away from hosted vector DBs for single-agent use cases makes total sense.
Are you using `sqlite-vec` under the hood for the embeddings or a custom extension? Also, curious how you handle the 'hybrid' search part—is it just a linear combination of FTS5 bm25 and vector cosine similarity, or do you have a re-ranking step?
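On the hybrid question: one common approach (not claiming it's what this project does) is a min-max normalized linear combination of the two score lists. One wrinkle worth handling is that SQLite FTS5's `bm25()` returns lower-is-better values, so the lexical scores need inverting first. A sketch over plain dicts:

```python
def hybrid_scores(bm25_scores, cosine_scores, alpha=0.5):
    """Linear fusion of lexical and vector scores after min-max normalization.
    Assumes bm25_scores follow SQLite FTS5's convention (lower = better),
    so they are negated before normalizing."""
    def minmax(xs):
        lo, hi = min(xs.values()), max(xs.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {k: (v - lo) / span for k, v in xs.items()}

    lexical = minmax({k: -v for k, v in bm25_scores.items()})  # higher = better
    vector = minmax(cosine_scores)
    docs = lexical.keys() & vector.keys()
    return {d: alpha * lexical[d] + (1 - alpha) * vector[d] for d in docs}

bm25 = {"doc1": -3.2, "doc2": -1.1, "doc3": -0.4}   # FTS5 bm25(): lower = better
cos  = {"doc1": 0.82, "doc2": 0.91, "doc3": 0.40}
ranked = sorted(hybrid_scores(bm25, cos).items(), key=lambda kv: -kv[1])
print(ranked)
```

Reciprocal rank fusion is the other popular choice and sidesteps score normalization entirely, which is why I'm curious whether there's a re-ranking step.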