
>How about keeping a count of links encountered and use highest to decide what to crawl next

Then you have a giant heap with several billion elements, and you'll be updating hundreds of its entries for every webpage you crawl. I'm pretty sure that'll be your new bottleneck.
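To make the cost concrete, here is a minimal sketch of the count-and-pick-highest idea in Python. The class name `LinkCountFrontier` and its methods are hypothetical, and since `heapq` has no decrease-key operation this sketch re-pushes an entry on every count change and skips stale entries on pop. It is an illustration under those assumptions, not how any production crawler is known to work.

```python
import heapq
from collections import defaultdict


class LinkCountFrontier:
    """Hypothetical frontier: pop the uncrawled URL with the most links seen so far."""

    def __init__(self):
        self.counts = defaultdict(int)  # url -> links to it seen so far
        self.heap = []                  # (-count, url); may contain stale entries
        self.crawled = set()
        self.heap_pushes = 0            # heap writes performed so far

    def record_links(self, outlinks):
        """Call once per crawled page with the URLs it links to."""
        for url in outlinks:
            if url in self.crawled:
                continue
            self.counts[url] += 1
            # No decrease-key in heapq: push a fresh entry for every update.
            heapq.heappush(self.heap, (-self.counts[url], url))
            self.heap_pushes += 1

    def pop_next(self):
        """Return the most-linked uncrawled URL, or None if nothing is left."""
        while self.heap:
            neg_count, url = heapq.heappop(self.heap)
            if url in self.crawled or -neg_count != self.counts[url]:
                continue  # stale entry left behind by a later update
            self.crawled.add(url)
            return url
        return None
```

Every outlink on every crawled page costs one O(log n) heap write, so a page with a few hundred links means a few hundred writes into a structure holding billions of entries.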

Besides, it still doesn't solve the problem of content farms and spam.

>set a max pages to crawl per site to prevent never ending single site crawls?

You either set the limit too low and don't crawl, for example, all of wikipedia, or you set the limit too high and still waste tons of resources.



Set the limit as a function of incoming links - Wikipedia will get a high limit, spam site with low incoming links from the current crawl will get a lower limit.

It's probably best to add an IP based limit factor too to handle link farms.

Something like "max-pages crawled is directly proportional to the (incoming links / linking unique domains)"; or just have it naively proportional to the linking unique domains to reduce processing cost.
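A rough sketch of the naive variant (budget proportional to the number of unique linking domains), in Python. The function names, `pages_per_domain`, `floor` and `ceiling` are made-up illustration values, and hostnames stand in for "domains" here; grouping by registered domain or by IP, as suggested above, would need extra data this sketch does not have.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def linking_domains(links):
    """Map each target host to the set of distinct other hosts that link to it.

    `links` is an iterable of (source_url, target_url) pairs from the current crawl.
    """
    sources = defaultdict(set)
    for src_url, dst_url in links:
        src = urlsplit(src_url).hostname
        dst = urlsplit(dst_url).hostname
        if src and dst and src != dst:  # ignore self-links
            sources[dst].add(src)
    return sources


def crawl_budget(host, sources, pages_per_domain=10, floor=5, ceiling=1_000_000):
    """Max pages to crawl on `host`, proportional to its unique linking hosts.

    The constants are arbitrary placeholders, not tuned values.
    """
    n = len(sources.get(host, ()))
    return max(floor, min(ceiling, pages_per_domain * n))
```

For example, a host linked from two distinct hosts gets a budget of 20 pages with these defaults, and a host with no known inbound links gets the floor of 5.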


> Set the limit as a function of incoming links - Wikipedia will get a high limit, spam site with low incoming links from the current crawl will get a lower limit.

There are spam sites with more incoming links than Wikipedia....

Designing crawl policy is a lot like designing security policy. The opposition is very well-funded, smart, experienced, and very motivated.

If you're just smart, they've got you beat.


Can you give examples of spam sites with more inbound linking domains per IP than wikipedia? Have you crawled this info yourself or is it from something like linkscape?

Google apparently solve this with a domain trust factor discounting link juice from low value domains for SERPs purposes but I'm not sure what they feed forward to control their bots.

This means that for some time it has appeared that large numbers of low quality inbound links has been a negative for Google SERP placement ... so I'm wondering if such [spam] domain owners are being as smart as you think??


> Can you give examples

It's been several years since I worked on Yahoo's crawler.

It's interesting that at the beginning of this, you doubted that spam domains could have a large number of inbound links (and suggested that inbound link count was a powerful indicator) but by the end, you seemed aware of such domains....


>by the end, you seemed aware of such domains

I'm not. I've never done any research on "spam domains". Do you have any examples or not?


> I'm not.

Then who wrote

> This means that for some time it has appeared that large numbers of low quality inbound links has been a negative for Google SERP placement

?


Google use a trust value to rank the domains from which links come to a domain. I don't see what that says about the spamminess of the domain receiving the links. Good quality websites receive a lot of links and what I'm saying here is that it's important when garnering links to ensure that those links are from healthy domains. Inbound links still appear to be one of the most important ranking metrics but there are domains from which one wouldn't wish to be fostering inbound links.


You claimed that bad sites don't have lots of incoming links. Then you wrote about Google's counter-measures against bad sites with lots of incoming links....


You're reading in what's not there. Yes, I noted that Google have countermeasures to address sites being more highly ranked than they should be because, whilst they have a lot of inbound links, the inbound links are of low quality (from low trust domains).

But none of that presupposes any knowledge of "spam sites", nor of the particular instances you note of spam sites with lots of inbound links (i.e. the claimed far more than Wikipedia [per IP address]).

But it seems you don't have any examples so I'm not sure why you're persisting. Even if I did know about specific spam sites with lots of incoming links, I don't see how my knowing that is relevant to whether you can give any examples of sites with your claimed characteristics. The points are orthogonal. Your pre-knowledge of such sites is not in any way bound by my knowledge of such sites.

I daren't ask if you can answer the question again. But just suppose you could then a response would still be of interest.



