
>I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums

Interesting point. Q: Is it possible for two different pages to give the same checksum? (Asking for my own info; I do know what checksums are, in overview, from way back, but haven't checked :) them out much.
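Yes: a checksum maps arbitrarily long input to a fixed number of bits, so by the pigeonhole principle different pages can (and with enough pages must) share one. A toy sketch using a deliberately weak 16-bit checksum (the first two bytes of an MD5 digest, purely for demonstration) finds a collision within a few hundred random inputs:

    import hashlib
    import os

    def tiny_checksum(data: bytes) -> int:
        """A deliberately weak 16-bit 'checksum': first two bytes of MD5."""
        return int.from_bytes(hashlib.md5(data).digest()[:2], "big")

    seen = {}
    for i in range(100_000):
        page = os.urandom(32)  # stand-in for a page's contents
        c = tiny_checksum(page)
        if c in seen and seen[c] != page:
            print(f"collision after {i + 1} inputs: checksum {c:#06x}")
            break
        seen[c] = page

With only 2^16 possible values, the birthday bound predicts a collision after roughly 300 inputs. Real checksums use 128 bits or more, which makes collisions astronomically rarer, but never impossible.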



Given two specific pages, getting the same checksum should be VERY unlikely. But getting at least one collision somewhere among billions of pages is fairly to very likely, depending on the type of checksum (the birthday problem). So it is probably important to compare only within domains, or to divide the space up by some other method.
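To put rough numbers on that (a back-of-the-envelope sketch; the 4-billion-page figure is just an assumed order of magnitude): the probability of at least one collision among n items under a b-bit checksum is approximately 1 - exp(-n(n-1) / 2^(b+1)), the birthday bound.

    import math

    def collision_probability(n: int, bits: int) -> float:
        """Birthday-bound approximation: P(at least one collision)
        among n items hashed into 2**bits buckets."""
        expected_pairs = n * (n - 1) / 2 / 2**bits
        return -math.expm1(-expected_pairs)  # 1 - exp(-x), accurate for tiny x

    n = 4_000_000_000  # assumed order of magnitude of a large web index
    for bits in (32, 64, 128):
        print(f"{bits:>3}-bit checksum: P(collision) ~ {collision_probability(n, bits):.3g}")

That prints roughly 1.0 for 32 bits, about 0.35 for 64 bits, and around 2e-20 for 128 bits, which is why a short CRC is hopeless at this scale while a long cryptographic hash is effectively collision-free.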

Also, it may not help you at all, because pages you want to treat as the same often aren't actually identical: they include the request time or some other unique element.

Edit: Added a sentence about comparing within domains.
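A common way around that (a sketch; the patterns below are illustrative assumptions, not anything a real crawler is known to use) is to normalize the page before checksumming, stripping volatile parts like comments, timestamps, and session tokens so that pages differing only in those elements hash identically:

    import hashlib
    import re

    def content_fingerprint(html: str) -> str:
        """Checksum of a page after stripping volatile elements.
        The patterns below are illustrative; a real crawler would need
        site-specific rules or smarter boilerplate removal."""
        text = re.sub(r"<!--.*?-->", "", html, flags=re.S)              # comments
        text = re.sub(r"\b\d{4}-\d{2}-\d{2}[T ][\d:.]+Z?\b", "", text)  # timestamps
        text = re.sub(r"\b(sessionid|csrf_token)=[\w-]+", "", text)     # tokens
        text = re.sub(r"\s+", " ", text).strip()                        # whitespace
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    a = "<p>Hello</p> <!-- rendered 2024-01-01T12:00:00Z -->"
    b = "<p>Hello</p>  <!-- rendered 2024-06-30T08:15:42Z -->"
    assert content_fingerprint(a) == content_fingerprint(b)

The hard part is deciding what counts as "volatile": strip too aggressively and genuinely different pages collapse into one fingerprint.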


You could use an edit distance algorithm instead of checksums, although that would be really time intensive.
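For reference, a textbook sketch of that idea (plain dynamic-programming Levenshtein distance, not any particular search engine's method): it costs O(len(a) * len(b)) time per pair of documents, which is why it is far too slow to run across billions of pages; large-scale near-duplicate detection usually relies on sketches like shingling/MinHash or SimHash instead.

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn a into b.  O(len(a) * len(b)) time."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (or match)
                ))
            prev = cur
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3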



