
Given two different pages, the odds that they share a checksum should be VERY small. But getting at least one collision somewhere among billions of pages is fairly or very likely, depending on the size of the checksum (the birthday paradox). So it is likely to be important to only compare within domains, or to partition the space by some other method.
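For scale, a back-of-the-envelope sketch using the standard birthday approximation (my own illustration, not from the comment above): a 32-bit checksum is essentially guaranteed to collide across billions of pages, while a 128-bit hash effectively never does.

    import math

    def collision_probability(n_pages: int, checksum_bits: int) -> float:
        """Birthday approximation: P(collision) ~ 1 - exp(-n^2 / (2 * 2^b))."""
        exponent = -(float(n_pages) ** 2) / (2.0 * 2.0 ** checksum_bits)
        # expm1 preserves precision when the probability is tiny
        return -math.expm1(exponent)

    print(collision_probability(4_000_000_000, 32))   # ~1.0: certain collision
    print(collision_probability(4_000_000_000, 128))  # ~2.4e-20: effectively never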

Also, checksums alone don't help when pages you want to treat as the same aren't byte-for-byte identical, because they embed the request time or some other unique element.
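One common workaround is to normalize the page before checksumming, stripping the volatile parts. A rough sketch, where the patterns are purely illustrative guesses at what the "unique elements" might look like (real pages would need their own rules):

    import hashlib
    import re

    # Hypothetical patterns for volatile content; adjust per site.
    VOLATILE = [
        re.compile(rb'<!--.*?-->', re.S),                             # comments often hold render times
        re.compile(rb'\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b'),   # ISO timestamps
        re.compile(rb'(csrf|nonce|session)[^"\']*["\'][^"\']*["\']', re.I),
    ]

    def normalized_checksum(body: bytes) -> str:
        """Strip volatile spans, then hash what remains."""
        for pattern in VOLATILE:
            body = pattern.sub(b'', body)
        return hashlib.sha256(body).hexdigest()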

Edit: Added the sentence about comparing within domains.



You could use an edit distance algorithm instead of checksums, although that would be really time intensive.
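For reference, here's the classic dynamic-programming form of Levenshtein edit distance; its O(n*m) cost per pair of pages is exactly why it gets time intensive at crawl scale:

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character edits turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3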



