>I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums
Interesting point. Q: Is it possible for two different pages to give the same checksum? (Asking for my own info; I do know what checksums are, in overview, from way back, but haven't checked :) them out much.
Any two specific pages getting the same checksum should be VERY unlikely. But getting at least one collision somewhere among billions of pages is fairly or very likely, depending on the size of the checksum (the birthday problem). So it is likely to be important to only compare within domains, or to divide the space by some other method.
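To put numbers on "depends on the type of checksum": here's a quick sketch of the standard birthday-bound approximation, assuming roughly 4 billion pages. The hash sizes are just illustrative choices (a 32-bit CRC vs. a 128-bit digest).

```python
import math

def collision_prob(n_items, hash_bits):
    """Birthday-bound approximation: P ≈ 1 - exp(-n² / 2^(bits+1))."""
    exponent = -(n_items ** 2) / float(2 ** (hash_bits + 1))
    # -expm1(x) computes 1 - exp(x) without losing precision near zero
    return -math.expm1(exponent)

n = 4_000_000_000  # roughly the scale of a web index
print(collision_prob(n, 32))   # 32-bit CRC: a collision is essentially certain
print(collision_prob(n, 128))  # 128-bit digest: effectively zero
```

So with a short checksum you will see collisions constantly at web scale, while a 128-bit hash makes an accidental collision negligible, which is why partitioning by domain mainly helps with the short ones.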
Also, a plain checksum often doesn't help at all, because pages you want to treat as the same aren't byte-for-byte identical: they embed the request time or some other per-request element.
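One common workaround is to normalize the page before hashing. A minimal sketch, assuming the only dynamic element is an ISO-style timestamp (real pages need more aggressive normalization, e.g. stripping session IDs or ad markup):

```python
import hashlib
import re

# Hypothetical normalization: strip ISO-style timestamps so two fetches
# of the "same" page produce the same fingerprint.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def page_fingerprint(html: str) -> str:
    normalized = TIMESTAMP_RE.sub("", html)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = "<p>Hello</p><small>Generated 2024-05-01 12:00:00</small>"
b = "<p>Hello</p><small>Generated 2024-05-01 12:00:05</small>"
print(page_fingerprint(a) == page_fingerprint(b))  # True
```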
Edit: Added sentence about comparing within domains.