
OK, email me at f234567a360f54c1d31a70936f336bc679ba4f9f (sha1sum of an email address with no trailing carriage return or line feed[1]) and I'll believe you.

1. e.g.

    $ echo -n billg@microsoft.com | sha1sum
    2517e4726f81e16f65eb95cf6446ad35352f566e  -


In general, the search space even for email addresses is probably too large for me to crack in a few days, but in the context above, where the author's email was already available online (on her website, in spam databases, in leaked credential datasets, ...), there is hardly any difference between the address and its hash. In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.
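To make the "already available online" case concrete, here is a minimal Python sketch of the lookup. The candidate addresses are made-up placeholders and the helper names are mine, not anything from the thread; the only assumption is that the target was hashed exactly like `echo -n ... | sha1sum`, i.e. with no trailing newline.

```python
import hashlib

def sha1_hex(address: str) -> str:
    # Hash the raw bytes with no trailing newline, like `echo -n ... | sha1sum`.
    return hashlib.sha1(address.encode("utf-8")).hexdigest()

def find_address(target_digest: str, candidates):
    # Return the first candidate whose SHA-1 matches the target, else None.
    target = target_digest.strip().lower()
    for address in candidates:
        if sha1_hex(address) == target:
            return address
    return None

# Hypothetical candidate list, e.g. addresses scraped from a public page.
candidates = ["alice@example.com", "bob@example.org", "carol@example.net"]
target = sha1_hex("bob@example.org")  # stand-in for a published checksum
print(find_address(target, candidates))
```

The cost is one hash per candidate, so the size of the candidate list, not the size of the space of all possible addresses, decides how hard this is.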


So have the customer hash her list with a salt, and you hash your list with the same salt, and everyone goes home for dinner.
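A rough sketch of what that could look like, assuming both parties agree on a secret salt out of band. I am using HMAC-SHA-256 keyed with the salt rather than plain salt-plus-SHA-1, and lowercasing/trimming the addresses first; both choices, and all the names and values below, are my assumptions.

```python
import hashlib
import hmac

def salted_digest(address: str, salt: bytes) -> str:
    # Normalising case and whitespace is an assumption; both sides must do the same.
    normalised = address.strip().lower().encode("utf-8")
    return hmac.new(salt, normalised, hashlib.sha256).hexdigest()

def hash_list(addresses, salt: bytes) -> set:
    # Set of salted digests, suitable for handing to the other party.
    return {salted_digest(a, salt) for a in addresses}

shared_salt = b"agreed-out-of-band"  # hypothetical value
customer_hashes = hash_list(["Ann@example.com", "bob@example.org"], shared_salt)

our_list = ["ann@example.com", "dave@example.net"]
overlap = [a for a in our_list if salted_digest(a, shared_salt) in customer_hashes]
print(overlap)
```

This only helps while the salt stays between the two parties. Anyone who learns it can run the same dictionary lookup as against unsalted hashes.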


> In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.

I wonder what the odds are on a hash collision from another email address (including abusing + addressing) that genuinely belongs to another person (rather than just exists) and therefore the resulting hash does not uniquely identify a single person.


Very, very small.

The 'birthday attack'[0] article covers this pretty well, but if we take the output size of a SHA-1 hash as 160 bits, and assume its outputs are uniformly distributed[1], a brute-force approach (equivalent to an accidental, non-maliciously generated collision across all addresses ever) needs:

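    sqrt(2**160 * PI/2) ~= 1.5 x 10**24

hashes on average before the first collision occurs; a 50% probability of a collision is reached slightly sooner, at roughly sqrt(2**160 * 2*ln(2)) ~= 1.4 x 10**24 hashes. (if I understood/got the maths right)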

[0] https://en.wikipedia.org/wiki/Birthday_attack

[1] This is the intent of all hash functions, and I don't think there are any fundamental attributes of email addresses that would cause systematic bias in the output.
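For anyone who wants to check the numbers, a small Python sketch follows. It evaluates the formula above and, for comparison, the usual birthday approximation for a chosen collision probability. It assumes an ideal hash with uniformly distributed output; the function names are mine.

```python
import math

def expected_hashes_until_collision(bits: int) -> float:
    # Mean number of uniformly random draws before the first repeat: sqrt(pi/2 * N).
    return math.sqrt(math.pi / 2 * 2.0 ** bits)

def hashes_for_collision_probability(bits: int, p: float = 0.5) -> float:
    # Birthday approximation: n ~= sqrt(2 * N * ln(1 / (1 - p))).
    return math.sqrt(2 * 2.0 ** bits * math.log(1 / (1 - p)))

def collision_probability(n: float, bits: int) -> float:
    # p ~= 1 - exp(-n^2 / (2N)); expm1 keeps precision when p is tiny.
    return -math.expm1(-(n * n) / (2 * 2.0 ** bits))

print(f"{expected_hashes_until_collision(160):.3e}")   # about 1.5e24
print(f"{hashes_for_collision_probability(160):.3e}")  # about 1.4e24
print(f"{collision_probability(1e10, 160):.1e}")       # ten billion addresses
```

With ten billion distinct addresses, which is an arbitrary figure picked for illustration, the approximation puts the chance of any accidental collision on the order of 10^-29.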


To put things into perspective:

Approximately, 10^3 = 1000 ~= 1024 = 2^10, 10^2 = 100 ~= 128 = 2^7.

Assume you have 1 billion (10^9) computers, each computer can do 1 billion hashing operations per second. That is 10^18 operations per second combined.

Rounding up, one day has 1 million seconds (10^6), and one year has 1000 (10^3) days. So, we have 10^27 ~= 2^90 operations per year.

100 million years is 10^8 ~= 2^27. So, you have 2^117 operations in 100 million years. Geologically, there has been an extinction event [1] about every 100 million years (e.g. 66, 200 and 251 million years ago). So, an (unintentional) hash collision in more than 128 bits (assuming a good hash function with uniformly distributed output) is less likely than an event happening within the next second that kills 50% of the Earth's species.

[1] http://en.wikipedia.org/wiki/Extinction_event
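The back-of-the-envelope arithmetic above is easy to redo in Python. This sketch uses the same made-up fleet (a billion machines at a billion hashes per second) and compares the rounded-up day and year lengths with the real ones; the function name is mine.

```python
import math

def total_hashes(computers, hashes_per_second, seconds_per_year, years):
    # Total hashing operations performed by the whole fleet over the period.
    return computers * hashes_per_second * seconds_per_year * years

# Rounded-up figures: 10^6 seconds per day, 10^3 days per year.
rounded = total_hashes(10**9, 10**9, 10**6 * 10**3, 10**8)
# Calendar figures: 86,400 seconds per day, 365 days per year.
calendar = total_hashes(10**9, 10**9, 86_400 * 365, 10**8)

print(f"rounded up: 2^{math.log2(rounded):.1f}")
print(f"calendar:   2^{math.log2(calendar):.1f}")
```

The rounded inputs come out at about 2^116, in line with the 2^117 estimate once the powers of ten are approximated by powers of two. Real day and year lengths give about 2^111, so the rounding is generous to the attacker.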


I'm not willing to answer the challenge, but I definitely believe it could be done. If someone were willing to purchase a large list of harvested e-mail addresses and sha1sum them all, it is very likely a commonly used address would show up in it. Now, if the address you used above is actually some single-purpose address similar to what I use for all my online accounts, that would not work, but I believe that very few people use dynamic partial addresses in that way. Not even the simple ones that Gmail provides.


If by "dynamic partial addresses" you mean "plus addressing" then, yes, it does use that.


> The document then says that in 2011 he sent an email to “hundreds of atheists” with a link to his website and that I had reported him for violating GoDaddy’s policies against spam.

Give it to me in a list along with "hundreds" of red-herrings (let's say < 10000), and sure, no problem.



