Fuzzy hashing and a database? Break the content into parts and calculate a fuzzy hash for each part, then store those hashes in a database. When new content comes in, hash it the same way and compare against the existing entries. If it's too similar to something already stored, reject it. Doesn't get you around spam, though.
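Here's a rough sketch of that idea, just to make it concrete. It isn't a real fuzzy hash like ssdeep: it hashes overlapping word windows and compares the hash sets with Jaccard similarity, which is a simpler stand-in. The names (`chunk_hashes`, `SubmissionStore`), the window size of 5, and the 0.6 threshold are all made up for illustration, and a plain in-memory list stands in for the database.

```python
import hashlib


def chunk_hashes(text, size=5):
    """Split text into overlapping windows of `size` words and hash each one."""
    words = text.lower().split()
    if len(words) <= size:
        # Short content: treat the whole thing as a single part.
        windows = [" ".join(words)]
    else:
        windows = [
            " ".join(words[i:i + size])
            for i in range(len(words) - size + 1)
        ]
    return {hashlib.sha1(w.encode("utf-8")).hexdigest() for w in windows}


def similarity(a, b):
    """Jaccard similarity between two sets of hashes (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class SubmissionStore:
    """Hypothetical stand-in for the database of previously seen content."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed cutoff; tune for real data
        self.entries = []           # one set of part-hashes per accepted item

    def submit(self, text):
        """Return True and store the content, or False if it is too similar."""
        hashes = chunk_hashes(text)
        for existing in self.entries:
            if similarity(hashes, existing) >= self.threshold:
                return False
        self.entries.append(hashes)
        return True


store = SubmissionStore()
original = "the quick brown fox jumps over the lazy dog near the river bank"
print(store.submit(original))             # first time seen
print(store.submit(original + " today"))  # near-duplicate of the first
```

Comparing against every stored entry is linear in the number of items, so a real setup would probably want an index on the individual part hashes instead of a full scan.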