Detecting duplicate images with Python (iconfinder.com)
231 points by iconfinder on April 3, 2014 | 51 comments



We are currently using perceptual hashes (e.g. phash.org) to do hundreds of thousands of image comparisons per day.

As mentioned in another comment, you really have to test different hashing algorithms to find one that suits your needs best. In general though, I think it is in most cases not necessary to develop an algorithm from scratch :)

For us, the much more challenging part was/is to develop a system that can find similar images quickly. If you are interested in things like that have a look at VP/MVP data structures (e.g. http://pnylab.com/pny/papers/vptree/vptree/, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7...).
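To make the VP-tree idea concrete, here is a minimal Python sketch (my own illustration, not our production code; it assumes integer hashes compared by Hamming distance and a non-empty point list):

    import random

    def hamming(a, b):
        # Hamming distance between two equal-length hash integers
        return bin(a ^ b).count("1")

    class VPTree:
        def __init__(self, points, dist=hamming):
            self.dist = dist
            i = random.randrange(len(points))
            self.vp, rest = points[i], points[:i] + points[i + 1:]
            self.mu, self.inside, self.outside = 0, None, None
            if not rest:
                return
            ds = sorted(self.dist(self.vp, p) for p in rest)
            self.mu = ds[len(ds) // 2]  # median distance splits the space
            inner = [p for p in rest if self.dist(self.vp, p) < self.mu]
            outer = [p for p in rest if self.dist(self.vp, p) >= self.mu]
            self.inside = VPTree(inner, dist) if inner else None
            self.outside = VPTree(outer, dist) if outer else None

        def within(self, q, tau, out=None):
            # collect all points within distance tau of q
            out = [] if out is None else out
            d = self.dist(q, self.vp)
            if d <= tau:
                out.append(self.vp)
            # the triangle inequality lets us skip whole subtrees
            if self.inside and d - tau < self.mu:
                self.inside.within(q, tau, out)
            if self.outside and d + tau >= self.mu:
                self.outside.within(q, tau, out)
            return out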


Indeed, the storage and retrieval of similar images is the hardest part. I do not know of a single networked open-source storage solution for this. I really wish there were a project with the mindset of Redis, but for MVP trees. By the way, might it be possible to implement an MVP data structure in Redis as the project stands now? I can't think of any replication issues with this, apart from the fact that one would have to pre-define a metric space for every tree.

It could be a great extension to the Redis DSL.


Yes, you're right. We're not using SQL queries at the moment as that would be very inefficient; it was just an example for a small dataset.

I'm currently researching MVPs and reading up on VP-trees, BK-trees [1], GNAT [2] and HEngine [3] (a toy BK-tree sketch follows the links below). Do you have any advice?

[1] http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...

[2] http://www.vldb.org/conf/1995/P574.PDF

[3] https://www.cse.msu.edu/~alexliu/publications/HammingQuery/H...
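For what it's worth, a BK-tree is probably the quickest of those to get working. A toy sketch over Hamming distance (untested, for illustration only):

    def hamming(a, b):
        return bin(a ^ b).count("1")

    class BKTree:
        def __init__(self, item):
            self.item = item
            self.children = {}  # edge distance -> subtree

        def add(self, item):
            d = hamming(item, self.item)
            if d in self.children:
                self.children[d].add(item)
            else:
                self.children[d] = BKTree(item)

        def query(self, item, tau, out):
            d = hamming(item, self.item)
            if d <= tau:
                out.append(self.item)
            # only edges in [d - tau, d + tau] can lead to matches
            for edge in range(max(0, d - tau), d + tau + 1):
                if edge in self.children:
                    self.children[edge].query(item, tau, out)
            return out

Build it by adding every stored hash, then query(h, 2, []) returns everything within a Hamming distance of 2.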


I think you are on the right track there.

The thing is though, you won't have difficulties finding papers on those topics. However, you will probably not have any luck finding many concrete and practical implementations that you could look at.

So it's a long way from reading the papers to having something working.

If you find something, please let me know.


There's a list of CBIR engines on Wikipedia [1], and among those there are a few open-source ones. I didn't really have time to check them all, but while skimming through them imgSeek [2] caught my eye.

[1] http://en.wikipedia.org/wiki/List_of_CBIR_engines

[2] http://www.imgseek.net/isk-daemon


Here is my question for the OP. It seems like the image-shrinking step, coupled with the transformation to +/- values, loses so much information that the hash would suffer from a big false-positive problem. I would have loved to see some data on this based on their own dataset.

To give a concrete example, I noticed recently that a lot of app store icons within a category look pretty similar. See for example this:

https://twitter.com/acslater00/status/450127865682489344/pho...

All of those checkmarks certainly look to my naked eye like the binary diff transformation would give them nearly identical hashes. It seems like the rounded corners in 'Finish' would blur out, the texture in 'OmniFocus 2' would blur out, and the gradient in 'Clear' would look identical to a flat gradient on the right side of the checkmark.

Anyway, clever algorithm, but I'm curious how it works in practice on small icons.


First, as a reply to the false positive problem: https://news.ycombinator.com/item?id=7527982

Ideally we would like to move to a content-based image retrieval system where we would be able to search based on certain features that can be derived from the image itself (color, shape, and texture, for example) so we could fine-tune our results.

Yes, the presented example is a curious case. If we take the first three icons there and compare them based on shape and color, we can see that their shape is identical but the background is different. Based on this, should we consider them different or identical? You can't have too many variations on a simple shape like a checkmark or a Facebook logo, so which variations should you allow and which would you consider copying previous work?


dHash seems like a fast algorithm. So, one could search for potential collisions quickly with dHash, and then run a more complex and expensive algorithm on those matches to refine the results.
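Roughly (the expensive second stage is left as a hypothetical verifier):

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def shortlist(query_hash, hash_to_path, radius=4):
        # cheap dHash prefilter; feed the survivors to an expensive
        # verifier (SIFT matching, pixel-level diff, ...)
        return [path for h, path in hash_to_path.items()
                if hamming(query_hash, h) <= radius]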


While this algorithm could suffer from a large false-positive problem, that issue could also work to his advantage when implemented as a "find images similar to this one" feature, which was addressed later in the article.


I am more concerned here with false negatives due to cropping, resizing, rotating, etc.


Or you can use OpenCV with SIFT/SURF/ORB + KNN/RANSAC and have a very robust solution. Example [1], plus a rough sketch after the link.

OpenCV has awesome Python bindings (cv2), btw.

[1] http://stackoverflow.com/questions/2146542/opencv-surf-how-t...
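Something along these lines with a recent OpenCV build (ORB instead of SIFT/SURF since those were patent-encumbered in some builds; file names are placeholders, and it assumes both images yield enough keypoints):

    import cv2
    import numpy as np

    img1 = cv2.imread("icon_a.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("icon_b.png", cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # brute-force Hamming matching for ORB's binary descriptors
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)

    # RANSAC homography: the inlier count is a robust similarity score
    # (findHomography needs at least 4 matches)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    print("inliers:", int(mask.sum()) if mask is not None else 0)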


Using feature detectors and descriptors is only half of the solution. If you really want robust image recognition you need to use something like the vocabulary tree developed by Nistér [1][2].

[1] http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/sca...

[2] http://www.cc.gatech.edu/~phlosoft/files/schindler07cvpr2.pd...


True; however, OpenCV implements a bag of visual words, too [1] (a rough sketch follows the link).

[1] http://docs.opencv.org/modules/features2d/doc/object_categor...
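A rough sketch of that pipeline, assuming a build where SIFT is available (cv2.SIFT_create in recent releases) and placeholder training images:

    import cv2

    sift = cv2.SIFT_create()
    trainer = cv2.BOWKMeansTrainer(100)  # vocabulary of 100 visual words

    for path in ["a.png", "b.png", "c.png"]:  # placeholder training set
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            trainer.add(des)

    vocabulary = trainer.cluster()  # k-means over all training descriptors

    bow = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
    bow.setVocabulary(vocabulary)

    # each image becomes a histogram of visual words, comparable
    # with any vector distance (the "bag of words" representation)
    query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
    histogram = bow.compute(query, sift.detect(query, None))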


This is a bad approach for a couple of reasons: 1) The total measurable space/information in a resized, icon-size greyscale image is pretty small, so you run into a much higher likelihood of collisions/false positives.

2) It's not too hard to program a Haar wavelet [1] transform [2] (basically iterative scaled thresholding; sketched after the links below). This has worked well over at IQDB [3], where they do reverse image lookups on databases of several million images via a modified Haar wavelet.

You can't beat this algorithm for simplicity, though. Have you guys done any false positive checks with this algorithm? The saving grace might be that icons are fairly small and limited in detail/color.

[1] http://en.wikipedia.org/wiki/Haar_wavelet

[2] http://stackoverflow.com/questions/1034900/near-duplicate-im...

[3] http://iqdb.org/
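One level of the 2D Haar transform is only a few lines of numpy, e.g. this unnormalized sketch (my illustration, not IQDB's actual code):

    import numpy as np

    def haar_step(a):
        # one level of the (unnormalized) 2D Haar transform:
        # pairwise averages keep the coarse image, differences keep detail
        lo = (a[:, 0::2] + a[:, 1::2]) / 2.0
        hi = (a[:, 0::2] - a[:, 1::2]) / 2.0
        a = np.hstack([lo, hi])
        lo = (a[0::2, :] + a[1::2, :]) / 2.0
        hi = (a[0::2, :] - a[1::2, :]) / 2.0
        return np.vstack([lo, hi])

    img = np.random.rand(128, 128)  # stand-in for a greyscale image
    out = haar_step(img)
    out[:64, :64] = haar_step(out[:64, :64])  # recurse on the coarse quadrant

As I understand it, signatures in that style then keep just the positions and signs of the largest-magnitude coefficients of the fully decomposed image.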


I feel like this is doing something that was (probably) already done better, but I enjoyed the article anyway. It nicely summarized the what, why, and how.


Any idea of what might be a better approach?


There is no universal solution; there are a few algorithms like SIFT [1], SURF [2], Windsurf [3], dHash (the one presented here), aHash [4], and pHash [5], each concentrating on a particular feature of the image, and each with its strong points and weak points.

[1] http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...

[2] http://en.wikipedia.org/wiki/SURF

[3] http://www-db.deis.unibo.it/research/papers/IWOSS99.ABP.pdf

[4] http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...

[5] http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...


Many of the steps are similar to ideas used in the scale-invariant feature transform [1]. See also the original paper [2] and an early use for automatic image stitching [3].

[1] http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...

[2] http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf

[3] http://www.cs.bath.ac.uk/brown/papers/ijcv2007.pdf


A comment in the article mentions libpuzzle[1], which looks interesting. A Python binding also exists: PyPuzzle[2].

[1] http://www.pureftpd.org/project/libpuzzle

[2] https://github.com/ArchangelSDY/PyPuzzle


Googling "perceptual hash" yields http://www.phash.org/


If the hash were really looked up by value, like in the example SQL:

   SELECT pk, hash, file_path FROM image_hashes
       WHERE hash = '4c8e3366c275650f';
This is screaming to be stored in a bloom filter.
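i.e. a quick membership check in front of the database. A toy Bloom filter sketch (the size and hash-count parameters are arbitrary and would need tuning):

    import hashlib

    class BloomFilter:
        # probabilistic set membership: no false negatives,
        # occasional false positives, which is fine for a prefilter
        def __init__(self, size=1 << 20, hashes=4):
            self.size = size
            self.hashes = hashes
            self.bits = bytearray(size // 8 + 1)

        def _positions(self, key):
            for i in range(self.hashes):
                h = hashlib.sha1(b"%d:%s" % (i, key.encode())).hexdigest()
                yield int(h, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    seen = BloomFilter()
    seen.add("4c8e3366c275650f")
    print("4c8e3366c275650f" in seen)  # True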


Several years ago I tried matching images with histogram comparisons. I realized it wasn't going to be a perfect match, but it actually had a pretty high degree of success on my data set (game tiles).


I can't work out why the width would need to be 1px bigger. I don't see the explanation in #3. Also, the array in #3 shows a 9x9 grid of 81 values when there are only 72 pixels?


If you only have 8 samples per row, then you have 7 adjacent pairs. If you use a bit to represent each difference, you'll only have 7 bits. If you want 8 bits, you need 8 differences, so 9 samples.
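In code, following the article's scheme (a sketch, not the article's exact implementation; Image.ANTIALIAS is called Image.LANCZOS in newer Pillow):

    from PIL import Image

    def dhash(path, hash_size=8):
        # hash_size + 1 columns give hash_size left/right
        # comparisons per row
        img = Image.open(path).convert("L").resize(
            (hash_size + 1, hash_size), Image.ANTIALIAS)
        pixels = list(img.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (left < right)
        return bits  # 8 rows x 8 differences = a 64-bit hash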


That makes sense, but why does the example have a 9 by 9 grid? Shouldn't it be 9 columns but only 8 rows?


Yes, you are right, it's a mistake in the text. I'll correct it.


Great write-up! We did something very similar when trying to find duplicate product images for a consumer review site we were working on. Our implementation desaturated the image, broke it into a fixed number of tiles, and generated a histogram of each tile's values. Then we put a threshold on the histogram values and had each value in the tile represent a bit. Combine the bits, and we had a hash to store in the DB. Our hashes were a bit larger, so images within a certain Hamming distance were flagged, rather than just looking for exact hash matches. It took quite a bit of tuning for us to get good results, but it seemed to work pretty well. Do you see many false positives with such a small processed image size (the 9x8 one, I mean)?
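For anyone who wants to try it, a rough reconstruction of the scheme described above, simplified to one bit per tile taken from the tile's mean brightness (the original used per-tile histograms, so treat this as an approximation):

    from PIL import Image

    def tile_hash(path, tiles=8):
        img = Image.open(path).convert("L").resize((tiles * 8, tiles * 8))
        step = img.size[0] // tiles
        overall = sum(img.getdata()) / float(img.size[0] * img.size[1])
        bits = 0
        for ty in range(tiles):
            for tx in range(tiles):
                tile = img.crop((tx * step, ty * step,
                                 (tx + 1) * step, (ty + 1) * step))
                mean = sum(tile.getdata()) / float(step * step)
                bits = (bits << 1) | (mean > overall)
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # flag near-duplicates within a Hamming radius, not just exact matches
    print(hamming(tile_hash("a.png"), tile_hash("b.png")) <= 10)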


In practice we are using an image size of 17x16, which results in a hash size of 256 bits, and currently it seems to work pretty well. I ran the algorithm over the whole dataset (330,000+ icons) and I would say that of all the duplicate matches, about 1% were false positives.

Also, we will be integrating this into the review process for an icon set, where we also do a manual quality check that shows possible matches for what is currently being uploaded, so skimming over one or two false positives isn't such a big deal; we were more interested in the speed of the algorithm.


That's pretty impressive performance given the hash size and speed. Thanks for sharing!


What about images that are mirror-flipped? Unless it is a symmetrical icon, that particular algorithm will consider the two highly different.


An idea: this would require more storage, but would it be possible to hash an image and take snapshots of the hashing algorithm's state as it processes the image, say after each block of hashing (hash functions digest a block, such as 64 bytes, at a time)? Then simply compare the lists of snapshots between two images and come up with a statistical threshold for a "similar image". In the case of the two cats the images only differ in the nose, so the first half of the image up to the nose would produce the same list of snapshots.

You could also hash forwards, backwards and starting at several random midpoints to prevent someone from simply changing the first block to throw off the hashing algorithm.
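A sketch of the snapshot idea with hashlib (note the caveat it exposes: a running digest only matches while the entire prefix matches, so this finds files with a common prefix rather than arbitrary local similarity):

    import hashlib

    def hash_snapshots(path, block_size=64):
        # digest the file block by block, recording the running
        # digest after every block
        h = hashlib.md5()
        snaps = []
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                h.update(block)
                snaps.append(h.copy().hexdigest())  # non-destructive snapshot
        return snaps

    def similarity(snaps_a, snaps_b):
        # two running digests only agree while the whole prefix agrees,
        # so this effectively measures the common prefix length
        shared = sum(1 for x, y in zip(snaps_a, snaps_b) if x == y)
        return shared / float(max(len(snaps_a), len(snaps_b)) or 1)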


This is a very good article on image hashing. Thanks.


Will the PIL import pick up WebP images as long as libwebp-dev is installed? I'm a WebP kinda guy and I love fuzzy math achievements like this, especially when I don't have to switch to Windows[1] to take advantage of it.

[1] http://www.visipics.info


I wonder if this algorithm would work if someone uploaded a rotated version of a previously uploaded icon. As it only compares row-adjacent values, if the rotation was 90 or 270 degrees I'm not sure it'd detect the duplicate.
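One cheap workaround would be to hash every rotated/flipped variant and index all of them; dhash_image here stands for any perceptual hash that accepts a PIL image (a hypothetical helper):

    from PIL import Image

    # Image.ROTATE_90 etc. live under Image.Transpose in newer Pillow
    TRANSPOSES = [Image.ROTATE_90, Image.ROTATE_180, Image.ROTATE_270,
                  Image.FLIP_LEFT_RIGHT, Image.FLIP_TOP_BOTTOM]

    def variant_hashes(img, dhash_image):
        # index all of these (or just min(...)) so a rotated or mirrored
        # resubmission still collides with the original
        hashes = {dhash_image(img)}
        for t in TRANSPOSES:
            hashes.add(dhash_image(img.transpose(t)))
        return hashes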


Be careful not to get sued like that guy who posted an article about detecting songs and got sued by Shazam, even though everybody knows about FFT (I've been doing something like that for my BSc).


Any idea if that article is still available? Or do you have any good resources for implementing FFT? I'm working on a project to compare sound clips for similarity, but I haven't grasped an effective way to accomplish that yet.


I just implemented the math from a textbook. Obviously, you will need to break your song down into smaller chunks, then compare the frequencies of those chunks. When sufficiently many chunks are "close" (you'll need to define some measure of similarity), the songs are "the same". P.S. That article has been taken down by the authors after repeated threats from Shazam.
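A minimal numpy version of that recipe (the chunk size and the "closeness" measure, relative L2 distance here, are arbitrary choices you would tune):

    import numpy as np

    CHUNK = 4096  # samples per chunk; tune for your material

    def chunk_spectra(samples):
        # magnitude spectrum of each fixed-size chunk
        n = len(samples) // CHUNK
        return [np.abs(np.fft.rfft(samples[i * CHUNK:(i + 1) * CHUNK]))
                for i in range(n)]

    def fraction_similar(a, b, tol=0.25):
        # a chunk pair counts as "close" if the relative L2 distance
        # between their magnitude spectra is under tol
        hits = []
        for sa, sb in zip(chunk_spectra(a), chunk_spectra(b)):
            denom = np.linalg.norm(sa) + np.linalg.norm(sb) + 1e-9
            hits.append(np.linalg.norm(sa - sb) / denom < tol)
        return float(np.mean(hits)) if hits else 0.0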


FFTW


Oh cool...thanks jmmcd.


For Rubyists, check out the phashion gem, which is a light wrapper around pHash

https://github.com/westonplatter/phashion


I always wondered how partial duplicates on images were found. This wasn't as complicated as I was expecting. I assume some similar techniques are used for reverse image searching?


JPEG already has the low-resolution information stored in an easily retrievable way. You could use that directly; no need to do the transforms. It would be a lot faster.


Only if it's progressively encoded (though the decoder can still do fast 1/2, 1/4, and 1/8 reductions). Also, low-resolution data isn't the same as what you get from a scaling algorithm like ANTIALIAS (though I don't know what that does, probably something like Lanczos).

Plus, you still have to handle other formats like PNG and get the same output for the same image.


Not really. JPEG encodes in 8x8 blocks, so after the DCT at the start of each block, the first component (the DC coefficient) is always the mean value of the 64 elements. Therefore, if your resizing target is 8x smaller or more, you can use the mean value directly without doing the expensive inverse DCT and the rest of the decoding process.


Well, that's what I was hinting at with the fast 1/8 reduction, so yes, you can speed up the process by "decoding" a smaller version of the JPEG image.

This is just an optimization to add later, though. You might be able to quickly scale your JPEG down, but you can't skip the rest of the transforms, because you still need a deterministic fingerprint of the image. And in the article's use case, the file size is so small that you shouldn't be dominated by decoding time (that test JPEG from the site decodes in 12ms on my laptop with libjpeg at full size).
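For reference, Pillow exposes exactly that shortcut through draft(), which hints the JPEG decoder to do the cheap DCT-domain reduction before the normal resize (a sketch; the path is a placeholder and draft() only applies to JPEG files):

    from PIL import Image

    img = Image.open("photo.jpg")  # placeholder path
    # draft() is a decoder hint: for JPEGs it enables the fast 1/2, 1/4,
    # 1/8 reductions before any pixels are fully decoded
    img.draft("L", (img.size[0] // 8, img.size[1] // 8))
    small = img.resize((9, 8), Image.ANTIALIAS)  # then the usual dHash resize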


Although slower than this solution, I like this program: http://ssdeep.sourceforge.net/


I've done a fair amount of work in image processing and computer vision, but haven't read much into hashing of images. Interesting article.


Node implementation -> https://github.com/rompetoto/dhash


This might be a stupid question - but what if you have two icons, one with the cat with a black nose, one with the cat with a gray nose?


I had the same thought. I think they are going to use this to flag similar icons (and to show recommended images), so most likely it'll be reviewed by a human.

Edit:

From the last paragraph:

> We will be using the algorithm in the future on Iconfinder to warn us if a submitted icon already exists in our collection but it can also have other practical uses. For example, because images with similar features have the hamming distance of their hashes close, it can also be used as a basic recommendation system where the recommended images are within a certain hashing distance of the current image.


The title should be: how to detect a duplicate image.

No need to say "in Java" or "in Python"; it can be done in any language.



