Hacker News new | past | comments | ask | show | jobs | submit login

Almost any regex containing a common English phrase of a few words, like /of course/, would give you a very high accuracy rate on a large enough dataset and a very low false positive rate. English has low entropy per bit.



That strategy would return results for any phrase not containing "of course". Which would be rather a lot of results in any document. I am envisioning a regex that plucks hashes from emails, for example. Some parts English, some parts cryptographic "noise". Pick the noise from the English

(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)

My idea was more like calculate the statistics of character n-grams in english (all of which exist in random noise too), but count the most unlikely occurrences until you hit some probabilistic threshold that indicates some well thought out decision boundary.


I had to recognize English for a cryptography project in college. This was my entire strategy, which the professor advised for rejecting non-English (it turns out recognizing non-English is easier than recognizing English), and works well:

1. Parse the full text of some long book available on Project Gutenberg. Record every trigram that occurs. Frequency is irrelevant; we just want to know whether a trigram occurs or not.

2. Go through your text, counting the number of trigrams that didn't exist in the sample text. If the number of strange trigrams exceeds a threshold (for my project I used a threshold of 3, but you can tune this), reject the text as non-English.

Given the constant finite threshold I used, that does amount to a regex, but I don't recommend trying to write it out explicitly.


How effective was your algo?

Presumably it would false-negative texts with neologisms and spelling errors and such.

There must be trigrams which simply don't appear in dictionary-English? Do you have results posted somewhere.


If you're after hashes you could just match a string of hexadecimal digits of a specific length (such as [0-9a-f]{16} ); much easier to define than English text.


did not think of that!

might be interesting to see how that stacks up against the other idea thread.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: