Under this definition the original heuristic is H(3,1). We can then play around with a and b to see whether we get shorter results.
def findregex_lambda(winners, losers, a, b):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of candidate components, then pick from them to cover winners.
    # On each iteration, add the best component to 'cover'; finally disjoin them together.
    pool = candidate_components(winners, losers)
    cover = []
    while winners:
        best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
        cover.append(best)
        pool.remove(best)
        winners = winners - matches(best, winners)
    return '|'.join(cover)
>>> findregex_lambda(starwars, startrek, 3, 1)
' T|E.P| N'
>>> findregex_lambda(starwars, startrek, 3, 2)
' T|B| N| M'
Or, to automate this:
def best_H_heuristic(winners, losers):
    d = {(a, b): len(findregex_lambda(winners, losers, a, b))
         for a in range(0, 4) for b in range(0, 4)}
    return min(d, key=d.get)
>>> best_H_heuristic(starwars, startrek)
(3, 1)
Looks like H(3,1) is pretty good for this case. What about the NFL teams?
Not the best heuristic there. H(3,1) wins or ties for the boys/girls set, left/right set and drugs/cities set, which just goes to show you that picking a heuristic off a gut guess isn't such a bad approach.
You could also explore heuristics of different forms:
1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log(n)) factor of the optimal regex.
2. Unless P = NP, set cover cannot be approximated better than the greedy algorithm does. In other words, any general solution you find (unless you're using some special insight into how regular expressions cover sets of strings) will be at best a constant-factor improvement over the greedy algorithm in the size of the regex produced.
That said, regexes (especially disjunctions of small regexes) are not arbitrary sets, so this problem is a special case of set cover and may well have efficient exact solutions.
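For reference, the greedy set-cover algorithm that approximation bound refers to looks roughly like this (a generic textbook sketch, not Norvig's code; the names `universe` and `subsets` are just illustrative):

def greedy_set_cover(universe, subsets):
    """Greedy set cover: repeatedly pick the subset covering the most
    uncovered elements. Guaranteed within a ln(n) factor of optimal."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not (uncovered & best):
            raise ValueError("subsets do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

# greedy_set_cover(range(1, 6), [{1, 2, 3}, {2, 4}, {3, 4, 5}])
# -> [{1, 2, 3}, {3, 4, 5}]

findregex is exactly this, with the candidate components as the subsets and the scoring function tweaked to also penalize long patterns.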
Yep, he originally threw out all names that appeared on both lists, while he's now arbitrarily assigning them to the winners' set. Not sure about the reasoning behind the change, but there it is. The quoted line computed new winning and losing sets that excluded all commonalities.
I actually really liked that line too; it helped flesh the article out into an impromptu tutorial on using sets in Python (which it still is, to a certain extent). I remember reading a while back that sets overload some sensible operators, but I really appreciated the reminder.
In other words, very nice of you to highlight it in a comment, as new readers wouldn't have seen it otherwise (as it turns out).
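For anyone who missed the line in question, the overloaded operators are just the standard set operations; a quick illustrative example (the names here are made up, not Norvig's actual data):

>>> winners = {'nixon', 'obama', 'reagan'}
>>> losers  = {'nixon', 'mccain', 'romney'}
>>> winners & losers        # intersection: names on both lists
{'nixon'}
>>> winners - losers        # difference: winners who never lost
{'obama', 'reagan'}
>>> winners | losers        # union (display order may vary)
{'obama', 'reagan', 'mccain', 'romney', 'nixon'}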
Most times when I read Norvig's Python I get the same feeling as when a tough riddle answer is given, except with Norvig you don't wonder whether anyone actually arrives at the answer the first time they hear it; you know he did.
I thought it was going to be meta-meta-regex golf, and couldn't imagine how that would be possible. But meta-regex golf is an interesting exercise, and is far more tractable. :)
So long as the unrelated arbitrary program is of finite length, and the regex golfer / arbitrary program composite is running on a machine with a finite amount of state, the halting problem can be solved by running each program for 2^(bits of state) steps or until a state is repeated.
So as long as you can show that a program that is not finite in length will not halt in a finite amount of time (I think this is the case, but am not actually positive), I think that a regex-finding-program-finding-regex should be possible to write, at least in principle (though not in practice).
Yes, the halting problem can be solved for any program which can only exist in a finite number of states, but most common definitions of a program include things such as Turing machines, which can hold an infinite number of states. Really it comes down to the semantics and bounds of the problem.
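As a toy illustration of the finite-state argument (everything here is hypothetical: `step` is an assumed single-step function and states are assumed hashable), deciding halting reduces to cycle detection:

def halts(step, initial_state):
    """Decide halting for a program whose state space is finite.

    `step` maps a state to the next state, or to None when the program
    halts. Because there are only finitely many states, the run either
    halts or revisits a state, at which point it loops forever.
    """
    seen = set()
    state = initial_state
    while state is not None:
        if state in seen:        # repeated state => infinite loop
            return False
        seen.add(state)
        state = step(state)
    return True                  # reached a halting state

# halts(lambda s: None if s == 7 else s + 1, 0)  -> True  (3-bit counter that stops)
# halts(lambda s: (s + 1) % 4, 0)                -> False (cycles forever)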
However I think the most interesting result is mine :P
Of course it is required (within the computational limits of your computer, of course)
If p(x) means 'x is an integer', you could use the regex /^1$/ since it matches integers only (in particular, the integer 1). But that's not sufficient; you need to match all integers.
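For what it's worth, matching all decimal integer strings is straightforward (a sketch; whether '+' signs or leading zeros count depends on your definition):

import re

integer = re.compile(r'^-?[0-9]+$')   # one or more digits, optional leading minus
# integer.match('-42')  -> match
# integer.match('3.14') -> None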
Almost any regex containing a common English phrase of a few words, like /of course/, would give you a very high accuracy rate on a large enough dataset and a very low false-positive rate. English has low entropy per character.
That strategy would return results for any phrase not containing "of course", which would be rather a lot of results in any document. I am envisioning a regex that plucks hashes from emails, for example: some parts English, some parts cryptographic "noise". Pick the noise out from the English.
(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)
My idea was more like: calculate the statistics of character n-grams in English (all of which exist in random noise too), and count the most unlikely occurrences until you hit some probabilistic threshold that indicates a well-thought-out decision boundary.
I had to recognize English for a cryptography project in college. This was my entire strategy, which the professor advised for rejecting non-English (it turns out recognizing non-English is easier than recognizing English), and works well:
1. Parse the full text of some long book available on Project Gutenberg. Record every trigram that occurs. Frequency is irrelevant; we just want to know whether a trigram occurs or not.
2. Go through your text, counting the number of trigrams that didn't exist in the sample text. If the number of strange trigrams exceeds a threshold (for my project I used a threshold of 3, but you can tune this), reject the text as non-English.
Given the constant finite threshold I used, that does amount to a regex, but I don't recommend trying to write it out explicitly.
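A rough sketch of that strategy (the corpus filename here is a placeholder, and the threshold of 3 is just the value mentioned above):

import re

def trigrams(text):
    "All letter trigrams in lowercased text, ignoring non-letters."
    letters = re.sub(r'[^a-z]', ' ', text.lower())
    return {letters[i:i+3] for i in range(len(letters) - 2)
            if ' ' not in letters[i:i+3]}

# 1. Build the set of trigrams that occur in a long English text.
with open('gutenberg.txt') as f:           # placeholder corpus file
    english_trigrams = trigrams(f.read())

# 2. Reject text containing too many trigrams never seen in the corpus.
def looks_english(text, threshold=3):
    strange = trigrams(text) - english_trigrams
    return len(strange) < threshold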
If you're after hashes you could just match a string of hexadecimal digits of a specific length (such as [0-9a-f]{16}); much easier to define than English text.
For detecting hashes? As in, SHA, or {key: value}?
If the former, you're possibly better just looking for high-entropy chunks of data of the right size. Of course, that'll match all kinds of encrypted data, but there may not be much you can do there.
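A quick sketch of the high-entropy idea (the window size and threshold here are arbitrary guesses, not tuned values):

import math
from collections import Counter

def shannon_entropy(s):
    "Shannon entropy of a string, in bits per character."
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def high_entropy_chunks(text, size=32, threshold=3.5):
    "Yield windows whose per-character entropy exceeds the threshold."
    for i in range(len(text) - size + 1):
        chunk = text[i:i + size]
        if shannon_entropy(chunk) > threshold:
            yield i, chunk

English prose sits around 2-3 bits per character in a window this small, while random hex pushes toward 4, so a threshold in between does a reasonable first cut.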
It means to use, as the key for max, 3*(number of things this pattern matches) - (number of characters in the pattern). The basic idea is that patterns that match more things are good, while patterns that are long are bad; the pattern with the maximal score under that expression either matches more things than the others, or is shorter, or both.
Why the "3"? Good question. It would be interesting to see what happens with other relative weights for the number of matches and the length.
Norvig did say why he chose 3:
" I may have chosen a bad tradeoff. (I arbitrarily decided that matching a winner is 3 times more important than spending a character (because a disjunction seems to take about 3 characters on average).)"
IMHO, sadly, that part is serving as a shibboleth.
The clue is how he explains all the simple parts of the script with comments, but leaves this part unexplained, making some of us feel left out of the cool club. Implying: if you don't grasp it immediately you must not be up to it.
Overall it's a great post, but as to the cryptic parts of it, it's disappointing to see this kind of thing from someone we all look up to. It would be more generous of spirit if he had written this in a readable, self-explanatory way, like the rest of the code... as Python should be written.
Sigh. Norvig has a posse, but I thought it had to be said.
>>> import this
...
"Readability counts."
...
"Sparse is better than dense."
...
"If the implementation is hard to explain, it's a bad idea."
A line of code that isn't as clear as it could be in an ipython notebook he probably dashed off in an hour is somehow a reflection on his generosity of spirit? I think you need to recheck your expectations.
A minor correction: the explanation for this heuristic was added later. (I first opened this in a tab, and then didn't read the tab till much later. I also wondered about the heuristic and didn't think to hit reload till I saw your reply.)
But you're right, there is certainly nothing ungenerous from Norvig here. Quite the opposite.
Trivially fixable by including start-of-string and end-of-string as tokens in the initial string breakdown: instead of analyzing {abc, ab, bc, a, b, c} as candidate regexes, you start out analyzing {^abc, abc$, ^ab, abc, bc$, ^a, ab, bc, c$, ^, a, b, c, $}; that would rapidly home in on c$ as an optimal solution.
That would fail against a 'must fail' target of abcabc, but then you fix that by extending the maximum allowable regex fragment length from 4 to 5 and it'll find ^abc$. More generally, extend the maximum allowable fragment length to the length of the longest 'must match' string plus 2, and it'll always succeed, even if it has to create a regex consisting of ^word1$|^word2$|^word3$...
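A sketch of what that tweak might look like (this `anchored_components` helper is hypothetical, not Norvig's code; it just adds ^- and $-anchored variants of each substring):

def anchored_components(word, maxlen=4):
    "All substrings of a word up to maxlen, plus ^- and $-anchored variants."
    subs = {word[i:j] for i in range(len(word))
            for j in range(i + 1, min(i + maxlen, len(word)) + 1)}
    anchored = {'^' + s for s in subs if word.startswith(s)}
    anchored |= {s + '$' for s in subs if word.endswith(s)}
    return subs | anchored | {'^', '$'}

# anchored_components('abc') -> {'^abc', 'abc$', '^ab', 'abc', 'bc$', '^a',
#                                'ab', 'bc', 'c$', '^', 'a', 'b', 'c', '$'}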
Given that, an interesting variation of the problem is: what's the easiest way to transform an expression to incorporate new data? After the next election, one can toss away the result and re-generate a new one from scratch. But is there an easier way to absorb additional terms, both inclusions and exclusions?
It's too bad he didn't try to tackle the optimal regexp problem and instead settled for approximations - it may be an NP-hard problem, but all the example solutions are short enough that these instances might still be tractable. Would've been nice to know for sure.
If you just want to play regex golf, this site appeared before Christmas and there was quite a discussion [1], although there are a few more levels now: http://regex.alf.nu/
I'm still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).
By "this json file", do you mean the ipython notebook itself? (It's saved as JSON behind-the-scenes. That's what you'll get if you click "download notebook".)
If so, yes, that's what a saved ipython notebook is. See: http://ipython.org/notebook.html for an overview of ipython notebooks.
What's the best way to install ipython on a Mac? Mine's from MacPorts, and I have ipython notebook, but when I try to use nbconvert I'm told I'm missing pandoc. A 'port install pandoc' wants to install yet another copy of gcc (I have two--including one from MacPorts for another ipython dependency), ghc, and about three dozen packages prefixed with 'hs'.
Is there any nice bundle of everything you need for ipython, for the Mac, that's both easy to install and uninstall?
Seeing as how you already have a functioning install of ipython and a setup you're otherwise happy with, this is probably the easiest route.
Also, just so it's clear what's going on, ipython notebooks and (especially) format conversion are optional (but very useful) parts of ipython. They're not part of the core functionality, so `pandoc` isn't a strict requirement to build ipython.
Otherwise, as long as you don't mind switching the Python interpreter, etc. as well, look into Anaconda or Canopy. I'm not 100% sure whether either comes with pandoc, but they're stand-alone scientific Python distributions that make it a good bit easier to get ipython, numpy, matplotlib, etc. set up.
Thanks for the pandoc link; I'll try that tonight. My ipython does already have notebook, numpy, matplotlib, and more set up; I didn't realize I was missing anything till I tried nbconvert.
But I've just recently installed all this, and have only kicked it around a little. Mostly, I wanted plotting.
Performance. For example, if you are syntax-highlighting a programming language with hundreds of keywords, then using a regexp like "(kwd1|kwd2|...|kwdn)" is not very efficient. An optimized regexp can do the same matching much faster.
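One common form of that optimization is factoring the alternation into a trie-shaped regex so the engine never re-scans shared prefixes (a rough sketch of the idea; tools like Perl's Regexp::Assemble do this more carefully):

import re

def trie_regex(words):
    "Build a prefix-factored (trie-shaped) regex matching exactly the given words."
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}                      # end-of-word marker

    def to_pattern(node):
        alts = []
        for ch, child in sorted(node.items()):
            if ch == '':
                alts.append('')            # this node can end a word
            else:
                alts.append(re.escape(ch) + to_pattern(child))
        if len(alts) == 1 and alts[0]:
            return alts[0]
        body = '|'.join(a for a in alts if a)
        opt = '?' if '' in alts else ''
        return '(?:' + body + ')' + opt if body else ''

    return to_pattern(trie)

# trie_regex(['for', 'foreach', 'format'])  ->  'for(?:each|mat)?'

You'd normally wrap the result in word boundaries (\b...\b) before using it for highlighting.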
You would be much better off just generating the DFA of the straightforward regex and minimizing that DFA. This is simpler and faster to generate and faster to execute. Furthermore, in a parser for a programming language you do not want to match some things in one list but not in another list; you want to match the things in this list and nothing else.
I know this isn't what you're asking, but I imagine in this case it's because that's one of the challenges of regex golf. Matching regex, as short as possible.
Judging by the amount of fawning here, this guy must be a HN celebrity. Interesting post nevertheless!
I can only hope, one day, I'd be writing and publishing joyful little hacks like this, to such general applause, instead of eking out a living. I have to say I'm a bit envious here!
Well done to the dude. An inspirational post, in many ways.