Hacker News new | past | comments | ask | show | jobs | submit login
The absolute bare minimum every programmer should know about regular expressions (immike.net)
167 points by rayvega on July 20, 2010 | hide | past | favorite | 55 comments



Slight terminological quibble:

Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text.

"Parse" is a questionable word choice. A proper Regular Expression can only describe patterns found in "regular languages", though modern backtracking regex engines have a bit more power. Parsing implies a parser, which means matching a grammar somewhat more sophisticated than a regex engine can handle.


I don't agree, technically. A parser is a program which analyses text and determines its grammatical structure according to a formal grammar, and tests its membership in the language described by the grammar.

Regular languages can be described by regular grammars, and regular expressions are a notation for regular grammars. So you can indeed parse text with reference to a regular expression.

Most people would call a parser which only used strictly regular expressions a lexical analyser, lexer or tokenizer, but this is only a convention to simplify language definition and parser implementation when using simple techniques like LALR or LL.

But other uses may not need that level of sophistication. For example, a packet analyser in a router might use a dynamically updated state machine (a DFA), built from multiple regular expressions, for efficient dispatching of packets based on rules. I don't think it would be wrong to describe that component of the router as a packet parser.

For example, this (to me) blindingly obvious patent seems to be using a modified DFA to make routing decisions, and calls it a parser:

http://www.freepatentsonline.com/7623468.html


Technically, you're probably right. There's no hard-and-fast definition of "to parse" that requires the grammar to be particularly sophisticated. It is still not the best word choice because "parse" has a particular connotation, though.


Agreed. I cringed when I read that line. Parsers can do things that regular expressions can't (paren matching, for example). Yet I've seen people try using regular expressions when a parser would be a better choice.

There seems to be this emerging pattern in my career where I stumble across all the people in the world who have decided to try and validate HTML using regular expressions.


> try and validate HTML using regular expressions

http://stackoverflow.com/questions/1732348/regex-match-open-...


OK. I take it back. Any problems that I have ever had in my life have now been put into perspective.

That poor bastard.


I don't think there is anything in "parse" that implies relation to a specific type-N grammar, why do you think so? (the oxford american heritage dictionary seem to agree, but I'm not a native speaker)


Regular expressions in modern programming languages and the regular expressions you learned about in your automata theory class are not the same thing. The back references present in the regular expressions discussed in the article make the entire system capable of handling even NP-complete problems. So yes, regular expressions (in the more common use of the term) can "parse".


The back references present in the regular expressions discussed in the article make the entire system capable of handling even NP-complete problems.

Really? So, we can parse HTML with regular expressions nowadays?

I think a more appropriate word choice would be "recognize" instead of "parse", with regard to the OP of the thread.


I did allude to that, but regular expressions are still not quite up to the tasks you would use a parser for.


I can only hope this challenge inspires a spiteful hacker to create a full grammar parser in a regular expression. It won't be me, but I love how these types of comments can sometimes spawn the most interesting toy projects. :)


You may enjoy this http://www.a-k-r.org/abnf/


Considering he listed just about everything I can remember about regexes without google, and nothing more, I believe I'm duty bound to upvote.

My only quibble: I'd promote the character classes (especially \s and maybe \S) into this list from the "expert" category. Here's part2 (or what I call: stuff I can google later, if I need it), if anyone wants to keep reading: http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you...


(Off topic, but that same character you randomly dump into irc messages appeared at the beginning of this comment)


That appears to be a backspace.


The title makes me wonder... I was an okay programmer before I learned about regexes. When the need arose, I learned how to use them easily. Should I have learned them before that?

Many programmers seem to think that they possess just the right amount of knowledge. Someone who knows Unix utilities but not algorithm design will argue that Unix utilities are vital, but algorithm design is a random obscure topic; someone with the opposite skillset will defend it just as eloquently. I think programmers have no obligation to know anything. If you can do the job, I don't care how much you use Google.

I'm spoiled enough to think that boring things shouldn't be deliberately memorized. It's enough for me to know the mathematical concept of a regular language, which lets me recognize situations where regexes would help. For concrete syntax we always have cheatsheets. This attitude has its advantages: for example, I never needed to be explained why parsing XML with regexes is a bad idea.


I agree that boring things shouldn't be memorized, and that being familiar with the concept of a regular language helps identify where a regular expression could help.

Two points:

+ A lot of programmers aren't familiar with the concept of a regular language, especially those that are hobbyists, or who went to shitty schools. Sad but true.

+ Regular expressions are useful commonly enough that I would be very surprised if an okay programmer with any sort of experience under their belt couldn't remember at least [] . * ? + .

I definitely agree with your main point. There is, however, a weird subjective layer of knowledge that tends to be present in developers of any experience. That doesn't mean that I would look down on an individual programmer who claimed experience that didn't know how to make a regex character class off the top of their head, but I would be a bit surprised.


I agree. When you're planning to invade Iraq, you need people who shoot well, and you need people who have been to Iraq before. They don't have to be the same people, though.

We had a great experience at one of our contract jobs: we found a smart math student with no webdev experience, and he did most of the programming while I constantly consulted him on "platform issues" and jumped in to code any tricky parts he had problems with. He became a competent webdev in about two months - way faster than it took me to learn this stuff by myself - then I handed the project over to him and left. I wonder why other employers don't do this more: mix experienced programmers with smart newbies who come from hard-science backgrounds. Seems to be a winning combination.


It's very disappointing to find the author didn't mention greedy matchers. Consider the difference:

  >> s = "hello world, I love you world"
  >> s.sub(/(.*?)world/, '\1universe')
  => "hello universe, I love you world"
  >> s.sub(/(.*)world/, '\1universe')
  => "hello world, I love you universe"


Those are in the next article in the series, which is linked from the bottom of this one: http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you....


I prefer fishbowl's law

   Every regexp that you apply to a particular block of text reduces 
   the applicability of regular expressions by an order of magnitude.
- http://fishbowl.pastiche.org/2003/08/18/beware_regular_expre...

The minimum any programmer needs to know about regexps is when they are applicable, and more importantly, when they are not.


Seeing how im advocating that everyone should know when to be using parsing tools rather than cobbling together a regexp cludge, here are some links to some relevant tools in a variety of languages. Note that these are not parser generator tools, but parser combinator libraries:

    * Python: 
        PyParsing - http://pyparsing.wikispaces.com/ - this   
        is probably the most well respected parser library 
        for python 
        
        picoparse - http://github.com/brehaut/picoparse - My 
        parsec inspired parser combinator library. It is 
        significantly more light weight than pyparsing, and
        I would recommend pyparsing in its place for more 
        heavy lifting parsers.

    * Ruby:
        RParsec - http://rubyforge.org/projects/rparsec/ - A 
        ruby implementation of Parsec. I've heard good 
        things about it.

    * Haskell: 
        Parsec - http://legacy.cs.uu.nl/daan/parsec.html - 
        This is an amazing library, worth anyones time. 
        http://book.realworldhaskell.org/read/using-parsec.html 
        http://www.haskell.org/haskellwiki/Parsec

    * F#:
        FParsec - http://www.quanttec.com/fparsec/ - This is 
        reason enough to learn F#; it is a fairly true port 
        of the haskell library and very nice to use. 

    * Scala: 
        Packrat Parsing - I believe this is part of the 
        standard lib. http://www.scala-lang.org/docu/files/api/scala/util/parsing/combinator/PackratParsers.html 

    * Clojure: 
        fnparse - http://github.com/joshua-choi/fnparse

    * Java: 
        jparsec - http://jparsec.codehaus.org/
[edits: formating]


I think this actually illustrates the other guy's point. If you want a parser, you need to go download code off the internet. If you want a Regex, you type something along the lines of new Regex().


http://research.microsoft.com/en-us/um/people/daan/parsec.ht... "Nowadays, Parsec is distributed with the standard Haskell libraries with most Haskell compilers (GHC, NHC, Hugs)."


Treetop for Ruby looks pretty cool too: http://treetop.rubyforge.org/


Your link advocates writing parsers in situations where regexes aren't powerful enough, which is quite sensible and even quite practical in some cases.

But there's a big reason regexes are overused compared to parsers--regex engines come stock in most programming languages, and parser generators don't. There's a big mismatch in complexity between overusing something the language gives you for free and building in a parser. Perl 6 deserves some credit for their "rules" system, which can actually take a grammar and parse from it.


Agreed, the impedance of parsers has traditionally been much higher than for regexps. However, a lot of languages now have some sort of parse combinator library which dramatically increases the ease of writing a parser.

I'm thinking specifically of the Parsec inspired family, which has implementations or derivations in many common languages, even C# and Java (despite being noticably more cumbersome in those cases there).

[I cannot speak for perl 6, but i trust you are correct]


It's quite a good guide but I had thought it was going to contain warnings like "Don't use regular expressions to parse XML" and then explain why.


In Python, you can get the regex parse tree to help you debug your regex expressions: http://stackoverflow.com/questions/101268/hidden-features-of...


I learned something new - disjunction is sometimes called "alternation":

http://www.britannica.com/EBchecked/topic/165607/disjunction


One thing everyone should know about regular expressions is that the things called "regular expressions" provided by libpcre (which, of course, stands for "Perl compatible regular expressions") aren't regular expressions at all. Since they support backtracking, they provide a more powerful parsing framework than regular expressions alone (which don't support backtracking). That's how the (in)famous regular expression that recognizes prime numbers can work, even though the language of prime numbers is not regular.


For some reason, I've never been able to get the knack of regular expressions. This is the best introduction to them that I've seen, though I wish I'd read it years ago. However, if you've tinkered with regex at all ever, which I suspect virtually everyone here has, you'll know this. (I know it, and I hate and avoid regular expressions and am not even a programmer.)


There a couple of good programs for testing regex-es live. My favourite is Regex Coach by the awesome Edi Weitz:

http://weitz.de/regex-coach/ - It's Windows only though

There are many others, but his one thought me quite a lot


vim can be a powerful tool for visualizing regular expressions as well. paste in the text and set incsearch and set hlsearch to match on partial expressions and highlight matches. this allows you to distinctly see the effects of a regex as you type it.

also, python's regex module can show you how it interprets a regular expression by passing the re.DEBUG flag to re.compile:

  >>> import re
  >>> re.compile('a*b+', re.DEBUG)
  max_repeat 0 65535
    literal 97
  max_repeat 1 65535
    literal 98
  <_sre.SRE_Pattern object at 0x100460230>


It's weird, because they've always just clicked for me. I mean, you have text and you compare it to a pattern. The regex is just a crazy shorthand for the pattern you're looking for. What pattern should you match? Well, first write down examples of what you want to match and see what they have in common.

I think it's easier to get if you match a pattern to some text by hand, taking a simple pattern like ^a+bc$ and trying to see if it matches:

Here are a few test cases:

"aaaaaaaaabbbc" ?

1) ^a+ means it starts with 1 or more 'a's, so those eat all my 'a's and I'm matching bbbcc against 'bc' now 2) b* means 0 or more 'b's; those eat up everything but the c at the end. 3) c$ matches exactly one c at the end, so yeah, we have a match. If there had been one more c (or anything else), the match would fail because it wouldn't hit the $ (which is end of string).

aaac ?

1) ^a+ eats all the 'a's again. 2) b* ... what b? It's "zero or more" so yes, we can skip it. 3) c$ Yes, there's one c at the end. We have a match.

bc ?

1) We need one* or more 'a's at the start. That's the difference between + and . Match fails immediately.

acbbc ?

1) ^a+ matches the a. 2) b means it's optional, so we don't fail... yet. But we don't match anything, either. 3) c$ That c matches, but we're not at the end of the string yet. There's still "bbc" left to match and we're out of pattern. Fail.

You can do something similar with the tools other people have linked in comments. This was just to give an example pattern and a few test cases to get you started.

It's weird, though. Even though this clicks for me, I don't feel like I get monads at all. Yeah, I've read the 3 monad laws (great... so something being a monad means that I can monad it, un-monad it and do something vaguely similar to composition on monads) and I can do some very simple Lisp which uses built-in stuff they say is built from monads (like printing). But I don't know how they work internally at all, or how they're fundamentally different from all the other Lisp statements ("causes side affects" isn't terribly specific... why is printing to the screen a side effect? Is setq a monad?).

In other words, there's probably some topic for every programmer that they really have to struggle with. As easily as regexes come to me, I struggle to understand other stuff.

I've been at it for quite a while, incidentally. But I'm not going to give up any time soon. So keep working on it.


I wrote some stuff about monads on HN a while ago that some people found helpful: http://news.ycombinator.com/item?id=247446


Thanks. I just read it.

It does help, but I think there's still a lot I don't know. I mean, 'bind' and 'return' still seem like the names are backwards to me. Somehow, I feel like 'return' should give me a value, rather than shoving a value into a monad. But that's probably because I'm used to C and its relatives.

Perhaps I simply need to read up more on FP itself. I have a huge bookshelf full of half-read books (and yes, plenty of fully-read books) but nowhere near enough free time...


Oh. Yeah. The names are poorly chosen, but we're stuck with them. :(


HN ate some stars in the above post and I didn't notice until now. Sorry about that. Look for the italics to see where the stars should be.


I suspect your eyes tend to glaze over because of the many strange symbols? Just relax - they are not actually hard, it looks more complicated than it is. That there used to be a whole O'Reilly book dedicated to them also made them look more complicated. But there are books for everything (probably even Notepad).


When writing regular expressions I always use this website to test them out.

http://www.rubular.com


I learned regular expressions when writing the documentation for my Sleep programming language. They never made sense to me until I had to organize my thoughts on them in a logical way. I'm still quite happy with how that chapter turned out:

http://sleep.dashnine.org/manual/regex.html


Here's a cool online tool my friend (HN user KrisJordan) made:

http://www.gethifi.com/tools/regex

It's super handy when learning Regular Expressions because it shows results as you type.


Obligatory emacs post: M-x re-builder is also pretty handy. Of course, then you're using emacs re syntax, which is a bit clunkier.


While we’re at it, could someone please explain why \[[^\[\]]*\] does not match things like [X], where X is a string of non-square-bracket characters? How would you fix it? Thanks!


In what language? That expression works fine for me when I test it?


Say in ed or sed.


    $ printf 'a[]b[c]d[ef][g[h]i\n' | sed 's/\[[^][]*]/-/g'
    a-b-d-[g-i
    $ ed
    a
    a[]b[c]d[ef][g[h]i
    .
    s/\[[^][]*]/-/g
    p
    a-b-d-[g-i
    q
    ?
    q
    $


I was surprised to find no mention of the issues with nesting. Every programmer should know about this limiation, even if they don't get into the pumping lemma and whatnot.


(2007)

kind of disappointed that there's 13 comments and no mention of jwz's quote.

in terms of actual thoughts, this is certainly an aptly named article.


Of course there's always the counterpart to jwz's quote:

> Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they have two problems.

[http://twitter.com/diveintomark/statuses/1249729494]


awesome.


1. Regular expressions cannot match recursive patterns.

2. No, they cannot! extended regexps are a nice hack, but not really met formal regular expression definition.

3. man grep


Did someone learn something on this site? Why it's on hacker news instead of newbies news.


Meh, no one here wants to here my spoiled brat opinion, so I apologize in advance:

It should also be included in the bare minimum:

Regex will drive you crazy. Not just writing the expression, but installing a library (especially if there is no handy installer) ect. Regex likes to punch you in the balls whenever possible. (I can say this because I've written many line of Regex and I do love it.)

.

.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -Jamie Zawinski




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: