Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text.
"Parse" is a questionable word choice. A proper Regular Expression can only describe patterns found in "regular languages", though modern backtracking regex engines have a bit more power. Parsing implies a parser, which means matching a grammar somewhat more sophisticated than a regex engine can handle.
I don't agree, technically. A parser is a program which analyses text and determines its grammatical structure according to a formal grammar, and tests its membership in the language described by the grammar.
Regular languages can be described by regular grammars, and regular expressions are a notation for regular grammars. So you can indeed parse text with reference to a regular expression.
Most people would call a parser which only used strictly regular expressions a lexical analyser, lexer or tokenizer, but this is only a convention to simplify language definition and parser implementation when using simple techniques like LALR or LL.
But other uses may not need that level of sophistication. For example, a packet analyser in a router might use a dynamically updated state machine (a DFA), built from multiple regular expressions, for efficient dispatching of packets based on rules. I don't think it would be wrong to describe that component of the router as a packet parser.
For example, this (to me) blindingly obvious patent seems to be using a modified DFA to make routing decisions, and calls it a parser:
Technically, you're probably right. There's no hard-and-fast definition of "to parse" that requires the grammar to be particularly sophisticated. It is still not the best word choice because "parse" has a particular connotation, though.
Agreed. I cringed when I read that line. Parsers can do things that regular expressions can't (paren matching, for example). Yet I've seen people try using regular expressions when a parser would be a better choice.
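For example (a quick sketch of my own, in Python): checking balanced parentheses requires counting nesting depth, which is beyond a true regular expression but trivial in a few lines of ordinary code.

# A pure regular expression can't verify arbitrarily deep nesting,
# because that requires counting. A tiny hand-rolled checker can:
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # closed a paren that was never opened
                return False
    return depth == 0          # every open paren was closed

print(balanced("(()(()))"))  # True
print(balanced("(()"))       # False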
There seems to be this emerging pattern in my career where I stumble across all the people in the world who have decided to try and validate HTML using regular expressions.
I don't think there is anything in "parse" that implies relation to a specific type-N grammar, why do you think so? (the oxford american heritage dictionary seem to agree, but I'm not a native speaker)
Regular expressions in modern programming languages and the regular expressions you learned about in your automata theory class are not the same thing. The back references present in the regular expressions discussed in the article make the entire system capable of handling even NP-complete problems. So yes, regular expressions (in the more common use of the term) can "parse".
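To make that concrete, here's a small illustration of my own (assuming Python's re module): a single backreference is enough to match the "word repeated twice" language, which no true regular expression can describe.

import re
# ^(.+)\1$ matches any non-empty string followed immediately by an exact
# copy of itself: the classic "ww" language, which is not regular.
doubled = re.compile(r"^(.+)\1$")
print(bool(doubled.match("abcabc")))  # True
print(bool(doubled.match("abcabd")))  # False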
I can only hope this challenge inspires a spiteful hacker to create a full grammar parser in a regular expression. It won't be me, but I love how these types of comments can sometimes spawn the most interesting toy projects. :)
Considering he listed just about everything I can remember about regexes without google, and nothing more, I believe I'm duty bound to upvote.
My only quibble: I'd promote the character classes (especially \s and maybe \S) into this list from the "expert" category. Here's part2 (or what I call: stuff I can google later, if I need it), if anyone wants to keep reading: http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you...
The title makes me wonder... I was an okay programmer before I learned about regexes. When the need arose, I learned how to use them easily. Should I have learned them before that?
Many programmers seem to think that they possess just the right amount of knowledge. Someone who knows Unix utilities but not algorithm design will argue that Unix utilities are vital, but algorithm design is a random obscure topic; someone with the opposite skillset will defend it just as eloquently. I think programmers have no obligation to know anything. If you can do the job, I don't care how much you use Google.
I'm spoiled enough to think that boring things shouldn't be deliberately memorized. It's enough for me to know the mathematical concept of a regular language, which lets me recognize situations where regexes would help. For concrete syntax we always have cheatsheets. This attitude has its advantages: for example, I never needed to have it explained to me why parsing XML with regexes is a bad idea.
I agree that boring things shouldn't be memorized, and that being familiar with the concept of a regular language helps identify where a regular expression could help.
Two points:
+ A lot of programmers aren't familiar with the concept of a regular language, especially those that are hobbyists, or who went to shitty schools. Sad but true.
+ Regular expressions are useful commonly enough that I would be very surprised if an okay programmer with any sort of experience under their belt couldn't remember at least [] . * ? + .
I definitely agree with your main point. There is, however, a weird subjective layer of knowledge that tends to be present in developers of any experience. That doesn't mean that I would look down on an individual programmer who claimed experience that didn't know how to make a regex character class off the top of their head, but I would be a bit surprised.
I agree. When you're planning to invade Iraq, you need people who shoot well, and you need people who have been to Iraq before. They don't have to be the same people, though.
We had a great experience at one of our contract jobs: we found a smart math student with no webdev experience, and he did most of the programming while I constantly consulted him on "platform issues" and jumped in to code any tricky parts he had problems with. He became a competent webdev in about two months - way faster than it took me to learn this stuff by myself - then I handed the project over to him and left. I wonder why other employers don't do this more: mix experienced programmers with smart newbies who come from hard-science backgrounds. Seems to be a winning combination.
It's very disappointing to find the author didn't mention greedy matchers. Consider the difference:
>> s = "hello world, I love you world"
>> s.sub(/(.*?)world/, '\1universe')
=> "hello universe, I love you world"
>> s.sub(/(.*)world/, '\1universe')
=> "hello world, I love you universe"
Seeing how I'm advocating that everyone should know when to use parsing tools rather than cobbling together a regexp kludge, here are some links to relevant tools in a variety of languages. Note that these are not parser generator tools, but parser combinator libraries:
* Python:
PyParsing - http://pyparsing.wikispaces.com/ - this is probably the most well-respected parser library for Python. (A minimal usage sketch follows this list.)
picoparse - http://github.com/brehaut/picoparse - my Parsec-inspired parser combinator library. It is significantly more lightweight than PyParsing; for heavier parsing work I would recommend PyParsing instead.
* Ruby:
RParsec - http://rubyforge.org/projects/rparsec/ - a Ruby implementation of Parsec. I've heard good things about it.
* Haskell:
Parsec - http://legacy.cs.uu.nl/daan/parsec.html - this is an amazing library, worth anyone's time.
http://book.realworldhaskell.org/read/using-parsec.html
http://www.haskell.org/haskellwiki/Parsec
* F#:
FParsec - http://www.quanttec.com/fparsec/ - this is reason enough to learn F#; it is a fairly true port of the Haskell library and very nice to use.
* Scala:
Packrat Parsing - I believe this is part of the standard library: http://www.scala-lang.org/docu/files/api/scala/util/parsing/combinator/PackratParsers.html
* Clojure:
fnparse - http://github.com/joshua-choi/fnparse
* Java:
jparsec - http://jparsec.codehaus.org/
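To give a flavour of what these look like in practice, here is roughly the classic PyParsing hello-world (a sketch from memory, so treat the exact API details as approximate):

from pyparsing import Word, alphas

# grammar: a word, a comma, another word, and an exclamation mark
greeting = Word(alphas) + "," + Word(alphas) + "!"
print(greeting.parseString("Hello, World!"))
# -> ['Hello', ',', 'World', '!']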
I think this actually illustrates the other guy's point. If you want a parser, you need to go download code off the internet. If you want a Regex, you type something along the lines of new Regex().
Your link advocates writing parsers in situations where regexes aren't powerful enough, which is quite sensible and even quite practical in some cases.
But there's a big reason regexes are overused compared to parsers: regex engines come stock in most programming languages, and parser generators don't. There's a big mismatch in effort between reaching for something the language gives you for free and bringing in a parser. Perl 6 deserves some credit for its "rules" system, which can actually take a grammar and parse input with it.
Agreed, the impedance of parsers has traditionally been much higher than for regexps. However, a lot of languages now have some sort of parser combinator library, which dramatically increases the ease of writing a parser.
I'm thinking specifically of the Parsec-inspired family, which has implementations or derivations in many common languages, even C# and Java (despite being noticeably more cumbersome in those languages).
[I cannot speak for Perl 6, but I trust you are correct]
One thing everyone should know about regular expressions is that the things called "regular expressions" provided by libpcre (which, of course, stands for "Perl compatible regular expressions") aren't regular expressions at all. Since they support features like backreferences, they can recognize languages that true regular expressions, in the formal sense, cannot. That's how the (in)famous regular expression that recognizes prime numbers can work, even though the language of prime numbers is not regular.
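For the curious, that prime trick looks roughly like this in Python (my sketch of the well-known pattern, which tests the number's unary representation):

import re

def is_prime(n):
    # A string of n ones matches this pattern exactly when n is 0, 1,
    # or composite (the backreference \1 finds a repeating block),
    # so n is prime precisely when the match fails.
    return re.match(r"^1?$|^(11+?)\1+$", "1" * n) is None

print([n for n in range(2, 20) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]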
For some reason, I've never been able to get the knack of regular expressions. This is the best introduction to them that I've seen, though I wish I'd read it years ago. However, if you've tinkered with regex at all, which I suspect virtually everyone here has, you'll know this. (I know it, and I hate and avoid regular expressions and am not even a programmer.)
Vim can be a powerful tool for visualizing regular expressions as well. Paste in the text, then :set incsearch and :set hlsearch to match on partial expressions and highlight matches. This lets you see the effect of a regex distinctly as you type it.
Also, Python's re module can show you how it interprets a regular expression if you pass the re.DEBUG flag to re.compile:
>>> import re
>>> re.compile('a*b+', re.DEBUG)
max_repeat 0 65535
  literal 97
max_repeat 1 65535
  literal 98
<_sre.SRE_Pattern object at 0x100460230>
It's weird, because they've always just clicked for me. I mean, you have text and you compare it to a pattern. The regex is just a crazy shorthand for the pattern you're looking for. What pattern should you match? Well, first write down examples of what you want to match and see what they have in common.
I think it's easier to get if you match a pattern to some text by hand, taking a simple pattern like ^a+b*c$ and trying to see if it matches:
Here are a few test cases:
"aaaaaaaaabbbc" ?
1) ^a+ means it starts with 1 or more 'a's, so that eats all my 'a's and I'm matching 'bbbc' against 'b*c$' now
2) b* means 0 or more 'b's; those eat up everything but the c at the end.
3) c$ matches exactly one c at the end, so yeah, we have a match. If there had been one more c (or anything else), the match would fail because it wouldn't hit the $ (which is end of string).
aaac ?
1) ^a+ eats all the 'a's again.
2) b* ... what b? It's "zero or more" so yes, we can skip it.
3) c$ Yes, there's one c at the end. We have a match.
bc ?
1) We need one or more 'a's at the start. That's the difference between + and *. Match fails immediately.
acbbc ?
1) ^a+ matches the a.
2) b* means it's optional, so we don't fail... yet. But we don't match anything, either.
3) c$ That c matches, but we're not at the end of the string yet. There's still "bbc" left to match and we're out of pattern. Fail.
You can do something similar with the tools other people have linked in comments. This was just to give an example pattern and a few test cases to get you started.
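If you'd rather have a machine check those cases than do it by hand, a quick Python sanity check with the same pattern (my addition, using the standard re module) confirms them:

import re

pattern = re.compile(r"^a+b*c$")
for s in ["aaaaaaaaabbbc", "aaac", "bc", "acbbc"]:
    print(s, "matches" if pattern.match(s) else "fails")
# aaaaaaaaabbbc matches
# aaac matches
# bc fails
# acbbc fails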
It's weird, though. Even though this clicks for me, I don't feel like I get monads at all. Yeah, I've read the 3 monad laws (great... so something being a monad means that I can monad it, un-monad it and do something vaguely similar to composition on monads) and I can do some very simple Lisp which uses built-in stuff they say is built from monads (like printing). But I don't know how they work internally at all, or how they're fundamentally different from all the other Lisp statements ("causes side effects" isn't terribly specific... why is printing to the screen a side effect? Is setq a monad?).
In other words, there's probably some topic for every programmer that they really have to struggle with. As easily as regexes come to me, I struggle to understand other stuff.
I've been at it for quite a while, incidentally. But I'm not going to give up any time soon. So keep working on it.
It does help, but I think there's still a lot I don't know. I mean, 'bind' and 'return' still seem like the names are backwards to me. Somehow, I feel like 'return' should give me a value, rather than shoving a value into a monad. But that's probably because I'm used to C and its relatives.
Perhaps I simply need to read up more on FP itself. I have a huge bookshelf full of half-read books (and yes, plenty of fully-read books) but nowhere near enough free time...
I suspect your eyes tend to glaze over because of the many strange symbols? Just relax: they are not actually hard; it only looks more complicated than it is. That there's a whole O'Reilly book dedicated to them also makes them look more complicated than they are. But there are books for everything (probably even Notepad).
I learned regular expressions when writing the documentation for my Sleep programming language. They never made sense to me until I had to organize my thoughts on them in a logical way. I'm still quite happy with how that chapter turned out:
While we’re at it, could someone please explain why \[[^\[\]]*\] does not match things like [X], where X is a string of non-square-bracket characters? How would you fix it? Thanks!
I was surprised to find no mention of the issues with nesting. Every programmer should know about this limitation, even if they don't get into the pumping lemma and whatnot.
Meh, no one here wants to hear my spoiled brat opinion, so I apologize in advance:
It should also be included in the bare minimum:
Regex will drive you crazy. Not just writing the expression, but installing a library (especially if there is no handy installer), etc. Regex likes to punch you in the balls whenever possible. (I can say this because I've written many lines of regex and I do love it.)
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-Jamie Zawinski