Regular expressions in lexing and parsing (2011)

eesmith · on May 3, 2020

> It's not too hard to write the regexp (something like "[a-zA-Z_][a-zA-Z_0-9]*"), but really not much harder to write as a simple loop. The performance of the loop, though, will be much higher and will involve much less code under the covers.

My experience with using Ragel to generate the lexer is that it emits high-performance code from that sort of regexp.

A couple of weeks ago I looked at a hand-written lexer for integers. It was supposed to match "-?[0-9]+". It ended up allowing values like 00-123 because of a bug in the '-' detection code.

Finding that bug took some close reading.
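
To illustrate the kind of bug I mean, here is a minimal Python sketch. It is not the code I was reading, and the function names (`match_int_buggy`, `match_int_fixed`, `match_int_regex`) are made up for this example. The buggy loop is one plausible way a misplaced '-' check could let 00-123 through.

```python
import re

# The specification, stated as a regexp.
INT_RE = re.compile(r"-?[0-9]+")


def match_int_regex(text: str) -> bool:
    """Accept text only if the whole string matches -?[0-9]+."""
    return INT_RE.fullmatch(text) is not None


def match_int_buggy(text: str) -> bool:
    """Hypothetical hand-written matcher with a misplaced '-' check."""
    seen_digit = False
    for ch in text:
        if ch == "-":
            # Bug: '-' is skipped wherever it appears,
            # not only at the start of the input.
            continue
        if not ("0" <= ch <= "9"):
            return False
        seen_digit = True
    return seen_digit


def match_int_fixed(text: str) -> bool:
    """Hand-written matcher that allows '-' only as the first character."""
    start = 1 if text.startswith("-") else 0
    digits = text[start:]
    return len(digits) > 0 and all("0" <= ch <= "9" for ch in digits)


if __name__ == "__main__":
    for sample in ["123", "-123", "00-123", "--5", "-", ""]:
        print(repr(sample),
              match_int_regex(sample),
              match_int_buggy(sample),
              match_int_fixed(sample))
```

The buggy loop looks reasonable at a glance, which is why reading alone did not catch it quickly. Comparing the hand-written matcher against the regexp on a handful of inputs is one cheap way to surface a mismatch like this.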

So I'm not convinced by the conclusion:

> ... don't write lexers and parsers with regular expressions as the starting point. Your code will be faster, cleaner, and much easier to understand and to maintain.