Hacker News
My favorite regex of all time (catonmat.net)
272 points by cleverjake on Nov 12, 2012 | 111 comments



As someone who makes much of his living rehabilitating old perl scripts, please, if you must use such things, use them like this:

[ -~] #match only printable characters

It takes 5 seconds longer, and with regexes, just knowing what the damn thing is trying to do is half the battle. When you use a regex, use a comment. It's the civil thing to do.


I recommend using the /x suffix to extend your pattern's legibility by permitting whitespace and comments.

/x allows you to break up your regex into its component parts, one part per line, and then comment each part.

Here is what the manual says about /x:

/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.

http://perldoc.perl.org/perlre.html
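For readers outside Perl, the same idea can be sketched in Python, whose re.VERBOSE flag plays a role similar to /x. This is only an illustration: the names PRINTABLE_ASCII and is_printable_ascii are made up here, and the space inside the character class survives because verbose mode leaves character classes alone.

```python
import re

# Hypothetical names; re.VERBOSE is Python's rough counterpart to Perl's /x.
PRINTABLE_ASCII = re.compile(
    r"""
    [ -~]*    # any run of characters from space (0x20) through tilde (0x7E)
    """,
    re.VERBOSE,
)


def is_printable_ascii(text: str) -> bool:
    """Return True when every character of text is printable ASCII."""
    return PRINTABLE_ASCII.fullmatch(text) is not None
```

The comment lives inside the pattern itself, so it travels with the regex when the pattern is copied elsewhere.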


Yeah, any time I use a regex that isn't immediately obvious, I put it in a function called get_<something>. Unfortunately, people who write overly complicated and error-prone regexes usually don't choose to document them.


If a regex is going to be reusable, then yeah, I'd agree. But dumping single lines of code into their own functions just for readability isn't practical for real time systems. In those cases you really should be using comments as they get stripped out by the compiler.


Couldn't those functions just be inlined by the compiler if they're simple regex-wrappers anyway?

I do agree that it might be overkill to move regexes to their own functions just for readability's sake but I don't buy the performance argument. Furthermore, regexes are most popular in scripting languages that no sane person would use for real time performance-critical systems anyway.


1. Ahh right. I wasn't aware that happened.

2. Web sites are a classic example of scripting languages being used for real time performance critical systems (though I'm not arguing that all web sites are real time).

Sometimes the ability to modify code easily is as important to the choice of languages as the raw execution speed of the compiled binaries.


REGEXES aren't practical for real time systems.

If you're using a regex, and certainly if you're using a language other than C, you probably have space for the function call overhead.


I don't really agree with that.

Sometimes C is inappropriate (eg you'd be nuts to build a website in C yet some sites do offer real time services)

Often the data set and/or logic required makes C an inappropriate language (eg you wouldn't use C for AI nor for some types of database operations).

And even in the cases where you're just building a standard procedural system, sometimes the interface lends itself better to other languages (eg C would be possibly the worst language for real time websites.)

But even in the cases where you're building a solution that's suited for C, there are still other performance languages which could be used.

"Real time" is quite a general term and as such, sometimes it makes more sense to use scripting languages which are performance tuned. Which is where writing 'good' PCRE is critical as RegEx can be optimised and compiled - if you understand the quirks of the language well enough to avoid easy pitfalls, eg s/^\s//; s/\s$//; outperforms s/(^\s|\s$)//; despite it being two separate queries as opposed to one.

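A rough way to check a claim like the trimming one above is to measure it. The sketch below uses Python's re rather than PCRE or Perl, so its timings say nothing about Perl itself; it also assumes \s+ (trim whole runs) where the comment wrote \s, and the function names are invented for the example.

```python
import re
import timeit

LEADING = re.compile(r"^\s+")
TRAILING = re.compile(r"\s+$")
BOTH_ENDS = re.compile(r"^\s+|\s+$")


def trim_two_passes(text: str) -> str:
    """Strip leading, then trailing whitespace with two anchored patterns."""
    return TRAILING.sub("", LEADING.sub("", text))


def trim_one_pass(text: str) -> str:
    """Strip both ends with a single alternation."""
    return BOTH_ENDS.sub("", text)


if __name__ == "__main__":
    sample = "   " + "word " * 1000 + "   "
    # Print timings instead of assuming which variant wins on this engine.
    for fn in (trim_two_passes, trim_one_pass):
        print(fn.__name__, timeit.timeit(lambda: fn(sample), number=200))
```

Both functions return the same string; only the work the engine does to get there differs.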

"Real time" is commonly assumed to mean that you can't use a garbage collected language or need to be extremely careful doing so because random pauses of 100ms break your constraints.

If you're in a situation where the overhead of a couple of function calls is unacceptable, regexes are totally unacceptable and you need to write custom character manipulation.

This situation is really rare and in almost all business cases, using C is inappropriate.


Shouldn't most compilers (jit-)inline it?


or even better # Match only printable ASCII characters.


The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.

The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.


Thank you for that. You have no idea how annoying it is to port perl scripts from ASCII to EBCDIC when they do that kind of thing.


It's not an ASCII vs EBCDIC thing, it's an ASCII vs Unicode thing.


It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bitten me before when I was translating perl scripts from Linux to zOS USS. Take a look at the code page for EBCDIC and you'll quickly see why it's a massive pain to sort through regexes like that.


I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.


I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, tax rolls, etc. I had to deal with a Perl script which dealt with ASCII to EBCDIC to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.


So did I! Now that's a war story....


More generally, it's a character set / collating sequence thing. Specifying a range with a start and end point requires understanding what that range specifies, which can change depending on context, locale, character set, etc.


Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?


In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.

http://en.wikipedia.org/wiki/Ebcdic


It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor digits, nor lowercase letters from s through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
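To see exactly which characters the space-to-tilde range would cover on an EBCDIC system, one can ask a codec. A minimal sketch, assuming code page 037 (Python's "cp037" codec); other EBCDIC variants place brackets and a few other characters differently, so the output applies to this code page only.

```python
CODEC = "cp037"  # assumption: one common EBCDIC code page; others differ

# Byte values that space and tilde occupy in this code page.
low = " ".encode(CODEC)[0]
high = "~".encode(CODEC)[0]


def in_ebcdic_range(ch: str) -> bool:
    """True if ch's cp037 byte lies between the bytes for space and tilde."""
    return low <= ch.encode(CODEC)[0] <= high


ascii_printable = [chr(c) for c in range(0x20, 0x7F)]
covered = "".join(ch for ch in ascii_printable if in_ebcdic_range(ch))
missed = "".join(ch for ch in ascii_printable if not in_ebcdic_range(ch))

print(f"space=0x{low:02X} tilde=0x{high:02X}")
print("covered:", covered)
print("missed: ", missed)
```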


Or just use [[:print:]]


Unless you want to seem clever and impress the PHB.

It is the selfish (but smart) thing to do.


I'm not sure who the PHB is, but I'm certainly not impressed by anything cryptic in a codebase. Deliberately writing code that's hard to understand should be a firing offense.


PHB: Pointy Haired Boss


In that case, the smart thing to do is not to work for the PHB, rather than pervert your craft in an attempt to impress him/her.


In this case, the entire post was the comment.

You are right anyway.


Google is by far the best "comment"


Are you saying people should google regular expressions? In my experience (correct me if I'm wrong) that doesn't work; I've never been able to get Google to return relevant results, even with quotation marks.


Agreed, Google fails at this. However, alternative search engines do better:

http://searchco.de/?q=%5B+-~%5D+ext%3Apod&cs=on
http://symbolhound.com/?q=%5B+-~%5D

Symbolhound gives the answer quite well, and searchco.de has some examples of its use in the results.


I'm saying that comments are usually either wrong or out of date. Developers code one regex and comment it, then fix a bug later and don't update the comment, and now there's a discrepancy between the comment and the code. It's nearly always easier to just google the code and see what it does if (as in this case) it's not obvious.


Your response doesn't address what citricsquid said, googling for a regex will almost never return helpful results.


Google regex and you'll find plenty of resources, including tools for testing patterns. You won't find much for any specific pattern, but read the docs and it will be apparent what this regex does. Familiarity and competence with regex is a basic component of being a developer.


or make a function regexMatchingAllPrintableASCIIChars() and have it return the regex.
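A minimal sketch of that suggestion in Python, with the function name adapted to snake_case. The helper get_printable_ascii is an invented example of a caller, in the spirit of the get_<something> convention mentioned earlier in the thread.

```python
import re

_PRINTABLE_ASCII = re.compile(r"[ -~]")  # space (0x20) through tilde (0x7E)


def regex_matching_all_printable_ascii_chars() -> re.Pattern:
    """Return the compiled pattern; the function name documents the intent."""
    return _PRINTABLE_ASCII


def get_printable_ascii(text: str) -> str:
    """Keep only the printable ASCII characters of text (hypothetical caller)."""
    return "".join(regex_matching_all_printable_ascii_chars().findall(text))
```

Compiling once at module level means the wrapper costs one function call per use and nothing more.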


Search "[ -~]" (with or without quotes) to see how good Google's comment is.


The only ambiguous thing about this regex is knowing what's between space and tilde. Otherwise this is a pretty ordinary regex.


Hey, might just be me. I'm usually the 'Ben, can you help me with a regular expression' guy over here, but I stumbled, hard, and failed to connect the '-' with a range of characters (probably because I never thought of 'space to .. something').

So I read the snippet, thought 'Yeah, a character class of space, -, ~' and fell on my face in the next couple of lines.

Yeah, I should've known better, I know how to read it. If .. I invest the time and don't glance over a construct and hope to just get it instantly.

I wouldn't want to see this in a code base without proper documentation (be it a comment, a function name or whatever. Something).


The only thing ambiguous about it is most of it?


Not that I agree with the "expect people reading your code to Google things" mindset, but to be fair the only ambiguous thing is the ASCII table which is Googleable.


There is one author of the code, and potentially many readers.

That one author is the only person who knows what s/he is trying to achieve.

That author taking a few minutes to add some comments will save other people the time to search for answers and the time it takes to grok everything.


The best code is readable. Readability includes comments. If you're going to comment anything in your code at all, RegExes should be at the very top of that list.

Even if I can figure out what the regex matches (with Google or something else), that doesn't necessarily tell me WHY I'm matching on that particular pattern, or why I needed a RegEx in this spot, or what the intent was at the time of writing it.


This will not only miss non-ascii printing characters, but it's not even much shorter than typing

  [[:print:]]
to use the explicit character class.


The [[:print:]] will match any printable characters like åä, while the [ -~] will not.

I used this once as another safeguard against pushing binary data into the database. It was a poor system to begin with where you even have that possibility... and it happened at least once before the fix and my safeguard was in place.
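The difference between the two classes is easy to demonstrate. Python's re module has no [[:print:]], so this sketch swaps in str.isprintable() as a stand-in; it follows Unicode rules rather than the locale, so it is an approximation of the POSIX class, not the same thing. The function names are made up for the example.

```python
import re

ASCII_PRINTABLE = re.compile(r"[ -~]*")


def ascii_printable(text: str) -> bool:
    """True if text consists only of characters from space through tilde."""
    return ASCII_PRINTABLE.fullmatch(text) is not None


def unicode_printable(text: str) -> bool:
    """Stand-in for [[:print:]]: Unicode-aware printability check."""
    return text.isprintable()


for sample in ("abc", "\u00e5\u00e4", "\x00\x01"):
    print(repr(sample), ascii_printable(sample), unicode_printable(sample))
```

As a guard against binary data, either check rejects control bytes; only the ASCII range also rejects accented text.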


"å" is perfectly valid text input in my locale.


There will be situations where you need to check specifically for 7-bit ASCII printable characters only. I've worked with APIs that require everything outside that range to be escaped/encoded into it.

Email could be an example, I guess, although I haven't worked with it enough to know whether the whole "7-bits only" thing is still an issue these days.


I think that was his point, that he had a good use for :print: over just -~


Jeepers... cut the guy some slack. He didn't say this is the bullet proof way of doing everything YOU want to do in all situations, every time, forever. He said "I thought I'd share my favorite regex of all time". And then explained what it does. Why does everyone have to poop on his favorite thing?


My favorite regex is the following,

/^1?$|^(11+?)\1+$/

Which finds prime numbers. Although, I can't for the life of me think of a reason for using it.

http://stackoverflow.com/questions/3296050/how-does-this-reg...


I do dislike people calling that expression a "regex", because it isn't: regular expressions cannot contain backreferences, and must be computable in linear time, whereas primality tests are polynomial.


While I agree I believe this comment by _delirium sums this up rather well,

http://news.ycombinator.com/item?id=1486502

full comment thread here http://news.ycombinator.com/item?id=1486158


I agree more with philh's response that there is no alternative term for the true meaning of "regular expression" — a regular language, as suggested by _delirium, is not the same thing.

I suppose I could accept "regex" as not being a regular expression as such, but the two are used so interchangeably that maintaining a distinction isn't very realistic. I'd personally rather a regular expression described a regular language, and "PCRE" (or so) used for the Turing-complete expressions with a similar syntax.


I'm not a big fan of your explanation. To be more precise, true "regular expressions" are computationally equivalent to deterministic finite automata, which indeed can test an n-character string in O(n) time.


NFAs and DFAs both recognise the regular languages (and only them).


It's PCRE (Perl Compatible Regular Expressions), which is one of the most popular dialects of regex. But AFAIK there isn't a hard and fast RegEx standard.

So I'd argue that code is RegEx.

I guess it's just a matter of perspective though.


I had to prove that in a formal languages class once and I still have no idea how it works.


First, it does not match "prime numbers". It matches composite numbers in unary notation (n is represented by n '1' characters).

The first part (^1?$) allows "" and "1" to match (so that 1 is not detected as a prime).

The second part matches groups of two or more ones (11+?), repeated twice or more, ie products n*m, n ≥ 2, m ≥ 2.

The backreference means that \1 should match the exact same string as the first (11+?). It's different from using (11+?){2,} which would match n_1+n_2+n_3..., n_1 ≥ 2, n_2 ≥ 2, n_3 ≥ 2 (where submatch is independent).
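The explanation above can be tried out directly. A small sketch in Python, where is_prime and LOOSE are names invented for the example; it is only practical for small n, since the string grows with the number and the engine backtracks heavily.

```python
import re

# Matches "", "1", and any run of 1s whose length is a product n*m, n, m >= 2.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

# Without the backreference each repetition may differ in length,
# so this matches any length that is a sum of parts >= 2.
LOOSE = re.compile(r"^(11+?){2,}$")


def is_prime(n: int) -> bool:
    """n is prime exactly when its unary form does NOT match."""
    return NOT_PRIME.match("1" * n) is None


print([n for n in range(30) if is_prime(n)])
print(bool(LOOSE.match("1" * 5)), bool(NOT_PRIME.match("1" * 5)))
```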


Are people seriously still deliberately using ASCII-reliant code?


It's interesting. Doesn't mean it's worthy of being put in production code.


it does if the text you are dealing with is specified as ascii only


For file names, URLs, domain names, etc. it's usually the safe thing to do.


Whose filenames aren't unicode? Also domains and URLs can be unicode too.


> Whose filenames aren't unicode?

Many filesystems don't support unicode or support only a subset of it:

https://en.wikipedia.org/wiki/Filename#Comparison_of_filenam...

> Also domains and URLs can be unicode too.

Domains: it depends at which level you are dealing with them. See https://en.wikipedia.org/wiki/Internationalized_domain_name

    Internationalized domain names are stored in the Domain
    Name System as ASCII strings using Punycode transcription.

URLs: Unicode characters are not allowed in URLs. See http://www.faqs.org/rfcs/rfc1738.html and http://www.blooberry.com/indexdot/html/topics/urlencoding.ht...

    only alphanumerics, the special characters "$-_.+!*'(),", and
    reserved characters used for their reserved purposes may be used
    unencoded within a URL.
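Both conversions can be sketched with the Python standard library. The hostname and path are invented examples, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat this as an illustration rather than a complete IDN implementation.

```python
from urllib.parse import quote


def to_dns_form(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


def to_url_path(path: str) -> str:
    """Percent-encode non-ASCII characters in a URL path as UTF-8 bytes."""
    return quote(path)


print(to_dns_form("b\u00fccher.example"))
print(to_url_path("/caf\u00e9"))
```

After either conversion, every character of the result falls inside the space-to-tilde range.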



exists in DNS as xn--ebkur-tra.is



Not as often as handling all characters.


Every time I've had to deal with Unicode and internationalization, it's been a problem.

For example, a few years ago I grabbed a source tarball from somewhere, I forget what or where. It had the author's name in a comment, which included an O with dots over it. That was the only non-ASCII character in the source code. No matter what I did, both Eclipse and command-line javac refused to compile the source.

Finally I wrote a script to delete his name from every source file manually. It compiled flawlessly.

Then there's the time I found some text files with two characters of binary junk at the beginning, followed by completely normal text. Again, I forget what I was doing, but some program was refusing to process them correctly. It was something internationalization-related called the BOM. Eventually I ended up writing a script to walk a directory and remove the first two bytes of every file. (This can probably be done with dd and xargs on UNIX, but I was using Windows at the time, which means that something like this will require spending an hour or so in your favorite programming language.)

These experiences lead me to believe that, for bootstrapped USA startups at least, you shouldn't worry about a market outside the English-speaking world.

If you need to worry about junk like accented characters or moon runes (Chinese/Japanese/Korean characters), it means you're big enough to afford to hire someone specifically to address the problem.


I assume this is a not very subtle troll? Java source is unicode? (The offhand reference to dd and xargs is a bit too much).

How do you define "English-speaking world", btw? Those too ignorant to have heard of non-ascii-characters (ie: excluding Canada, as anyone doing business there should at least have heard of French)?

Anyway, for anyone actually burnt by something similar on a GNU system try looking up recode(1).


What? You suffered from other peoples' bad internationalization, which implies that people shouldn't care about internationalization?


BOM sounds more like an issue with you switching Unicode documents between Windows and Unix, rather than a problem with internationalisation.

http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

And personally I think excluding all internationalisation because it's harder is a terrible attitude to have. Particularly these days, when there are online tutorials for pretty much any job imaginable (not to mention the number of helpful experts willing to give up their time for free on various forums and communities).


> which means that something like this will require spending an hour or so in your favorite programming language

Ok, this is where I stop worrying about how quickly I write code. Did this (removing BOM) quite a few times and it took just a few minutes in Python (under Windows). Heck, this could be a two-liner I think :)
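For the record, a sketch of what such a short Python fix might look like. It assumes the files are UTF-8, whose BOM is the three bytes EF BB BF; the story above mentions two bytes, which would point to UTF-16 instead, and the thread doesn't say which it was. The function name is invented.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than bytes, opening the file with encoding="utf-8-sig" drops the mark during decoding.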


I, for one, applaud this attitude. It gives programmers and companies that know what they're doing a leg up over people who couldn't even bother to figure out UTF-8. Natural segmentation of a target market is a good thing.


I sense a daily wtf material here.


Yes, when dealing with RFC's that do.


I think HN is written in Arc, which is not very Unicode friendly.


Δοκιμή.

EDIT: It works fine for comments, at least.


Did ducttape stop being sticky?


It's clever, but it's also completely unreadable for anyone who didn't read this article. Regexes have serious maintainability issues as it is; let's not make it worse by putting clever tricks in them.


I don't understand why this is "completely unreadable".

What else could this have been besides match the character range from space to tilde?


The main risk in my mind was that it was some sort of control sequence for a feature I hadn't memorized.

I know there's some syntax I can use to create a zero width negative look behind recursive greedy named capture group back reference. Perhaps hyphen-tilde triggers something like that.


Most people would have to check an ASCII table to know what that range is, though.


Which takes for granted the fact that your input stream is even ASCII to begin with. I'm too lazy to check, but I'm pretty sure this isn't going to catch all printable Unicode characters, for example - and then you're left scratching your head over what the hell the original author was trying to achieve.


Presumably the space, commonly having no meaning in and of itself, could throw you for a moment or two. This isn't a regex `foo_[a-z]`, you have to stop and think about it for a moment.

I don't think it is particularly bad though. It's just not the most trivial of regexes.


Every regex seems like a clever trick.



Ah, this is my favorite also. If seeing this doesn't make you second guess using a RegExp when a parser is more appropriate, well...you might be a Perl programmer?



This seems to be a T-shirt advert, why am I reading this on HN?


> why am I reading this on HN

Because enough people voted it up within a set time window.


I'm sorry that it sounds like it. It's really not. I commented about it on this thread http://news.ycombinator.com/item?id=4775100.


I suppose a single regex can be both "favorite" and "worst" at the same time... it's only slightly interesting to know where ~ appears in the ASCII character set, and while someone might recall that space is kinda near the beginning but after the control characters, is it the first helpful printable character? Who knows?


> I suppose a single regex can be both "favorite" and "worst" at the same time...

We definitely aren't the only ones who appreciate horrible things.

INTERCAL comes to mind here.


My favourite regex is actually:

[^ -~]

Not to be used in a serious program, but only in an editor (or maybe one-shot data massage perl scripts), to find possible errors or unexpected stuff.
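Outside an editor, the same one-shot check could be scripted. A minimal sketch in Python; find_suspects is a made-up name, and tabs and carriage returns are reported too, since they fall outside the range.

```python
import re

NON_PRINTABLE_ASCII = re.compile(r"[^ -~]")


def find_suspects(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character) for each character outside space..tilde.

    Lines and columns are 1-based; newlines only separate lines.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_PRINTABLE_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits
```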


Also it's more interesting to put unprintable characters on a t-shirt.


This works for ASCII only, use unicode character classes instead.


That only matters if you need to process Unicode.

See my comments [1] [2] [3] for why Unicode / internationalization should be avoided.

[1] http://news.ycombinator.com/item?id=4369323

[2] http://news.ycombinator.com/item?id=4541039

[3] http://news.ycombinator.com/item?id=4775440


So, do you propose that u.s. bootstrapped startups have a disclaimer on the registration page saying: "you cannot put foreign characters anywhere in our system"?

Even if you focus on u.s., you will have problems. If you're doing a CRM, even u.s. users will put in foreign names from time to time. If you're building a CMS, users may want to put in a quotation in french, or will simply use copy&paste from Word, which replaces "-" with "—"...

I honestly have a hard time finding a u.s. centric startup which could afford to ignore unicode. The support requests, the fires caused by errors, and the disclaimer that you'd have to put on the registration page, would cost much more than simply learning how to code the f'n utf.

Building an MVP is good practice in Lean. Saying "I'm bootstrapping hence I don't have the time to learn the programming tools" is just ignorance and incompetence. It's not like Unicode gives you extra work, it just requires you to learn a few basic concepts. If you try to build a site which doesn't support Unicode, you'll have to put a lot of safeguards everywhere to cover up for your incompetence.


Why?


I like to use a variation of this in vim to quickly see if an html doc I'm working on contains weird characters that I might want to replace with &html; entities:

  /[^ -~]


[\p{L&}] <- unicode version, in case you were wondering.


No, you don't need '[' and ']'? Also you don't need the '&' to get the equivalent of the above, but might have to add support for spaces?

  > \p{L} or \p{Letter}: any kind of letter from any language.

vs

  > \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).

Along with:

  > \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.

Considering the OP matches everything printable, including whitespace (or actually just space, not tab), numbers and punctuation, I think the equivalent would be "\X"?

All this based on glancing at:

   http://www.regular-expressions.info/unicode.html
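The distinctions quoted above can be explored without a regex engine that supports \p{...}. Python's standard re module does not, so this sketch swaps in unicodedata.category() to approximate the three properties; the function names are invented for illustration.

```python
import unicodedata


def is_cased_letter(ch: str) -> bool:
    """Approximates \\p{L&}: uppercase, lowercase or titlecase letter."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}


def is_letter(ch: str) -> bool:
    """Approximates \\p{L}: any kind of letter."""
    return unicodedata.category(ch).startswith("L")


def is_separator(ch: str) -> bool:
    """Approximates \\p{Z}: space, line or paragraph separator."""
    return unicodedata.category(ch).startswith("Z")


for ch in ("a", "\u00e5", "\u4e2d", "1", " ", "\t"):
    print(repr(ch), unicodedata.category(ch),
          is_cased_letter(ch), is_letter(ch), is_separator(ch))
```

None of the three covers digits or punctuation, which is why a letter class alone is not a Unicode counterpart of the space-to-tilde range.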


Wouldn't it be smarter, in that case, to match something like [\x20-\x7E] (unsure if that's valid regex, but you get the point)? It's more explicit that way: it is obvious which characters are included, and the intention of the character class is immediately clear (to me). "0x20 to 0x7E" triggers the idea of "Printable ASCII" a lot sooner than <space> to <tilde>.
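Hex escapes in a character class are valid in most engines, Python's included. One detail worth checking is the upper bound: tilde is 0x7E, and 0x7F is the DEL control character, so this sketch uses \x7E. The helper name matched_chars is made up.

```python
import re

BY_LITERAL = re.compile(r"[ -~]")
BY_HEX = re.compile(r"[\x20-\x7E]")


def matched_chars(pattern: re.Pattern) -> set[str]:
    """Collect every single character (code points 0-255) the class accepts."""
    return {chr(c) for c in range(256) if pattern.fullmatch(chr(c))}


print(matched_chars(BY_LITERAL) == matched_chars(BY_HEX))
print(len(matched_chars(BY_HEX)))
```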


what a plug - obviously the only real purpose of this was to sell those t-shirts.


I'm sorry that was not the purpose. I just wanted to share this regex trick that I had in my mind. I added the shirts only later when I saw that the article is getting very popular. People really seemed to like my previous tees and it's great to make a little extra money and continue doing what I love - coding and writing blog posts.


well obviously people are interested in it. Can't fault you for trying to make a little scratch


Not all strings are ASCII! :-(


Can anyone explain how this regex [- ~] matches ASCII characters ?


It's pretty simple. Assuming you know regex... I'm going to assume you don't since you are asking.

The bracket expression [ ] defines single characters to match; you can put more than 1 character inside, and any one of them will match.

  [a] matches a
  [ab] matches either a or b
  [abc] matches either a or b or c
  [a-c] matches either a or b or c. 
The - allows us to define the range. You can just as easily use [abc] but for long sequences such as [a-z] consider it short hand.

In this case [ -~] it means every character between <space> and <tilde>, which just happens to be all the ASCII printable characters (see chart in the article). The only bit you need to keep in mind is that <space> is a character as well, and hence you can match on it.

You could rewrite the regex like so (note I haven't escaped anything in this, so it's probably not valid)

  [ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]
but that's not quite as clever or neat.


It doesn't. Space is significant here, and if '-' is at the front of the character class it matches a literal '-'. Your regex '[- ~]' matches either '-' or ' ' or '~'.
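The difference is easy to check by enumerating what each class accepts. A small sketch in Python; matched_chars is a made-up helper that tests every 7-bit character against the pattern.

```python
import re


def matched_chars(pattern: str) -> str:
    """Return every 7-bit character that the single-character pattern accepts."""
    cls = re.compile(pattern)
    return "".join(chr(c) for c in range(128) if cls.fullmatch(chr(c)))


print(repr(matched_chars(r"[- ~]")))  # hyphen first: three literal characters
print(len(matched_chars(r"[ -~]")))   # hyphen in the middle: a range
```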


I don't want to be "that guy" and it's probably just my own stupidity, but what's so special about this for it to frontpage HN?



