As someone who makes much of his living rehabilitating old perl scripts, please, if you must use such things, use them like this:
[ -~] #match only printable characters
It takes five seconds longer, and with regexes, just knowing what the damn thing is trying to do is half the battle. When you use a regex, use a comment. It's the civil thing to do.
I recommend using the /x suffix to extend your pattern's legibility by permitting whitespace and comments.
/x allows you to break up your regex into its component parts, one part per line, and then comment each part.
Here is what the manual says about /x:
/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.
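Python's `re.VERBOSE` flag is the direct analogue of Perl's /x, for anyone following along in that language. A minimal sketch (pattern and names are mine, not from the post; note that, as in Perl, whitespace inside a character class is still significant, so the class stays on one line):

```python
import re

# re.VERBOSE ignores unescaped whitespace and treats '#' as starting a
# comment, so each part of the pattern can be annotated.
printable_run = re.compile(r"""
    ^             # start of string
    [\x20-\x7e]+  # one or more printable ASCII chars (space through tilde)
    $             # end of string
""", re.VERBOSE)

print(bool(printable_run.match("Hello, world!")))  # True
print(bool(printable_run.match("tab\there")))      # False: tab is not printable ASCII
```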
Yeah, anytime I use a regex that isn't immediately obvious I put it in a function called get_<something>. Unfortunately, people who write overly complicated and error-prone regexes usually don't choose to document them.
If a regex is going to be reusable, then yeah, I'd agree. But dumping single lines of code into their own functions just for readability isn't practical for real time systems. In those cases you really should be using comments as they get stripped out by the compiler.
Couldn't those functions just be inlined by the compiler if they're simple regex-wrappers anyway?
I do agree that it might be overkill to move regexes to their own functions just for readability's sake but I don't buy the performance argument. Furthermore, regexes are most popular in scripting languages that no sane person would use for real time performance-critical systems anyway.
Web sites are a classic example of scripting languages being used for real time performance critical systems (though I'm not arguing that all web sites are real time).
Sometimes the ability to modify code easily is as important to the choice of languages as the raw execution speed of the compiled binaries.
Sometimes C is inappropriate (eg you'd be nuts to build a website in C yet some sites do offer real time services)
Often the data set and/or logic required makes C an inappropriate language (eg you wouldn't use C for AI nor for some types of database operations).
And even in the cases where you're just building a standard procedural system, sometimes the interface lends itself better to other languages (eg C would be possibly the worst language for real time websites.)
But even in the cases where you're building a solution that's suited for C, there are still other performance languages which could be used.
"Real time" is quite a general term and as such, sometimes it makes more sense to use scripting languages which are performance tuned. Which is where writing 'good' PCRE is critical as RegEx can be optimised and compiled - if you understand the quirks of the language well enough to avoid easy pitfalls, eg s/^\s//;s/\s$//; outperforms s/(^\s|\s$)//; despite it being two separate queries as opposed to one.
"Real time" is commonly assumed to mean that you can't use a garbage collected language or need to be extremely careful doing so because random pauses of 100ms break your constraints.
If you're in a situation where the overhead of a couple of function calls is unacceptable, regexes are totally unacceptable and you need to write custom character manipulation.
This situation is really rare and in almost all business cases, using C is inappropriate.
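For what the "custom character manipulation" route might look like, a sketch (Python for brevity; the function name is mine): a direct scan with no regex engine involved at all.

```python
# Check that every character falls in the printable ASCII range,
# i.e. between space (0x20) and tilde (0x7E) inclusive.
def all_printable_ascii(s: str) -> bool:
    return all(" " <= ch <= "~" for ch in s)

print(all_printable_ascii("Hello!"))   # True
print(all_printable_ascii("caf\xe9"))  # False: é is outside ASCII
```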
The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.
The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.
It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bit me before when I was translating perl scripts from Linux to zOS USS. Take a look at the code page for EBCDIC, you'll see quickly why it's a massive pain to sort through regexes like that.
I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, taxrolls, etc. I had to deal with a Perl script which handled ASCII to EBCDIC conversion to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.
More generally, it's a characterset / collate sequence thing. Specifying a range with a start and end point requires understanding what that range specifies. Which can change depending on context, locale, characterset, etc.
In the case of EBCDIC, there are several places in the alphabetic collation sequence where non-alpha characters are interspersed among the letter codes, most notably between r and s, though there is a gap between i and j as well. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.
It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you hit characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from s through z (though it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
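To make that concrete, a quick check (Python, using its built-in cp037 codec as a stand-in for one common EBCDIC variant; the helper name is mine). It tests whether a character's byte value falls between the encodings of space and tilde, which is what `[ -~]` effectively does:

```python
# Under ASCII, space..tilde is 0x20..0x7E; under EBCDIC cp037 it is
# 0x40..0xA1, which slices through the alphabet in odd places.
def in_space_tilde_range(ch: str, encoding: str) -> bool:
    lo = " ".encode(encoding)[0]
    hi = "~".encode(encoding)[0]
    return lo <= ch.encode(encoding)[0] <= hi

for ch in "Arsz(":
    print(ch, in_space_tilde_range(ch, "ascii"), in_space_tilde_range(ch, "cp037"))
# 'A' and 's'..'z' fall outside the range under cp037; 'r' and '(' stay inside.
```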
I'm not sure who the PHB is, but I'm certainly not impressed by anything cryptic in a codebase. Deliberately writing code that's hard to understand should be a firing offense.
Are you saying people should google regular expressions? In my experience (correct me if I'm wrong) that doesn't work: I've never been able to get Google to return relevant results, even with quotation marks.
I'm saying that comments are usually wrong or out of date: a developer writes a regex, comments it, then fixes a bug later without updating the comment, and now there's a discrepancy between the comment and the code. It's nearly always easier to just google the code and see what it does, if (as in this case) it's not obvious.
Google regex and you'll find plenty of resources, including tools for testing patterns. You won't find much for any specific pattern, but read the docs and it will be apparent what this regex does. Familiarity and competence with regex is a basic component of being a developer.
Hey, might just be me. I'm usually the 'Ben, can you help me with a regular expression' guy over here, but I stumbled, hard, and failed to connect the '-' with a range of characters (probably because I never thought of 'space to .. something').
So I read the snippet, thought 'Yeah, a character class of space, -, ~' and fell on my face in the next couple of lines.
Yeah, I should've known better. I do know how to read it, if I invest the time and don't glance over a construct hoping to just get it instantly.
I wouldn't want to see this in a code base without proper documentation (be it a comment, a function name or whatever. Something).
Not that I agree with the "expect people reading your code to Google things" mindset, but to be fair the only ambiguous thing is the ASCII table which is Googleable.
The best code is readable. Readability includes comments. If you're going to comment anything in your code at all, RegExes should be at the very top of that list.
Even if I can figure out what the regex matches (with Google or something else), that doesn't necessarily tell me WHY I'm matching on that particular pattern, or why I needed a RegEx in this spot, or what the intent was at the time of writing it.
The [[:print:]] will match any printable characters like åä, while the [ -~] will not.
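For anyone in Python land: `re` there has no POSIX `[[:print:]]` class, but `str.isprintable()` serves as a rough stdlib analogue for the same comparison.

```python
import re

ascii_printable = re.compile(r"^[ -~]+$")

print(bool(ascii_printable.match("abc")))  # True
print(bool(ascii_printable.match("åä")))   # False: printable, but not ASCII
print("åä".isprintable())                  # True: the broader notion of printable
```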
I used this once as another safeguard against pushing binary data into the database. It was a poor system to begin with, where you even have that possibility... and it happened at least once before the fix and my safeguard were in place.
There will be situations where you need to check specifically for 7-bit ASCII printable characters only. I've worked with APIs that require everything outside that range to be escaped/encoded into it.
Email could be an example, I guess, although I haven't worked with it enough to know whether the whole "7-bits only" thing is still an issue these days.
Jeepers... cut the guy some slack. He didn't say this is the bullet proof way of doing everything YOU want to do in all situations, every time, forever. He said "I thought I'd share my favorite regex of all time". And then explained what it does. Why does everyone have to poop on his favorite thing?
I do dislike people calling that expression a "regex", because it isn't: regular expressions cannot contain backreferences, and must be computable in linear time, whereas primality tests are polynomial.
I agree more with philh's response that there is no alternative term for the true meaning of "regular expression" — a regular language, as suggested by _delirium, is not the same thing.
I suppose I could accept "regex" as not being a regular expression as such, but the two are used so interchangeably that maintaining a distinction isn't very realistic. I'd personally rather a regular expression described a regular language, and "PCRE" (or so) used for the Turing-complete expressions with a similar syntax.
I'm not a big fan of your explanation. To be more precise, true "regular expressions" are computationally equivalent to deterministic finite automata, which indeed can test an n-character string in O(n) time.
It's PCRE (Perl Compatible Regular Expressions), which is one of the most popular dialects of regex. But AFAIK there isn't a hard and fast regex standard.
First, it does not match "prime numbers". It matches composite numbers in unary notation (n is represented by n '1' characters).
The first part (^1?$) allows "" and "1" to match (so that 1 is not detected as a prime).
The second part matches groups of two or more ones (11+?), repeated twice or more, ie products n*m, n ≥ 2, m ≥ 2.
The backreference means that \1 must match the exact same string as the first (11+?). It's different from using (11+?){2,}, which would match n_1+n_2+n_3..., n_1 ≥ 2, n_2 ≥ 2, n_3 ≥ 2 (where each submatch is independent).
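Putting the two parts above together gives the full pattern under discussion; a quick check in Python (the unary trick as described, nothing added):

```python
import re

# Matches unary strings for 0, 1, and composite numbers.
composite = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime(n: int) -> bool:
    return not composite.match("1" * n)

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```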
Every time I've had to deal with Unicode and internationalization, it's been a problem.
For example, a few years ago I grabbed a source tarball from somewhere, I forget what or where. It had the author's name in a comment, which included an O with dots over it. That was the only non-ASCII character in the source code. No matter what I did, both Eclipse and command-line javac refused to compile the source.
Finally I wrote a script to delete his name from every source file manually. It compiled flawlessly.
Then there's the time I found some text files with two characters of binary junk at the beginning, followed by completely normal text. Again, I forget what I was doing, but some program was refusing to process them correctly. It was something internationalization-related called the BOM. Eventually I ended up writing a script to walk a directory and remove the first two bytes of every file. (This can probably be done with dd and xargs on UNIX, but I was using Windows at the time, which means that something like this will require spending an hour or so in your favorite programming language.)
These experiences led me to believe that, for bootstrapped USA startups at least, you shouldn't worry about a market outside the English-speaking world.
If you need to worry about junk like accented characters or moon runes (Chinese/Japanese/Korean characters), it means you're big enough to afford to hire someone specifically to address the problem.
I assume this is a not very subtle troll? Java source is unicode? (The offhand reference to dd and xargs is a bit too much).
How do you define "English-speaking world", btw? Those too ignorant to have heard of non-ascii-characters (ie: excluding Canada, as anyone doing business there should at least have heard of French)?
Anyway, for anyone actually burnt by something similar on a GNU system try looking up recode(1).
And personally I think excluding all internationalisation because it's harder is a terrible attitude to have. Particularly these days, when there are online tutorials for pretty much any job imaginable (not to mention the numbers of helpful experts willing to give up their time for free on various forums and communities).
> which means that something like this will require spending an hour or so in your favorite programming language
Ok, this is where I stop worrying about how quickly I write code. I've done this (removing a BOM) quite a few times and it took just a few minutes in Python (under Windows). Heck, it could be a two-liner, I think :)
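For what it's worth, a sketch of that near-two-liner (the helper name is mine; covers the UTF-8 and both UTF-16 BOMs, which is a bit more general than blindly chopping two bytes):

```python
# Strip a leading byte-order mark, if present; otherwise return data unchanged.
def strip_bom(data: bytes) -> bytes:
    for bom in (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff"):
        if data.startswith(bom):
            return data[len(bom):]
    return data

print(strip_bom(b"\xff\xfehello"))  # b'hello'
```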
I, for one, applaud this attitude. It gives programmers and companies that know what they're doing a leg up over people who couldn't even bother to figure out UTF-8. Natural segmentation of a target market is a good thing.
It's clever, but it's also completely unreadable for anyone who didn't read this article. Regexes have serious maintainability issues as it is; let's not make it worse by putting clever tricks in them.
The main risk in my mind was that it was some sort of control sequence for a feature I hadn't memorized.
I know there's some syntax I can use to create a zero width negative look behind recursive greedy named capture group back reference. Perhaps hyphen-tilde triggers something like that.
Which takes for granted the fact that your input stream is even ASCII to begin with. I'm too lazy to check, but I'm pretty sure this isn't going to catch all printable Unicode characters, for example - and then you're left scratching your head over what the hell the original author was trying to achieve.
Presumably the space, commonly having no meaning in and of itself, could throw you for a moment or two. This isn't a regex `foo_[a-z]`, you have to stop and think about it for a moment.
I don't think it is particularly bad though. It's just not the most trivial of regexes.
Ah, this is my favorite also. If seeing this doesn't make you second guess using a RegExp when a parser is more appropriate, well...you might be a Perl programmer?
I suppose a single regex can be both "favorite" and "worst" at the same time... it's only slightly interesting to know where ~ appears in the ASCII character set, and while someone might recall that space is kinda near the beginning but after the control characters, is it the first helpful printable character? Who knows?
So, do you propose that u.s. bootstrapped startups have a disclaimer on the registration page saying: "you cannot put foreign characters anywhere in our system"?
Even if you focus on u.s., you will have problems. If you're doing a CRM, even u.s. users will put in foreign names from time to time. If you're building a CMS, users may want to put in a quotation in french, or will simply use copy&paste from Word, which replaces "-" with "—"...
I honestly have a hard time finding a u.s. centric startup which could afford to ignore unicode. The support requests, the fires caused by errors, and the disclaimer that you'd have to put on the registration page, would cost much more than simply learning how to code the f'n utf.
Building an MVP is good practice in Lean. Saying "I'm bootstrapping, hence I don't have the time to learn the programming tools" is just ignorance and incompetence.
It's not like Unicode gives you extra work; it just requires you to learn a few basic concepts. If you try to build a site which doesn't support Unicode, you'll have to put lots of safeguards everywhere to cover up for your incompetence.
I like to use a variation of this in vim to quickly see if an html doc I'm working on contains weird characters that I might want to replace with &html; entities:
No, you don't need '[' and ']' ? Also you don't need the '&' to get the equivalent of the above, but might have to add support for spaces?
> \p{L} or \p{Letter}: any kind of letter from any language.
vs
> \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
Along with:
> \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
Considering the op matches everything printable, including whitespace (or actually just the space, not tab), numbers and punctuation, I think the equivalent would be "\X"?
Wouldn't it be smarter, in that case, to match something like [\x20-\x7F] (unsure if that's valid regex, but you get the point)? It's more explicit that way, being very obvious about which characters are included, as well as immediately making the intention of the character class clear (to me). "0x20 to 0x7F" triggers the idea of "printable ASCII" a lot sooner than <space> to <tilde>.
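One side note on the suggestion above: tilde is 0x7E, while 0x7F is the DEL control character, so the hex spelling of the exact same class would be [\x20-\x7E]. A quick equivalence check (Python; variable names are mine):

```python
import re

hex_form = re.compile(r"^[\x20-\x7e]$")  # explicit hex endpoints
literal_form = re.compile(r"^[ -~]$")    # the original space-to-tilde class

# Both classes accept exactly the same characters across the 8-bit range.
same = all(bool(hex_form.match(chr(i))) == bool(literal_form.match(chr(i)))
           for i in range(0x100))
print(same)  # True
```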
I'm sorry that was not the purpose. I just wanted to share this regex trick that I had in my mind. I added the shirts only later when I saw that the article is getting very popular. People really seemed to like my previous tees and it's great to make a little extra money and continue doing what I love - coding and writing blog posts.
It's pretty simple. Assuming you know regex... I'm going to assume you don't, since you're asking.
The bracket expression [ ] defines single characters to match; you can have more than one character inside, any of which will match.
[a] matches a
[ab] matches either a or b
[abc] matches either a or b or c
[a-c] matches either a or b or c.
The - allows us to define a range. You could just as easily use [abc], but for long sequences such as [a-z] consider it shorthand.
In this case [ -~] it means every character between <space> and <tilde>, which just happens to be all the ASCII printable characters (see chart in the article). The only bit you need to keep in mind is that <space> is a character as well, and hence you can match on it.
You could rewrite the regex like so (note I haven't escaped anything in this, so it's probably not valid):
It doesn't. Space is significant here, and if '-' is at the front of the character class it matches a literal '-'. Your regex '[- ~]' matches either '-' or ' ' or '~'.
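A quick demonstration of the difference (Python; the point carries over to Perl unchanged):

```python
import re

range_class = re.compile(r"^[ -~]$")    # space through tilde: a range
literal_class = re.compile(r"^[- ~]$")  # leading '-': just '-', ' ', or '~'

print(bool(range_class.match("a")))    # True: 'a' is inside space..tilde
print(bool(literal_class.match("a")))  # False: 'a' is none of the three literals
print(bool(literal_class.match("-")))  # True
```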