Regexps are a security leak?
This is a -very- technical post, so if you aren’t a programmer you may not be able to follow along.
Regular Expressions are used in virtually all webservices. mod_rewrite, a very popular apache plugin, uses them. Django, a popular Python web framework, uses regexps to map URLs onto the code that can handle the requests. Perl is virtually built on regular expressions. Virtually all languages popular for web development support regexp parsing.
Unfortunately, certain regular expressions have what I call ‘runaway nature’. A regexp with ‘runaway nature’ has the following property:
There exists at least 1 input string which will cause the act of matching this input string against the regexp to take a very long time.
Simple example: Given the regexp (x+x+)+y and the input string xxxxxxxxxxxxxxxxxxxx, most regexp parsers just hang. Smart ones realize this can’t work (as all matching strings must end in a y, but the input string does not. Unfortunately most aren’t that intelligent). Turns out on e.g. the C# regexp parser, an average powerful machine needs 25 SECONDS to realize that the input does not match the output. See This codinghorror article on the details of this particular case. Clearly the regexp (x+x+)+y has runaway nature, at least on the C# regexp parser.
There are many regexps which have ‘runaway nature’ on only certain platforms. However, no implementation of a regexp parser that I know of is completely immune to ‘runaway nature’ - some regexp strings just implicitly have it, regardless of implementation.
This is a security leak; causing one of the CPU cores of a webserver to hang for 25 seconds makes it totally trivial to crash the server; this is known as a Denial of Service attack. No data is compromised, but the server just stops working.
There are 2 ways this issue can be fixed, that I can see.
- Determine if it is possible for a machine to determine in constant time if a certain regexp pattern has ‘runaway nature’, and generate a warning if this is true. This allows web programmers to be warned in advance that they have a security risk.
- When running a string against a pattern, allow the programmer to specify a ‘limit’. Once the regexp parser backtracks that many times, it just quits and throws an error instead of getting bogged down. By choosing a careful limit, a web programmer can trade off ‘correctness’ against server security. I get the feeling that any input string that causes runaway performance troubles is very likely to be an invalid usecase anyway.
Unfortunately, neither fix is available as standard solution in any mainstream programming language that I know of.
I’m not sure how large this problem really is but I can imagine there are lots and lots of webservices out there which can be brought to a grinding halt by feeding it the right (wrong) input.
NB: This issue crossed my mind when I crafted the following regexp to check if an input string appears to be a URL. I’m not sure if this regexp has ‘runaway nature’. If you’re a real regexp guru and can figure this out, or if you spot any errors, help me out and let me know in the comments! Thanks a lot!
^([hH][tT][tT][pP][sS]?://)?((?:[a-zA-Z0-9][a-zA-Z0-9-]*?[a-zA-Z0-9]?)(?:\.[a-zA-Z0-9][a-zA-Z0-9-]*?[a-zA-Z0-9]?)+)(:\d+)?(/[\w/\.;\?:\&=+\$,#]*)?$”
(1,2,3,4 stands for: protocol, server, port, path string).
Cristiano Betta http://cristianobetta.com/
July 25th, 2007I might be weird, but isn’t there a 4th option: disallowing certain regexs? I know this is probably not the best solution, but of two regexs exists and one has a potential runaway and the other doesn’t, than why not ban the one.
masklinn
July 25th, 2007> no implementation of a regexp parser that I know of is completely immune to ‘runaway nature’
DFA parsers probably are, but they’re rarely used as they have a very low amount of advanced constructs (e.g. the PCRE “regexp” isn’t a regular language, and thus cannot be fully expressed by a DFA). Which means that any RE engine that would be immune to this behavior usually won’t be featureful enough to be used.
If you haven’t read Friedl’s “Mastering Regular Expressions” yet, I strongly suggest you do, it’s an awesome book on the subject.
Tet
July 25th, 2007You can read all about why regexps are often slow, and why they needn’t be here:
http://swtch.com/~rsc/regexp/regexp1.html
Max
July 25th, 2007DFAs do not have the runaway nature.
As far as I know, the features of PCREs that make them non-DFA-able are lookaheads and lookbacks (and possibly non-greediness). A quick glance indicates that you’re not using any of those in your examples, and I don’t think they’re nearly as common as regexen without them.
Nevertheless, I’m not sure if this is a real hole, since it is regexen and not matching candidates which have the runaway nature. So a problem only really arises if you accept regexen from untrusted sources. Like Google codesearch. But codesearch does not allow PCREs, and seems to use a sub-exponential matching algorithm as searching /(x+x+)+y/ is no slower (it seems) than /foo/
Bart
July 25th, 2007The reason why /(x+x+)+y/ has a runaway nature is because of the nested quantifiers: there’s a “+” around a group that contains “+” itself.
Perl has a built-in feature to protect against such regexes, and that is the “cut operator”, written like (?>PATTERN), which prevents backtracking. You gennerally apply it to the inner group, to prevent endless retries inside the group. See japhy’s draft of his regex book, section 8.1.4, at http://japhy.perlmonk.org/book/book/ (MS Word format).
A pattern using the cut operator can match less than the unrestricted pattern does; the responsibility for proper working is back in the hands of the programmer.
And if you still want capturing, you need to add an extra pair of parens.
John Reese http://leetcode.net
July 25th, 2007Why are you complaining about the “insecurity” of regular expressions, and then giving a totally contrived and meaingingless example? Why on Earth would you use the regex (x+x+)+y? That’s just begging for a non-deterministic outcome.
Your point would have been a little easier to agree with if you could give a *real world* example of where you would use a regex that had non-determinism in it, and then explained why you would be unable to find a deterministic replacement for it instead.
One of the biggest benefits of using a regex is that you can express what you want in many different ways. It should be up to the programmer to know what he/she is doing well enough to understand the implications of using any given regex. Why go through the work to ‘ban’ certain regex patterns, when you could simply write a post *educating* people on why they should pay attention to and audit their patterns before they put them in use.
Reinier http://zwitserloot.com
July 26th, 2007@John Reese: Simple. There’s a lot of fuss made about OSes that ship with a security hole wide open as a ‘default’. Your argument (Why make a fuss, just educate the users) doesn’t really work, because as it turns out if there’s no big giant warning buzzer going off, people don’t realize there’s a problem.
If it IS really difficult to construct a regex with runaway nature, then it’s probably easy for the vast majority of regexes to determine (quickly) that it definitely does NOT have runaway nature.
Why aren’t there completely automated, integrated tools out there that just plain generate a log warning if you even try to work with a regex that (might) have runaway nature?
Or, probably much simpler: Why can’t I tell a regex parser to give up well before CPU hogging becomes a problem? It should be trivial for a regex parser to keep track of how many times it backtracked.
axel http://www.wuenschenswert.net
July 26th, 2007Some of the advanced features of PCREs, such as backreferences or context, cause them to slip into a more complex language class than regular in the computer science sense. I am just guessing, but this might cause the static analysis to fall victim to the “runaway” problem itself (i.e. compiling the regexp might take 25 seconds).
There are regexps that, when represented as a DFA, have a number of states exponential in the length of the regexp, which necessarily takes time and space.
We were all quite surprised that matching times only degraded by a factor of 2 - a price you’re willing to pay for reasonable turnaround times and startup time.
Some lore from an old bore: In the old days, we used a regexp package that would compile all regexps into DFAs. As the program in question (a parser for small advertisements for cars) had a huge number of regexps, the startup/compilation time was up to an hour. So they hired a cheap student (me) to implement a lazy NFA-to-DFA compiler (following Thompsons paper, as cited by Tet above), that would only materialize those states actually encountered during parsing. Problem solved, and without backtracking
A timeout is a practical but poor solution. The regexp implementation might try the obvious and simple solution last, and never get there because it gets stuck in some silly backtracking. So even though formally, your input matches the regexp, practically it does not, or only when the system load is low and the CPU is fast (e.g. during tests!).
Arbitrarily pruning the search tree (by timeout, as suggested by the cut operator) makes it almost impossible for a programmer to judge which strings exactly will get matched, other by mentally executing the implementation of that regexp package. That’s not a programming model.
San
July 26th, 2007I tested that ‘runaway’ regex of yours in Tiger Java on a Pentium IV machine, 512 MB RAM and it came up with an answer in under one second…
Reinier http://zwitserloot.com
July 26th, 2007@San: As mentioned, there are many ways to treat regexp (x+x+)+y so that it does respond quickly. By trying the regexp from both ends, the lack of an ending ‘y’ is a very quick check. Also, by ‘pre-compiling’ the regexp, it can be reduced to x{2,}y which parses very quickly.
However, there are other regexes (especially with group matchers) which have runaway nature implicitly, no amont of tricks will solve that. There are alternative regex notations which avoid this issue altogether (by making it impossible to write a regex with runaway nature in them) but those aren’t standard, and they are slightly less capable. (There are certain things they cannot do).