I was reading this article today on two different regular expression algorithms.
According to the article old Unix tools like ed, sed, grep, egrep, awk, and lex, a
Perl uses a memoized recursive backtracking search and, as of some improvements in 5.10, no longer blows up on perl -e '("a" x 100000) =~ /^(ab?)*$/;'
. In recent tests I performed on an OS X box, Perl 5.10 outperformed awk
, even in the cases where awk
's algorithm was supposed to be better.
The Javascript ECMA language description doesn't impose a requirement for the particular implementation of regular expressions, so that part of the question isn't well-formed. You're really wondering about the particular implementation in a particular browser.
The reason Perl/Python etc use a slower algorithm, though, is that the regex language defined isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.
Yes, in fact javascript regex isn't content free regular. Consider the syntax using `{n,m}', that is, matches from n to m accepted regexs. Let d the difference d=|n-m|. The syntax means there exists a string uxdw that is acceptable, but a string uxk>dw that is not. It follows via the pumping lemma for regular languages that this is not a regular language.
(augh. Thinko corrected.)
Though the ECMA standard does not specify the algorithm an ECMAScript implementation should use, the fact that the standard mandates that ECMAScript regular expressions must support backreferences (\1, \2, etc.) rules out the DFA and "Thompson NFA" implementations.