I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks
Looking at the 'regex' engine php uses, pcre here: http://pcre.sourceforge.net/pcre.txt at the limitations section it states:
The maximum length of a compiled pattern is 65539 (sic)
bytes
My guess is that some regex like this:
(123){1,3}
is compiled to something like this
(123)(123)?(123)?
Which makes it bigger than the maximum length
While I agree that the regex compiler shouldn't behave that way, you really shouldn't have encountered this problem. Inside the parentheses, your regex matches exactly one character from a specific set--the definition of a character class. The correct way to write your regex is to list all the characters inside one set of square brackets and forego the parentheses:
/^[a-z0-9 ,'.!?]{1,4000}$/i
That works fine, as this demo shows. However, it was the parentheses that were causing the error (even non-capturing parens cause it), and that doesn't seem right to me, even if they were unnecessary.
The error is due to its LINK_SIZE
, with offset values limiting the compiled pattern size to 64K. This is an expected behavior, explained below, and it's not because of a limit in repetition nor how the pattern is interpreted when compiled.
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.
There are 3 common pitfalls here in (x|y|z){1,4000}
:
|
s) adds backtracking states. It's a good practice to try to reduce them as much as you can. In this case, the regex ^[ !',.0-9?A-Z]{1,4000}$/i
, would match exactly the same, not only avoiding the error, but also proving better performance.From "Handling Very Large Patterns" in pcrebuild man page:
Within a compiled pattern, offset values are used to point from one part to another (for example, from an opening parenthesis to an alternation metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values are used for these offsets, leading to a maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always stored in big-endian order) by default. These are used, for example, to link from the start of a subpattern to its alternatives and its end. The use of 2 bytes per offset limits the size of the compiled regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
Regarding other size limitations in PCRE, you can check this post of mine.
LINK_SIZE
in PHPIf we had a true reason to use a huge pattern, and this pattern could not be simplified any further by all means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won't be portable from now on). It should be the last resort, provided there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of
LINK_SIZE
. This defaults to 2 in the config.h file, but can be overridden by using-D
on the command line. This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.