Question
PHP regular expression fails when a non-UTF-8 character is found
I need to process 40,000 database records to extract a width and a height value from a custom_size MySQL table field.
The field comes in all sorts of different formats.
The most reliable way is to grab the numeric value on the left and right side of an x and strip all non-numeric characters from them.
The code below works well 99% of the time, until it hits a few records with non-UTF-8 characters.
31*32 and 35”x21” are two examples.
When these are run, I get the following PHP errors and the script halts:
Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 1683977065 on line 21
Warning: preg_match(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 24
Demo:
<?php
$strings = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    '1"x1"',
    '31*32',
    '35”x21”'
);
foreach($strings as $string){
    if($string != ''){
        $string = str_replace('”','"',$string);
        // Strip out all characters except for numbers, letter x, and decimal points
        $string = preg_replace( '/([^0-9x\.])/ui', '', strtolower( $string ) );
        // Find anything that fits the number X number format
        preg_match( '/([0-9]+(\.[0-9]+)?)x([0-9]+(\.[0-9]+)?)/ui', $string, $values );
        echo 'Original value: ' .$string.'<br>';
        echo 'Width: ' .$values[1].'<br>';
        echo 'Height: ' .$values[3].'<br><hr><br>';
    }
}
Any ideas on how to get around this? I cannot rebuild the server software to add UTF-8 support.
Update: I just found an answer with a PHP library that converts to UTF-8, and it seems to be helping a lot: https://stackoverflow.com/a/3521396/143030
Answer 1:
By default, the PCRE regex engine reads a string one byte at a time. It therefore ignores the byte sequences that make up a single character in a multibyte encoding like UTF-8 and treats them as separate bytes (one byte, one character).
For example, the character U+201D: RIGHT DOUBLE QUOTATION MARK uses three bytes in UTF-8:
$a = '”';
for ($i=0; $i < strlen($a); $i++) {
echo dechex(ord($a[$i])), ' ';
}
Result:
e2 80 9d
To make the PCRE regex engine read multibyte characters, you can either put one of these directives at the beginning of the pattern: (*UTF), (*UTF8), (*UTF16), (*UTF32), or use the u modifier. The u modifier switches on the available multibyte mode, but it also extends the meaning of the shorthand character classes such as \s, \d, \w to Unicode. In other words, the u modifier is a shortcut for (*UTFx) plus (*UCP), which changes the character classes.
But these features are only available if the PCRE module has been compiled with support for these encodings. (This is the case for most default PHP installations, but it isn't guaranteed.)
That doesn't seem to be the case for you, since using the u modifier gives you this explicit message:
this version of PCRE is not compiled with PCRE_UTF8 support
You can't do anything about that, short of switching to a PHP installation whose PCRE module is compiled with UTF-8 support.
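If you want to know in advance whether a given server supports the u modifier, a small probe like the one below can help. This is only a sketch: the function name pcre_has_utf8() is made up for the example, and it relies on preg_match() returning false when a pattern fails to compile.

```php
<?php
// Hypothetical helper (the name is an assumption, not a PHP built-in).
// preg_match() returns false when the pattern cannot be compiled, which is
// what happens with the u modifier on a PCRE build without UTF-8 support.
// The @ operator only hides the compilation warning.
function pcre_has_utf8()
{
    return @preg_match('/^./u', 'a') === 1;
}

echo pcre_has_utf8() ? "u modifier available\n" : "u modifier not available\n";
```

The script can then choose between a pattern with the u modifier and a byte-oriented pattern instead of halting on the warning.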
However, this isn't really a problem in your case, because the u modifier is useless in your patterns even if your input is UTF-8 encoded.
The reason is that your two patterns only use ASCII literal characters (characters in the 00-7F range), and characters beyond the ASCII range never use bytes from this range in their UTF-8 encoding:
Unicode  char  UTF8   Name
--------------------------------------------------------
U+007D   }     7d     RIGHT CURLY BRACKET
U+007E   ~     7e     TILDE
U+007F         7f     <control>
U+0080         c2 80  <control>
U+0081         c2 81  <control>
...
U+00BE   ¾     c2 be  VULGAR FRACTION THREE QUARTERS
U+00BF   ¿     c2 bf  INVERTED QUESTION MARK
U+00C0   À     c3 80  LATIN CAPITAL LETTER A WITH GRAVE
U+00C1   Á     c3 81  LATIN CAPITAL LETTER A WITH ACUTE
...
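The property shown in the table can be checked directly. The sketch below uses a hypothetical helper, has_ascii_byte(), and a few sample characters written as raw UTF-8 bytes; both the helper and the choice of samples are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when the
// string contains at least one byte in the ASCII range 0x00-0x7F.
function has_ascii_byte($bytes)
{
    for ($i = 0; $i < strlen($bytes); $i++) {
        if (ord($bytes[$i]) <= 0x7F) {
            return true;
        }
    }
    return false;
}

// Sample non-ASCII characters, written as raw UTF-8 bytes.
$samples = array(
    "\xC2\xBE",         // U+00BE VULGAR FRACTION THREE QUARTERS
    "\xC3\x80",         // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
    "\xE2\x80\x9D",     // U+201D RIGHT DOUBLE QUOTATION MARK
    "\xF0\x9F\x98\x80", // U+1F600 GRINNING FACE
);

foreach ($samples as $char) {
    echo bin2hex($char), ': ', (has_ascii_byte($char) ? 'contains an' : 'no'), " ASCII byte\n";
}
```

Since none of these byte sequences contains a byte below 80, a byte-oriented pattern made only of ASCII literals cannot accidentally match inside a multibyte character.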
So you can write:
$string = preg_replace( '/[^0-9x.]+/', '', strtolower( $string ) );
(There is no need for the i modifier since your string is already lowercase, and no need to escape the dot in a character class or to use a capture group. Adding the + quantifier speeds up the replacement, since several consecutive characters are removed in one replacement instead of one by one.)
and:
if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $string, $values)) {
echo 'Original value: ', $string, '<br>';
echo 'Width: ', $values[1], '<br>';
echo 'Height: ', $values[2], '<br><hr><br>';
}
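Putting the two statements together, a minimal sketch could look like this. The function name parse_size() and the returned array keys are assumptions made for the example; the non-ASCII quote is written as raw bytes so the file encoding doesn't matter.

```php
<?php
// Hypothetical helper combining the two steps above; the name parse_size()
// and the 'width'/'height' keys are assumptions for this sketch.
function parse_size($raw)
{
    // Keep only digits, the letter x and dots. No u modifier is needed:
    // each byte of a multibyte character is removed like any other byte.
    $clean = preg_replace('/[^0-9x.]+/', '', strtolower($raw));

    // Look for "number x number", with optional decimal parts.
    if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $clean, $m)) {
        return array('width' => $m[1], 'height' => $m[2]);
    }
    return null; // no usable size found
}

$strings = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    "35\xE2\x80\x9Dx21\xE2\x80\x9D", // 35”x21” with U+201D as raw UTF-8 bytes
    '31*32',
);

foreach ($strings as $string) {
    $size = parse_size($string);
    if ($size === null) {
        echo 'No size found', "\n";
    } else {
        echo 'Width: ', $size['width'], ', Height: ', $size['height'], "\n";
    }
}
```

Values without an x between the numbers, such as 31*32, return null here and would need their own handling.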
However, working without the u modifier can be dangerous with some patterns. For example, the following will not remove the first character as expected if that character is encoded with several bytes; it removes only the first byte of the character:
$a = preg_replace('/^./', '', '”abc');
for ($i=0; $i < strlen($a); $i++) {
echo ' ', dechex(ord($a[$i]));
}
returns:
80 9d 61 62 63
# � � a b c
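When the u modifier is unavailable, one possible workaround for cases like this is to describe a UTF-8 character at byte level in the pattern itself. The sketch below assumes the input is valid UTF-8 (it does not validate it), and the variable name $firstChar is only illustrative.

```php
<?php
// One UTF-8 character described at byte level: either a single ASCII byte,
// or a lead byte (C2-F4) followed by one or more continuation bytes (80-BF).
// Assumption: the input is valid UTF-8; this pattern does not validate it.
$firstChar = '/^(?:[\x00-\x7F]|[\xC2-\xF4][\x80-\xBF]+)/';

// U+201D RIGHT DOUBLE QUOTATION MARK (e2 80 9d) followed by "abc"
$a = preg_replace($firstChar, '', "\xE2\x80\x9Dabc");

echo $a, "\n"; // prints "abc"
```

The whole three-byte sequence is removed, because the continuation bytes are consumed together with the lead byte.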
Source: https://stackoverflow.com/questions/31396590/php-preg-replace-fails-when-a-non-utf8-character-is-detected