preg_match unicode parsing

Submitted by 落花浮王杯 on 2019-12-11 08:33:17

Question


I want to match a subset of Unicode/UTF-8 characters (marked in yellow at http://solomon.ie/unicode/). From my research I came up with this:

// ensure it's valid unicode / get rid of invalid UTF8 chars
$text = iconv("UTF-8","UTF-8//IGNORE",$text);

// and just allow a basic english...ish.. chars through - no controls, chinese etc
$match_list = "\x{09}\x{0a}\x{0d}\x{20}-\x{7e}"; // basic ascii chars plus CR,LF and TAB 
$match_list .= "\x{a1}-\x{ff}"; // extended latin 1 chars excluding control chars
$match_list .= "\x{20ac}"; // euro symbol

if (preg_match("/[^$match_list]/u", $text) )
    $error_text_array[] = "<b>INVALID UNICODE characters</b>";

Testing seems to show it works as expected, but as a newbie to Unicode I'd be grateful if anyone here can spot any vulnerabilities I've overlooked.

Can I confirm that the hex ranges match Unicode code points rather than the encoded UTF-8 byte values (i.e. that x20ac, not xe282ac, is correct for the Euro symbol)?

And can I mix literal characters and hex values like preg_match("/[^0-9\x{20ac}]/u", $text)?

Thanks, Kevin

Note, I tried this question before but it was closed off - "better suited to codereview.stackexchange.com", but no response there so hope it's ok to try again in a much more concise format.


Answer 1:


I created a wrapper to test your code, and I think it securely filters the characters you expect. However, your code raises an E_NOTICE when iconv encounters invalid UTF-8 characters, so I think you should add @ in front of the iconv call to suppress the notice.
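As an alternative to suppressing the notice, malformed input could be rejected before any filtering happens. The following is only a minimal sketch, assuming that rejecting invalid input is acceptable instead of silently stripping bytes; the function name is_valid_utf8 is hypothetical and not part of the original code. It relies on preg_match() returning false when the subject is not valid UTF-8 under the /u modifier.

```php
<?php
// Hypothetical helper: reports whether $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false (not 0) when the
// subject contains invalid UTF-8, so an empty pattern acts as a probe.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // "café" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // lone Latin-1 byte, not valid UTF-8
```

With a check like this in front, the character-class test only ever sees well-formed input, and no error suppression is needed.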

For the second question, it is fine to mix literal characters and hex values. You can also try it yourself. :)
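A quick check along those lines might look like the sketch below. It also touches on the code-point question: with the /u modifier, \x{20ac} refers to the code point U+20AC, not to the bytes E2 82 AC that encode it in UTF-8. The helper name only_digits_or_euro is hypothetical and used purely for illustration.

```php
<?php
// Hypothetical helper: true when $text holds only ASCII digits and the euro sign.
// The pattern mixes a literal range (0-9) with a hex code point (\x{20ac}).
// preg_match() returns 0 when no disallowed character is found.
function only_digits_or_euro(string $text): bool
{
    return preg_match('/[^0-9\x{20ac}]/u', $text) === 0;
}

$euro = "\xE2\x82\xAC"; // the UTF-8 bytes that encode U+20AC

var_dump(only_digits_or_euro("100" . $euro)); // digits plus the euro sign
var_dump(only_digits_or_euro('100$'));        // the dollar sign is outside the class
echo bin2hex($euro), "\n";                    // shows the encoded bytes, not the code point
```

Note that invalid UTF-8 makes preg_match() return false rather than 0, so this sketch treats malformed input as not allowed.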

<?php
function generatechar($char)
{
    $char = str_pad(dechex($char), 4, '0', STR_PAD_LEFT);
    $unicodeChar = '\u'.$char;
    return json_decode('"'.$unicodeChar.'"');
}
function test($text)
{   
    // ensure it's valid unicode / get rid of invalid UTF8 chars
    $text = @iconv("UTF-8", "UTF-8//IGNORE", $text); // @ suppresses the notice iconv raises on invalid input
    // and just allow a basic english...ish.. chars through - no controls, chinese etc
    $match_list = "\x{09}\x{0a}\x{0d}\x{20}-\x{7e}"; // basic ascii chars plus CR,LF and TAB
    $match_list .= "\x{a1}-\x{ff}"; // extended latin 1 chars excluding control chars
    $match_list .= "\x{20ac}"; // euro symbol

    if (preg_match("/[^$match_list]+/u", $text)  )
        return false;

    if(strlen($text) == 0)
        return false; //For testing purpose!
    return true;
}

for($n=0;$n<65536;$n++)
{
    $c = generatechar($n);
    if(test($c))
        echo $n.':'.$c."\n";
}


Source: https://stackoverflow.com/questions/10365160/preg-match-unicode-parsing
