is PHP str_word_count() multibyte safe?

前端 未结 4 1536
Happy的楠姐
Happy的楠姐 2020-11-29 09:23

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_co

4条回答
  •  不知归路
    2020-11-29 10:03

    I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

    • Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

    And perhaps as well:

    • Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
    • Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
    • Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)

    Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.

    If we want to put this to a test, we can set the locale to UTF-8, pass in an invalid string containing a \xA0 sequence and if this still counts as word-breaking character, that function is clearly not UTF-8 safe, hence not multibyte safe (as same non-defined as per the question):

    Output:

    New Locale: en_US.utf8
    
    array(3) {
      [0]=>
      string(5) "aword"
      [6]=>
      string(5) "bword"
      [12]=>
      string(5) "aword"
    }
    

    As this demo shows, that function totally fails on the locale promise it gives on the manual page (I do not wonder nor moan about this, most often if you read that a function is locale specific in PHP, run for your life and find one that is not) which I exploit here to demonstrate that it by no means does anything regarding the UTF-8 character encoding.

    Instead for UTF-8 you should take a look into the PCRE extension:

    • Matching Unicode letter characters in PCRE/PHP

    PCRE has a good understanding of Unicode and UTF-8 in PHP in specific. It can also be quite fast if you craft the regular expression pattern carefully.

提交回复
热议问题