Convert a String into an Array of Characters - multi-byte

刺人心 2020-12-07 02:48

Assuming that, in 2019, every solution that is not Unicode-safe is wrong, what is the best way to convert a string to an array of Unicode characters in PHP?

Obviously th

2 Answers
  • 2020-12-07 03:23

    Just pass an empty pattern with the u modifier and the PREG_SPLIT_NO_EMPTY flag. Alternatively, you can write a pattern with \X (which matches one Unicode grapheme, a sort of Unicode-aware dot) and \K (which resets the start of the reported match). I'll include an mb_split() call and a preg_match_all() call for completeness.

    Code:

    $string='先秦兩漢';
    var_export(preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY));
    echo "\n---\n";
    var_export(preg_split('~\X\K~u', $string, 0, PREG_SPLIT_NO_EMPTY));
    echo "\n---\n";
    var_export(preg_split('~\X\K(?!$)~u', $string));
    echo "\n---\n";
    var_export(mb_split('\X\K(?!$)', $string));
    echo "\n---\n";
    var_export(preg_match_all('~\X~u', $string, $out) ? $out[0] : []);
    

    All produce:

    array (
      0 => '先',
      1 => '秦',
      2 => '兩',
      3 => '漢',
    )
    

    From https://www.regular-expressions.info/unicode.html:

    How to Match a Single Unicode Grapheme

    Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X.

    You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
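
    The snippets above all agree on '先秦兩漢' because every character there is a single code point. A minimal sketch of where the empty pattern and \X part ways, assuming PHP 7.0+ (for the "\u{...}" escape) and a test string of my own choosing that contains a combining mark:

    ```php
    <?php
    // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
    // three code points, but only two graphemes. (Illustrative input, not from the answer.)
    $string = "e\u{0301}a";

    // The empty pattern with the u modifier splits between every code point.
    $codePoints = preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY);

    // \X consumes one whole grapheme at a time, keeping the accent with its base letter.
    $graphemes = preg_match_all('~\X~u', $string, $out) ? $out[0] : [];

    var_export(count($codePoints)); // 3
    echo "\n";
    var_export(count($graphemes));  // 2
    ```

    So which variant is "best" depends on whether you want code points or user-perceived characters.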


    UPDATE: Dharman has brought to my attention that mb_str_split() is available as of PHP 7.4.

    The default length parameter of the new function is 1, so the length parameter can be omitted for this case.

    https://wiki.php.net/rfc/mb_str_split

    Dharman's demo: https://3v4l.org/M85Fi/rfc#output
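
    For code that has to run on both sides of PHP 7.4, something like the following might do. This is only a sketch: utf8_chars() is a hypothetical helper name of mine, and it assumes the input is valid UTF-8. Note that mb_str_split() splits by code point, like the empty-pattern preg_split() call, not by grapheme.

    ```php
    <?php
    // Hypothetical helper (not part of PHP): split a UTF-8 string into code points.
    function utf8_chars(string $string): array
    {
        if (function_exists('mb_str_split')) {
            // PHP 7.4+: length 1 is the default, passed here only to be explicit.
            return mb_str_split($string, 1, 'UTF-8');
        }

        // Older PHP: fall back to the empty-pattern split from above.
        $parts = preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY);

        // preg_split() returns false on failure (e.g. malformed UTF-8).
        return $parts === false ? [] : $parts;
    }

    var_export(utf8_chars('先秦兩漢'));
    ```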

  • 2020-12-07 03:48

    This works for me; it splits a Unicode string into an array of characters:

    //
    // Split at every position that is not right after the start (^)
    // and not right before the end ($), using the Unicode modifier
    // u (PCRE_UTF8).
    //
    $arr = preg_split("/(?<!^)(?!$)/u", $text);
    

    For example:

    <?php
    //
    $text = "堆栈溢出";
    
    $arr = preg_split("/(?<!^)(?!$)/u", $text);
    
    echo '<html lang="fr">
    <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    </head>
    <body>
    ';
    
    print_r($arr);
    
    echo '</body>
    </html>
    ';
    ?>
    

    In a browser, it produces this:

    Array ( [0] => 堆 [1] => 栈 [2] => 溢 [3] => 出 )
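
    To see why the two lookarounds are there, here is a small sketch comparing the pattern with a bare empty pattern. The "ab\n" input and the \z variant are my own additions for illustration, not part of the answer:

    ```php
    <?php
    $text = "堆栈溢出";

    // A bare empty pattern also matches at the very start and the very end,
    // which leaves an empty string at each end of the result.
    $naive = preg_split('//u', $text);

    // The lookarounds suppress those two boundary matches.
    $guarded = preg_split('/(?<!^)(?!$)/u', $text);

    // PREG_SPLIT_NO_EMPTY gets the same result by dropping the empty pieces.
    $flagged = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Caveat: without the D modifier, $ also matches just before a trailing
    // newline, so the last character and the newline are not separated.
    $dollar = preg_split('/(?<!^)(?!$)/u', "ab\n");

    // \z only matches at the true end of the string.
    $strict = preg_split('/(?<!^)(?!\z)/u', "ab\n");

    var_export(count($naive));         // 6
    echo "\n";
    var_export($guarded === $flagged); // true
    echo "\n";
    var_export($dollar);               // 'a' and "b\n"
    echo "\n";
    var_export($strict);               // 'a', 'b' and "\n"
    ```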
    