PHP: Split multibyte string (word) into separate characters

耗尽温柔 submitted on 2019-11-28 08:26:09

Try a regular expression with the 'u' (UTF-8) modifier, for example:

  $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
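
One thing the one-liner hides is that preg_split() returns false instead of an array when the subject is not valid UTF-8 under the u modifier. A minimal sketch of a wrapper that surfaces this case; the helper name split_utf8_chars is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the original answer).
function split_utf8_chars(string $string): array
{
    // With the "u" modifier the pattern and subject are treated as UTF-8,
    // so the empty pattern matches between code points, not between bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for malformed UTF-8 input.
    if ($chars === false) {
        throw new InvalidArgumentException(
            'preg_split failed, preg_last_error() = ' . preg_last_error()
        );
    }

    return $chars;
}

$chars = split_utf8_chars('主楼怎么走'); // five one-character strings
```

Throwing is only one option; returning an empty array or scrubbing the input first would work as well, depending on the caller.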

An ugly way to do it is:

mb_internal_encoding("UTF-8"); // required: without an explicit internal encoding,
                               // the mb_* functions may not treat the string as UTF-8
$string = ".....";
$chars = array();
$length = mb_strlen($string);  // compute once instead of on every iteration
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($string, $i, 1); // one character per array element
}

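You could also retry your original approach with mb_split, setting the encoding before calling it (mb_internal_encoding(), and mb_regex_encoding() for the mb_ereg family that mb_split belongs to).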
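
On PHP 7.4 and later, mbstring also provides mb_str_split(), which does the same job as the loop in a single call. A sketch, assuming the mbstring extension is loaded; the wrapper name mb_chars and the fallback for older versions are illustrative and not part of the original answer.

```php
<?php
// Hypothetical wrapper; falls back to the mb_substr() loop on PHP < 7.4.
function mb_chars(string $string, string $encoding = 'UTF-8'): array
{
    if (function_exists('mb_str_split')) {
        // Passing the encoding explicitly avoids relying on mb_internal_encoding().
        return mb_str_split($string, 1, $encoding);
    }

    $chars = [];
    $length = mb_strlen($string, $encoding); // computed once, not per iteration
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($string, $i, 1, $encoding);
    }

    return $chars;
}

$chars = mb_chars('主楼怎么走');
```

Like the loop, this splits by code point, so a base letter and its combining mark end up in separate elements.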

masakielastic

You can use the grapheme functions (PHP 5.3 or intl 1.0) and IntlBreakIterator (PHP 5.5 or intl 3.0). The following code shows the difference between the intl, mbstring, and PCRE functions.

// http://www.php.net/manual/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // "a" + COMBINING RING ABOVE (U+030A), canonically equivalent to U+00E5
         ."o\xCC\x88"; // "o" + COMBINING DIAERESIS (U+0308), canonically equivalent to U+00F6

$expected = ["a\xCC\x8A", "o\xCC\x88"];
$expected2 = ["a", "\xCC\x8A", "o", "\xCC\x88"];

var_dump(
    $expected === str_to_array($string),
    $expected === str_to_array2($string),
    $expected2 === str_to_array3($string),
    $expected2 === str_to_array4($string),
    $expected2 ===  str_to_array5($string)
);

function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = grapheme_substr($string, $i, 1);
    }

    return $ret;
}

function str_to_array2($string)
{
    $it = IntlBreakIterator::createCharacterInstance('en_US');
    $it->setText($string);

    $ret = [];
    $prev = 0;

    foreach ($it as $pos) {

        $char = substr($string, $prev, $pos - $prev);

        if ('' !== $char) {
           $ret[] = $char;
        }

        $prev = $pos;
    }

    return $ret;
}

function str_to_array3($string)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($string);

    $ret = [];
    $prev = 0;

    foreach ($it as $pos) {

        $char = substr($string, $prev, $pos - $prev);

        if ('' !== $char) {
           $ret[] = $char;
        }

        $prev = $pos;
    }

    return $ret;
}

function str_to_array4($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = mb_substr($string, $i, 1, "UTF-8");
    }

    return $ret;
}

function str_to_array5($string) {
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}
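
The expected arrays differ because the three families count different units: bytes, code points, and grapheme clusters. A small sketch of the counts for the same test string; the grapheme call is guarded because it assumes the intl extension is installed.

```php
<?php
// "a" + COMBINING RING ABOVE, then "o" + COMBINING DIAERESIS
$string = "a\xCC\x8A" . "o\xCC\x88";

$bytes      = strlen($string);             // 6 bytes
$codePoints = mb_strlen($string, 'UTF-8'); // 4 code points
$graphemes  = function_exists('grapheme_strlen')
    ? grapheme_strlen($string)             // 2 user-perceived characters
    : null;                                // intl not available
```

If the input is normalized to NFC first (for example with the intl Normalizer class), the code point and grapheme counts for this string coincide.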

In a production environment, you need to replace invalid byte sequences with a substitute character, since almost all grapheme and mbstring functions cannot handle them. If you are interested, see my earlier answer: https://stackoverflow.com/a/13695364/531320
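
A minimal sketch of that scrubbing step, assuming the mbstring extension is loaded. mb_scrub() is available from PHP 7.2; on older versions, converting the string from UTF-8 to UTF-8 with mb_convert_encoding() has a similar effect. The helper name scrub_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: replaces malformed byte sequences so that later
// grapheme_* and mb_* calls always receive well-formed UTF-8.
function scrub_utf8(string $string): string
{
    if (function_exists('mb_scrub')) {
        return mb_scrub($string, 'UTF-8');
    }

    // Fallback for PHP < 7.2: re-encoding UTF-8 to UTF-8 substitutes invalid bytes.
    return mb_convert_encoding($string, 'UTF-8', 'UTF-8');
}

$dirty = "abc\xFFdef";       // the byte FF can never appear in UTF-8
$clean = scrub_utf8($dirty); // well-formed UTF-8, safe to split
```

The character inserted in place of the bad bytes is whatever mb_substitute_character() is currently set to, so set it explicitly if a specific replacement is required.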

If performance is not a concern, htmlspecialchars and htmlspecialchars_decode can be used. The advantage of this approach is that it supports encodings other than UTF-8.

function str_to_array6($string, $encoding = 'UTF-8')
{
    $ret = [];
    str_replace_callback($string, function($char, $index) use (&$ret) { $ret[] = $char; return ''; }, $encoding);
    return $ret;
}

function str_replace_callback($string, $callable, $encoding = 'UTF-8')
{
    // Scrub first: replacing invalid bytes can change the byte length.
    $string = str_scrub($string, $encoding);
    $str_size = strlen($string);

    $ret = '';
    $char = '';
    $index = 0;

    for ($pos = 0; $pos < $str_size; ++$pos) {

        $char .= $string[$pos];

        if (str_check_encoding($char, $encoding)) {

            $ret .= $callable($char, $index);
            $char = '';
            ++$index;
        }

    }

    return $ret;
}

function str_check_encoding($string, $encoding = 'UTF-8')
{
    $string = (string) $string;
    return $string === htmlspecialchars_decode(htmlspecialchars($string, ENT_QUOTES, $encoding));
}

function str_scrub($string, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($string, ENT_SUBSTITUTE, $encoding));
}

If you want to learn the UTF-8 specification, byte manipulation is a good way to practice.

function str_to_array7($string)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $size = strlen($string);
    $ret = [];

    for ($i = 0; $i < $size; $i += 1) {

        if ($string[$i] <= "\x7F") {

            $ret[] = $string[$i];

        } elseif ("\xC2" <= $string[$i] && $string[$i] <= "\xDF")  {

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < "\x80" || "\xBF" < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                $ret[] = substr($string, $i, 2);
                $i += 1;

            }

        } elseif ("\xE0" <= $string[$i] && $string[$i] <= "\xEF") {

            $left = "\xE0" === $string[$i] ? "\xA0" : "\x80";
            $right = "\xED" === $string[$i] ? "\x9F" : "\xBF";

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                if (!isset($string[$i+2])) {

                    $ret[] = $substitute;
                    return $ret;

                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {

                    $ret[] = $substitute;
                    $i += 1;

                } else {

                    $ret[] = substr($string, $i, 3);
                    $i += 2;

                }

            }

        } elseif ("\xF0" <= $string[$i] && $string[$i] <= "\xF4") {

            $left = "\xF0" === $string[$i] ? "\x90" : "\x80";
            $right = "\xF4" === $string[$i] ? "\x8F" : "\xBF";

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                if (!isset($string[$i+2])) {

                    $ret[] = $substitute;
                    return $ret;

                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {

                    $ret[] = $substitute;
                    $i += 1;

                } else {

                    if (!isset($string[$i+3])) {

                        $ret[] = $substitute;
                        return $ret;

                    } elseif ($string[$i+3] < "\x80" || "\xBF" < $string[$i+3]) {

                        $ret[] = $substitute;
                        $i += 2;

                    } else {

                        $ret[] = substr($string, $i, 4);
                        $i += 3;

                    }

                }

            }

        } else {

            $ret[] = $substitute;

        }

    }

    return $ret;

}
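
The lead-byte ranges checked in the function above come down to a small table. A compact sketch of that table; utf8_sequence_length is a hypothetical helper, it assumes a non-empty argument, and it deliberately ignores the tighter second-byte limits for E0, ED, F0 and F4 that the full function enforces.

```php
<?php
// Hypothetical helper: how many bytes a UTF-8 sequence should have,
// judged from its first (lead) byte alone. Returns 0 for bytes that
// can never start a sequence (continuation bytes, C0, C1, F5..FF).
function utf8_sequence_length(string $lead): int
{
    $byte = ord($lead); // ord() looks at the first byte only

    if ($byte <= 0x7F) { return 1; }                   // ASCII
    if ($byte >= 0xC2 && $byte <= 0xDF) { return 2; }  // U+0080..U+07FF
    if ($byte >= 0xE0 && $byte <= 0xEF) { return 3; }  // U+0800..U+FFFF
    if ($byte >= 0xF0 && $byte <= 0xF4) { return 4; }  // U+10000..U+10FFFF

    return 0;
}
```

A splitter built on this would still have to validate each continuation byte, which is where most of the branching in the longer function comes from.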

The benchmark results for these functions are shown below (elapsed seconds for 10,000 calls each).

grapheme
0.12967610359192
IntlBreakIterator::createCharacterInstance
0.17032408714294
IntlBreakIterator::createCodePointInstance
0.079245090484619
mbstring
0.081080913543701
preg_split
0.043133974075317
htmlspecialchars
0.25599694252014
byte manipulation
0.13132810592651

The benchmark code is here.

$string = '主楼怎么走';

foreach (timer([
    'grapheme' => 'str_to_array',
    'IntlBreakIterator::createCharacterInstance' => 'str_to_array2',
    'IntlBreakIterator::createCodePointInstance' => 'str_to_array3',
    'mbstring' => 'str_to_array4',
    'preg_split' => 'str_to_array5',
    'htmlspecialchars' => 'str_to_array6',
    'byte manipulation' => 'str_to_array7'
],
[$string]) as $desc => $time) {

  echo $desc, PHP_EOL,
       $time, PHP_EOL; 
}

function timer(array $callables, array $arguments, $repeat = 10000) {

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {

            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}