PHP mb_split(), capturing delimiters

Posted by 不打扰是莪最后的温柔 on 2021-02-09 11:55:41

Question


preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns all delimiters in the returned array. mb_split does not.

Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters?

I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but I would prefer a more generically usable solution.
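For reference, the preg_split behaviour the question describes can be sketched as follows. This is only an illustration on a plain ASCII string; the sample text and variable names are made up for the example.

```php
<?php
// The delimiter pattern is wrapped in a capturing group; with
// PREG_SPLIT_DELIM_CAPTURE the captured delimiters are returned
// in the result array, between the pieces they separate.
$text  = "one\ntwo\r\nthree";
$parts = preg_split('/(\r\n|\r|\n)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// $parts is now: "one", "\n", "two", "\r\n", "three"
var_dump($parts);
```

mb_split accepts only a pattern, a string and a limit, so it has no way to request the delimiters.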

Solution: Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP-multibyte-functions/blob/master/functions/mb_explode.php). It supports all the preg_split flags:

/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {       
    $strlen = strlen($string);      // bytes!   
    mb_ereg_search_init($string);

    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);

        // move position
        $position = $array[0] + $array[1];

        // capture delimiter
        $regs = mb_ereg_search_getregs();           
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

        // Continue on?
        if ($position >= $strlen) {
            break;
        }           
    }

    // Add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);

    // Substrings
    $parts = array();
    $position = 0;      
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter   = $length[1];
        $is_captured    = $length[2];

        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {          
                $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                            ? array(mb_strcut($string, $position), $position)
                            : mb_strcut($string, $position);                
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
               && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                        ? array(mb_strcut($string, $position, $length[0]), $position)
                        : mb_strcut($string, $position, $length[0]);
        }

        $position += $length[0];
    }

    return $parts;
}

Answer 1:


Only preg_split can capture delimiters; the other split functions, mb_split included, have no equivalent option.

That leaves three possibilities:

1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.

This is the simplest way; the second one is more involved. (Note that, in general, it is simpler to always work in UTF-8 than to deal with exotic encodings.)
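A minimal sketch of this first approach, assuming the mbstring extension is available and the source file is saved as UTF-8. The helper name split_via_utf8 and the UTF-16LE sample input are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): split a string in any
// mbstring-supported encoding via a UTF-8 round trip, keeping the delimiters.
function split_via_utf8($string, $encoding, $pattern)
{
    // 1. Convert to UTF-8 so that preg_split with the /u modifier can be used.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    // 2. The delimiter must sit in a capturing group to be returned.
    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed: ' . preg_last_error());
    }

    // 3. Convert every piece back to the original encoding.
    return array_map(function ($part) use ($encoding) {
        return mb_convert_encoding($part, $encoding, 'UTF-8');
    }, $parts);
}

// Example input in UTF-16LE, where a line feed occupies two bytes.
$original = mb_convert_encoding("日本語\r\n言語\n終", 'UTF-16LE', 'UTF-8');
$parts    = split_via_utf8($original, 'UTF-16LE', '/(\r\n|\r|\n)/u');

// Text and delimiters alternate, so joining the pieces restores the input.
var_dump(implode('', $parts) === $original);
```

The cost is two conversions per call, which is usually acceptable for line splitting.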

2) Instead of a split-like function, use for example mb_ereg_search_regs to get the matched parts, and build the pattern like this:

delimiter|all_that_is_not_the_delimiter

(Note that the two branches of the alternation must be mutually exclusive, and take care to write them so that no gaps are possible between results: the first part must start at the beginning of the string, the last part must end at the end of the string, and each part must be contiguous with the previous one.)
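The second approach could look roughly like this for the linebreak case. It is a sketch that assumes mbstring was built with its regex (Oniguruma) support and that the given encoding is one mb_regex_encoding accepts; the function name mb_split_lines_keep_breaks is hypothetical.

```php
<?php
// Hypothetical helper: walk the string with mb_ereg_search_regs, using a
// pattern of the form delimiter|all_that_is_not_the_delimiter.
function mb_split_lines_keep_breaks($string, $encoding)
{
    $previous = mb_regex_encoding();      // remember the global setting
    mb_regex_encoding($encoding);

    // Branches 1-3: a line break. Branch 4: a run of non-linebreak characters.
    // The branches are mutually exclusive and together cover every character,
    // so consecutive matches leave no gaps.
    mb_ereg_search_init($string, '\r\n|\r|\n|[^\r\n]+');

    $parts = array();
    while (($match = mb_ereg_search_regs()) !== false) {
        $parts[] = $match[0];             // index 0 holds the whole match
    }

    mb_regex_encoding($previous);         // restore the global setting
    return $parts;
}

$parts = mb_split_lines_keep_breaks("日本語\r\n言語\n\n終", 'UTF-8');

// $parts is now: "日本語", "\r\n", "言語", "\n", "\n", "終"
var_dump($parts);
```

Because every character belongs to exactly one branch, joining the returned parts reproduces the input string.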

3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they don't match any characters, only positions in the string. So you can use this kind of pattern, which matches the positions before or after the delimiter:

(?=delimiter)|(?<=delimiter)

(The limitation of this approach is that the subpattern in the lookbehind can't have a variable length, in other words you can't use a quantifier inside it, but it can be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3) )



Source: https://stackoverflow.com/questions/30605173/php-mb-split-capturing-delimiters
