What is the best way to split a string into an array of Unicode characters in PHP?

野的像风 2020-12-05 15:23

In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?

I want to know whether the set of Unicode

7 answers
  •  [愿得一人]
    2020-12-05 15:37

    If the regex approach isn't enough for you for some reason: I once wrote Zend_Locale_UTF8, which is now abandoned but might still help if you decide to do it on your own.

    In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and, to work with them, splits them into single characters (each of which may consist of multiple bytes).
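
    For reference, the regex approach mentioned above is usually written with preg_split and the u modifier. The following is only a minimal sketch: the function name split_unicode_chars and its $encoding parameter are hypothetical, it assumes the caller knows the input encoding, and converting from a non-UTF-8 encoding assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not from Zend_Locale_UTF8): split a string into
// Unicode characters (code points) using the "regex way".
function split_unicode_chars(string $input, string $encoding = 'UTF-8'): array
{
    // Assumption: the caller knows the source encoding. Anything other than
    // UTF-8 is converted first so that the /u modifier can be used.
    if (strcasecmp($encoding, 'UTF-8') !== 0) {
        $input = mb_convert_encoding($input, 'UTF-8', $encoding);
    }

    // With /u, the empty pattern matches between code points, not bytes.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false if the subject is not valid UTF-8.
    return $chars === false ? [] : $chars;
}
```

    Note that this splits on code points, so a base letter followed by a combining mark comes back as two array elements.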

    EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:

    /**
     * Returns the Unicode code points of $string as an array of integers.
     *
     * @access protected
     * @param string|integer $string
     * @return array
     */
    protected function _decode( $string ) {
    
        $string     = (string) $string;
        $length     = strlen($string);
        $sequence   = array();
    
        for ( $i=0; $i<$length; ) {
            $bytes      = $this->_characterBytes($string, $i);
            $ord        = $this->_ord($string, $bytes, $i);
    
            if ( $ord !== false )
                $sequence[] = $ord;
    
            if ( $bytes === false )
                $i++;
            else
                $i  += $bytes;
        }
    
        return $sequence;
    
    }
    
    /**
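     * Returns the Unicode code point of the UTF-8 character starting at $pos.
     *
     * @see http://en.wikipedia.org/wiki/UTF-8#Description
     * @access protected
     * @param string $string
     * @param integer|null $bytes  Byte length of the character; detected if null
     * @param integer $pos  Byte offset of the character's first byte
     * @return integer|false  The code point, or false if it cannot be decoded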
     */
    protected function _ord( &$string, $bytes = null, $pos=0 )
    {
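        if ( is_null($bytes) )
            // Detect the length of the character at $pos, not at offset 0.
            $bytes = $this->_characterBytes($string, $pos);
    
        // The whole sequence, starting at $pos, must fit inside the string.
        if ( strlen($string) >= $pos + $bytes ) {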
    
            switch ( $bytes ) {
                case 1:
                    return ord($string[$pos]);
                    break;
    
                case 2:
                    return  ( (ord($string[$pos])   & 0x1f) << 6 ) +
                            ( (ord($string[$pos+1]) & 0x3f) );
                    break;
    
                case 3:
                    return  ( (ord($string[$pos])   & 0xf)  << 12 ) + 
                            ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
                            ( (ord($string[$pos+2]) & 0x3f) );
                    break;
    
                case 4:
                    // Lead byte plus three continuation bytes at $pos+1..$pos+3.
                    return  ( (ord($string[$pos])   & 0x7)  << 18 ) +
                            ( (ord($string[$pos+1]) & 0x3f) << 12 ) +
                            ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
                            ( (ord($string[$pos+3]) & 0x3f) );
                    break;
    
                case 0:
                default:
                    return false;
            }
        }
    
        return false;
    }
    /**
     * Returns the number of bytes of the $position-th character.
     *
     * @see http://en.wikipedia.org/wiki/UTF-8#Description
     * @access protected
     * @param string $string
     * @param integer $position
     * @return integer|false  Byte length (1 to 4), or false for an invalid lead byte
     */
    protected function _characterBytes( &$string, $position = 0 ) {
        $char       = $string[$position];
        $charVal    = ord($char);
    
        if ( ($charVal & 0x80) === 0 )
            return 1;
    
        elseif ( ($charVal & 0xe0) === 0xc0 )
            return 2;
    
        elseif ( ($charVal & 0xf0) === 0xe0 )
            return 3;
    
        elseif ( ($charVal & 0xf8) === 0xf0)
            return 4;
        /*
        elseif ( ($charVal & 0xfe) === 0xf8 )
            return 5;
        */
    
        return false;
    }
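
    To get an array of characters rather than code points, the same lead-byte test can drive substr() directly. This is a standalone sketch, not part of Zend_Locale_UTF8: the function name utf8_split_by_lead_byte is hypothetical, it assumes the input is meant to be UTF-8, and it only inspects lead bytes, so it does not verify that the bytes that follow are valid continuation bytes.

```php
<?php
// Hypothetical standalone helper: split a UTF-8 string into characters
// by looking at the lead byte of each sequence.
function utf8_split_by_lead_byte(string $string): array
{
    $chars  = [];
    $length = strlen($string);

    for ($i = 0; $i < $length; ) {
        $lead = ord($string[$i]);

        if (($lead & 0x80) === 0x00) {
            $bytes = 1;             // 0xxxxxxx: ASCII
        } elseif (($lead & 0xe0) === 0xc0) {
            $bytes = 2;             // 110xxxxx
        } elseif (($lead & 0xf0) === 0xe0) {
            $bytes = 3;             // 1110xxxx
        } elseif (($lead & 0xf8) === 0xf0) {
            $bytes = 4;             // 11110xxx
        } else {
            $i++;                   // stray continuation or invalid byte: skip it
            continue;
        }

        // Drop a sequence that is cut off by the end of the string.
        if ($i + $bytes <= $length) {
            $chars[] = substr($string, $i, $bytes);
        }
        $i += $bytes;
    }

    return $chars;
}
```

    Invalid lead bytes are skipped one byte at a time, which mirrors how _decode advances by one when _characterBytes returns false.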
    
