Multi-byte safe wordwrap() function for UTF-8

太阳男子 2020-12-01 13:17

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8.

There are a few examples of mb safe functions in the comments, but with some di

9 answers
  •  不知归路
    2020-12-01 13:27

    Custom word boundaries

    Unicode text has many more potential word boundaries than 8-bit encodings, including 17 space separators and the fullwidth comma. This solution lets you supply your own list of word boundaries for your application.
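
As an illustration of what such a list might look like, here is a minimal sketch. The variable `$boundaries`, the helper `isBoundary()`, and the particular characters chosen are assumptions made for this example, not part of the original answer.

```php
<?php
// Hypothetical boundary list: which characters count as break points
// is an application decision, not something fixed by the original answer.
$boundaries = [
  ' ',         // U+0020 SPACE
  "\t",        // U+0009 TAB
  "\n",        // U+000A LINE FEED
  ',',         // U+002C COMMA
  "\u{2003}",  // U+2003 EM SPACE
  "\u{3000}",  // U+3000 IDEOGRAPHIC SPACE
  "\u{3001}",  // U+3001 IDEOGRAPHIC COMMA
  "\u{FF0C}",  // U+FF0C FULLWIDTH COMMA
];

// Strict comparison: each list entry is a complete UTF-8 character,
// so a boundary matches only when the whole byte sequence is identical.
function isBoundary(string $char, array $boundaries): bool
{
  return in_array($char, $boundaries, true);
}

var_dump(isBoundary("\u{FF0C}", $boundaries)); // bool(true)
var_dump(isBoundary('a', $boundaries));        // bool(false)
```

The `"\u{...}"` escapes require PHP 7.0 or later. On older versions the characters can be written literally in a UTF-8 encoded source file.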

    Better performance

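    Have you ever benchmarked the mb_* family of PHP built-ins? They scale poorly when called once per character, because a function such as mb_substr() has to walk a variable-width string to locate a character offset. A custom nextCharUtf8() that advances a byte pointer does the same job in a single pass, and can be orders of magnitude faster, especially on large strings.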
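
To check the performance claim on your own PHP build, a rough comparison along these lines can help. This is only a sketch: `utf8SeqLength()`, `charsViaMb()`, and `charsViaPointer()` are hypothetical helpers written for this example, valid UTF-8 input is assumed, the mbstring extension must be loaded, and timings will vary by PHP version.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 sequence that starts
// with the given lead byte (valid UTF-8 input is assumed).
function utf8SeqLength(int $lead): int
{
  if ($lead < 0x80) {
    return 1;
  }
  if ($lead < 0xE0) {
    return 2;
  }
  if ($lead < 0xF0) {
    return 3;
  }
  return 4;
}

// Character-by-character extraction with mb_substr(); each call may
// have to scan the string to find the requested character offset.
function charsViaMb(string $s): array
{
  $out = [];
  $n = mb_strlen($s, 'UTF-8');
  for ($i = 0; $i < $n; $i++) {
    $out[] = mb_substr($s, $i, 1, 'UTF-8');
  }
  return $out;
}

// Single pass over the bytes with a moving pointer.
function charsViaPointer(string $s): array
{
  $out = [];
  $p = 0;
  $n = strlen($s);
  while ($p < $n) {
    $bytes = utf8SeqLength(ord($s[$p]));
    $out[] = substr($s, $p, $bytes);
    $p += $bytes;
  }
  return $out;
}

$text = str_repeat('日本語 text ', 500);

$t = microtime(true);
$a = charsViaMb($text);
$mbSeconds = microtime(true) - $t;

$t = microtime(true);
$b = charsViaPointer($text);
$pointerSeconds = microtime(true) - $t;

printf(
  "mb_substr loop: %.4fs, pointer walk: %.4fs, same result: %s\n",
  $mbSeconds,
  $pointerSeconds,
  $a === $b ? 'yes' : 'no'
);
```

Increasing the repeat count makes any difference in scaling easier to see, since the gap should widen as the string grows if the per-call cost depends on the offset.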

    // ... (the start of this function is missing from the source)
        if ($lineLen + $len > $width) {
          if ($line) {
            $lines[] = $line;
            $lineLen = 0;
            $line = '';
          }
        }
        $line .= $chunk;
        $lineLen += $len;
      }
      if ($line) {
        $lines[] = $line;
      }
      return implode($break, $lines);
    }
    
    function nextCharUtf8(&$string, &$pointer)
    {
      // EOF
      if (!isset($string[$pointer])) {
        return null;
      }
    
      // Get the byte value at the pointer
      $char = ord($string[$pointer]);
    
      // ASCII
      if ($char < 128) {
        return $string[$pointer++];
      }
    
      // UTF-8
      if ($char < 224) {
        $bytes = 2;
      } elseif ($char < 240) {
        $bytes = 3;
      } elseif ($char < 248) {
        $bytes = 4;
      } elseif ($char < 252) {
        $bytes = 5;
      } else {
        $bytes = 6;
      }
    
      // Get full multibyte char
      $str = substr($string, $pointer, $bytes);
    
      // Increment pointer according to length of char
      $pointer += $bytes;
    
      // Return mb char
      return $str;
    }
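
The opening of the wrapping function above did not survive, so here is a minimal, self-contained sketch of how the whole routine could fit together. It is a reconstruction under assumptions, not the answer's original code: the name `wordWrapUtf8Sketch()`, its parameters, and its defaults are hypothetical, and characters are split with `preg_split()` and the `u` modifier instead of `nextCharUtf8()` so the block runs without any other code. Only the line-filling loop follows the surviving fragment.

```php
<?php
// Hypothetical reconstruction: the function name, parameters, and
// defaults are assumptions made for this sketch.
function wordWrapUtf8Sketch(
  string $text,
  int $width = 75,
  string $break = "\n",
  array $boundaries = [' ', "\n", "\t", ',']
): string {
  // Split into UTF-8 characters (stand-in for a nextCharUtf8() loop).
  // Valid UTF-8 input is assumed; preg_split() returns false otherwise.
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

  // Group characters into chunks that each end at a boundary character.
  $chunks = [];
  $chunk = '';
  $len = 0;
  foreach ($chars as $char) {
    $chunk .= $char;
    $len++;
    if (in_array($char, $boundaries, true)) {
      $chunks[] = [$len, $chunk];
      $chunk = '';
      $len = 0;
    }
  }
  if ($chunk !== '') {
    $chunks[] = [$len, $chunk];
  }

  // Fill lines chunk by chunk, as in the surviving fragment.
  $lines = [];
  $line = '';
  $lineLen = 0;
  foreach ($chunks as [$len, $chunk]) {
    if ($lineLen + $len > $width) {
      if ($line !== '') {
        $lines[] = $line;
        $lineLen = 0;
        $line = '';
      }
    }
    $line .= $chunk;
    $lineLen += $len;
  }
  if ($line !== '') {
    $lines[] = $line;
  }
  return implode($break, $lines);
}

// Width is counted in characters, not bytes.
$wrapped = wordWrapUtf8Sketch('日本語 のテキスト です', 6);
echo $wrapped, "\n";
```

Because each boundary character stays attached to the chunk before it, wrapped lines keep their trailing separator, and a single chunk longer than the width is placed on a line of its own rather than being cut.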
    
