PHP's wordwrap() function doesn't work correctly for multibyte encodings such as UTF-8, because it counts bytes rather than characters.
There are a few examples of mb safe functions in the comments, but with some di
Unicode text has many more potential word boundaries than text in 8-bit encodings, including the 17 space separators (general category Zs) and the fullwidth comma (U+FF0C). This solution lets you customize the list of word boundaries for your application.
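As a rough illustration of what a customizable boundary list could look like, here is a minimal sketch. The names `$wordBoundaries` and `isWordBoundary()` are hypothetical and not part of the original solution, and the chosen characters are only a sample. Keying the array by the UTF-8 encoded character makes each check a plain `isset()` lookup on a byte string.

```php
<?php
// Hypothetical boundary table (an assumption for illustration only).
// Keys are UTF-8 encoded characters; the "\u{...}" escapes need PHP 7+.
$wordBoundaries = [
    " "        => true, // U+0020 SPACE
    "\t"       => true, // U+0009 CHARACTER TABULATION
    "\u{1680}" => true, // OGHAM SPACE MARK
    "\u{2002}" => true, // EN SPACE
    "\u{2003}" => true, // EM SPACE
    "\u{2009}" => true, // THIN SPACE
    "\u{3000}" => true, // IDEOGRAPHIC SPACE
    "\u{FF0C}" => true, // FULLWIDTH COMMA
];

// Returns true when $char (one UTF-8 encoded character) is in the table.
function isWordBoundary(string $char, array $boundaries): bool
{
    return isset($boundaries[$char]);
}
```

An application that should not break after, say, the fullwidth comma would simply leave that key out of its own table.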
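Have you ever benchmarked the mb_* family of PHP built-ins? They don't scale well: character-offset calls such as mb_substr() generally have to walk the variable-width string from the start to locate each offset. By using a custom nextCharUtf8() that simply advances a byte pointer, we can do the same job potentially orders of magnitude faster, especially on large strings.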
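        $width) {
            // Flush the current line before it would overflow.
            // Compare against '' so a line consisting of "0" is not dropped.
            if ($line !== '') {
                $lines[] = $line;
                $lineLen = 0;
                $line = '';
            }
        }
        $line .= $chunk;
        $lineLen += $len;
    }
    // Flush whatever is left after the last chunk
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode($break, $lines);
}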
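/**
 * Return the UTF-8 character starting at byte offset $pointer and advance
 * $pointer past it. Returns null at the end of the string.
 * The input is assumed to be well-formed; no validation is performed.
 */
function nextCharUtf8(&$string, &$pointer)
{
    // EOF
    if (!isset($string[$pointer])) {
        return null;
    }
    // Get the byte value at the pointer
    $char = ord($string[$pointer]);
    // ASCII
    if ($char < 128) {
        return $string[$pointer++];
    }
    // UTF-8: the lead byte determines the sequence length
    if ($char < 224) {
        $bytes = 2;
    } elseif ($char < 240) {
        $bytes = 3;
    } elseif ($char < 248) {
        $bytes = 4;
    } elseif ($char < 252) {
        // Legacy 5-byte form (lead bytes 248-251), not valid under RFC 3629
        $bytes = 5;
    } else {
        // Legacy 6-byte form, not valid under RFC 3629
        $bytes = 6;
    }
    // Get full multibyte char
    $str = substr($string, $pointer, $bytes);
    // Increment pointer according to length of char
    $pointer += $bytes;
    // Return mb char
    return $str;
}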
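Since the top of the wrapping function is not shown above, here is a self-contained sketch of how the pieces could fit together. It is not the original implementation: `wrapUtf8Sketch()`, `readUtf8Char()`, the parameter order, and the default boundary list are all assumptions made for illustration. It splits the text into chunks that end at a boundary character, then packs chunks into lines the same way the visible part of the listing does. Input is assumed to be well-formed UTF-8, and a single chunk longer than the width is not cut.

```php
<?php
// Hypothetical helper: read one UTF-8 character at byte offset $i and
// advance $i. Handles 1 to 4 byte sequences; input is not validated.
function readUtf8Char(string $s, int &$i): ?string
{
    if (!isset($s[$i])) {
        return null;
    }
    $b = ord($s[$i]);
    $n = $b < 0x80 ? 1 : ($b < 0xE0 ? 2 : ($b < 0xF0 ? 3 : 4));
    $c = substr($s, $i, $n);
    $i += $n;
    return $c;
}

// Hypothetical wrapper: widths are counted in characters, not bytes.
function wrapUtf8Sketch(
    string $text,
    int $width = 75,
    string $break = "\n",
    array $boundaries = [" " => true]
): string {
    // Pass 1: split into [chunk, characterCount] pairs. A chunk is a run
    // of characters up to and including the next boundary character.
    $chunks = [];
    $chunk = '';
    $len = 0;
    $i = 0;
    while (($c = readUtf8Char($text, $i)) !== null) {
        $chunk .= $c;
        $len++;
        if (isset($boundaries[$c])) {
            $chunks[] = [$chunk, $len];
            $chunk = '';
            $len = 0;
        }
    }
    if ($chunk !== '') {
        $chunks[] = [$chunk, $len];
    }

    // Pass 2: pack chunks into lines, starting a new line on overflow.
    $lines = [];
    $line = '';
    $lineLen = 0;
    foreach ($chunks as [$chunk, $len]) {
        if ($lineLen + $len > $width && $line !== '') {
            $lines[] = $line;
            $line = '';
            $lineLen = 0;
        }
        $line .= $chunk;
        $lineLen += $len;
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode($break, $lines);
}

// Breaking after a fullwidth comma with a width of 4 characters:
echo wrapUtf8Sketch("日本語，テキスト", 4, "\n", ["\u{FF0C}" => true]);
// prints:
// 日本語，
// テキスト
```

Note that in this sketch the boundary character stays at the end of the line it closes and counts toward the width, so wrapping on spaces leaves a trailing space on each wrapped line. Trimming it before the line is stored would be a reasonable refinement.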