Question
PHP's wordwrap() function doesn't work correctly for multi-byte strings such as UTF-8.
There are a few examples of mb-safe functions in the comments, but with some different test data they all seem to have some problems.
The function should take the exact same parameters as wordwrap().
Specifically, be sure it works to:
- cut mid-word if $cut = true, and not cut mid-word otherwise
- not insert extra spaces in words if $break = ' '
- also work for $break = "\n"
- work for ASCII and all valid UTF-8
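To make the failure concrete, here is a minimal sketch of what goes wrong with $cut = true. The sample string is my own, not from the question, and it assumes the mbstring extension is loaded so the result can be checked.

```php
<?php
// "é" is two bytes in UTF-8, so each five-letter word below is ten bytes long.
$text = 'ééééé ééééé';

// wordwrap() counts bytes: with $cut = true it cuts after byte 5,
// which falls in the middle of the third "é".
$wrapped = wordwrap($text, 5, "\n", true);

var_dump(mb_check_encoding($text, 'UTF-8'));    // bool(true): the input is valid
var_dump(mb_check_encoding($wrapped, 'UTF-8')); // bool(false): the output is not
```

A multi-byte safe replacement has to count characters rather than bytes so that a cut never lands inside a character.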
Answer 1:
I haven't found any code that works for me, so here is what I've written. It works for me, though it is probably not the fastest.
function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false) {
    $lines = explode($break, $str);
    foreach ($lines as &$line) {
        $line = rtrim($line);
        if (mb_strlen($line) <= $width)
            continue;
        $words = explode(' ', $line);
        $line = '';
        $actual = '';
        foreach ($words as $word) {
            if (mb_strlen($actual.$word) <= $width)
                $actual .= $word.' ';
            else {
                if ($actual != '')
                    $line .= rtrim($actual).$break;
                $actual = $word;
                if ($cut) {
                    while (mb_strlen($actual) > $width) {
                        $line .= mb_substr($actual, 0, $width).$break;
                        $actual = mb_substr($actual, $width);
                    }
                }
                $actual .= ' ';
            }
        }
        $line .= trim($actual);
    }
    return implode($break, $lines);
}
Answer 2:
/**
 * wordwrap for utf8 encoded strings
 *
 * @param string $str
 * @param integer $width
 * @param string $break
 * @param boolean $cut
 * @return string
 * @author Milian Wolff <mail@milianw.de>
 */
function utf8_wordwrap($str, $width, $break, $cut = false) {
    if (!$cut) {
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.',}\b#U';
    } else {
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
    }
    if (function_exists('mb_strlen')) {
        $str_len = mb_strlen($str, 'UTF-8');
    } else {
        $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
    }
    $while_what = ceil($str_len / $width);
    $i = 1;
    $return = '';
    while ($i < $while_what) {
        // Stop if nothing matches, so $matches[0] is never read when unset.
        if (!preg_match($regexp, $str, $matches)) {
            break;
        }
        $string = $matches[0];
        $return .= $string.$break;
        $str = substr($str, strlen($string));
        $i++;
    }
    return $return.$str;
}
Total time: 0.0020880699 is good time :)
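A side note on the character-counting part: the hand-written byte ranges above can also be expressed with PCRE's u modifier, under which a dot matches one whole code point. This is a minimal sketch with a sample string of my own, not the answer's code.

```php
<?php
// With the "u" modifier the subject is treated as UTF-8,
// so ".{3}" means three characters rather than three bytes.
$text = 'äöüß';

preg_match('/^.{3}/us', $text, $withU); // character-based
preg_match('/^.{3}/s', $text, $noU);    // byte-based

var_dump($withU[0] === 'äöü');                 // bool(true)
var_dump(mb_check_encoding($noU[0], 'UTF-8')); // bool(false): cut inside "ö"
```

The trade-off is that with the u modifier the preg_* functions fail on input that is not valid UTF-8, so it may be worth validating the input first.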
Answer 3:
Because no answer was handling every use case, here is something that does. The code is based on Drupal’s AbstractStringWrapper::wordWrap.
<?php
/**
* Wraps any string to a given number of characters.
*
* This implementation is multi-byte aware and relies on {@link
* http://www.php.net/manual/en/book.mbstring.php PHP's multibyte
* string extension}.
*
* @see wordwrap()
* @link https://api.drupal.org/api/drupal/core%21vendor%21zendframework%21zend-stdlib%21Zend%21Stdlib%21StringWrapper%21AbstractStringWrapper.php/function/AbstractStringWrapper%3A%3AwordWrap/8
* @param string $string
* The input string.
* @param int $width [optional]
* The number of characters at which <var>$string</var> will be
* wrapped. Defaults to <code>75</code>.
* @param string $break [optional]
* The line is broken using the optional break parameter. Defaults
* to <code>"\n"</code>.
* @param boolean $cut [optional]
* If the <var>$cut</var> is set to <code>TRUE</code>, the string is
* always wrapped at or before the specified <var>$width</var>. So if
* you have a word that is larger than the given <var>$width</var>, it
* is broken apart. Defaults to <code>FALSE</code>.
* @return string
* Returns the given <var>$string</var> wrapped at the specified
* <var>$width</var>.
*/
function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
    $string = (string) $string;
    if ($string === '') {
        return '';
    }
    $break = (string) $break;
    if ($break === '') {
        trigger_error('Break string cannot be empty', E_USER_ERROR);
    }
    $width = (int) $width;
    if ($width === 0 && $cut) {
        trigger_error('Cannot force cut when width is zero', E_USER_ERROR);
    }
    if (strlen($string) === mb_strlen($string)) {
        return wordwrap($string, $width, $break, $cut);
    }
    $stringWidth = mb_strlen($string);
    $breakWidth = mb_strlen($break);
    $result = '';
    $lastStart = $lastSpace = 0;
    for ($current = 0; $current < $stringWidth; $current++) {
        $char = mb_substr($string, $current, 1);
        $possibleBreak = $char;
        if ($breakWidth !== 1) {
            $possibleBreak = mb_substr($string, $current, $breakWidth);
        }
        if ($possibleBreak === $break) {
            $result .= mb_substr($string, $lastStart, $current - $lastStart + $breakWidth);
            $current += $breakWidth - 1;
            $lastStart = $lastSpace = $current + 1;
            continue;
        }
        if ($char === ' ') {
            if ($current - $lastStart >= $width) {
                $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
                $lastStart = $current + 1;
            }
            $lastSpace = $current;
            continue;
        }
        if ($current - $lastStart >= $width && $cut && $lastStart >= $lastSpace) {
            $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
            $lastStart = $lastSpace = $current;
            continue;
        }
        if ($current - $lastStart >= $width && $lastStart < $lastSpace) {
            $result .= mb_substr($string, $lastStart, $lastSpace - $lastStart) . $break;
            $lastStart = $lastSpace = $lastSpace + 1;
            continue;
        }
    }
    if ($lastStart !== $current) {
        $result .= mb_substr($string, $lastStart, $current - $lastStart);
    }
    return $result;
}
?>
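The early return to wordwrap() in the function above works because strlen() and mb_strlen() only agree when every character is a single byte. Here is a small sketch of that check on its own. isSingleByteOnly() is a hypothetical name, and I pass 'UTF-8' explicitly, whereas the answer relies on the configured mb_internal_encoding().

```php
<?php
/**
 * Hypothetical helper: true when the byte count equals the character count,
 * i.e. the string contains no multi-byte UTF-8 characters.
 */
function isSingleByteOnly(string $s): bool
{
    return strlen($s) === mb_strlen($s, 'UTF-8');
}

var_dump(isSingleByteOnly('hello world')); // bool(true)
var_dump(isSingleByteOnly('héllo wörld')); // bool(false)
```

If the internal encoding is not UTF-8, the comparison in the answer may take the wrong branch, which is one reason to pass the encoding explicitly.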
Answer 4:
function mb_wordwrap($str, $width = 74, $break = "\r\n", $cut = false)
{
    return preg_replace(
        '~(?P<str>.{' . $width . ',}?' . ($cut ? '(?(?!.+\s+)\s*|\s+)' : '\s+') . ')(?=\S+)~mus',
        '$1' . $break,
        $str
    );
}
Answer 5:
Custom word boundaries
Unicode text has many more potential word boundaries than 8-bit encodings, including 17 space separators and the full-width comma. This solution lets you customize the list of word boundaries for your application.
Better performance
Have you ever benchmarked the mb_* family of PHP built-ins? They don't scale well at all. By using a custom nextCharUtf8(), we can do the same job, but orders of magnitude faster, especially on large strings.
<?php
function wordWrapUtf8(
    string $phrase,
    int $width = 75,
    string $break = "\n",
    bool $cut = false,
    array $seps = [' ', "\n", "\t", ',']
): string
{
    // Pass 1: split the phrase into chunks that end at a separator
    // (or at $width characters when $cut is set).
    $chunks = [];
    $chunk = '';
    $len = 0;
    $pointer = 0;
    while (!is_null($char = nextCharUtf8($phrase, $pointer))) {
        $chunk .= $char;
        $len++;
        if (in_array($char, $seps, true) || ($cut && $len === $width)) {
            $chunks[] = [$len, $chunk];
            $len = 0;
            $chunk = '';
        }
    }
    // Compare against '' so that a trailing "0" is not dropped as falsy.
    if ($chunk !== '') {
        $chunks[] = [$len, $chunk];
    }
    // Pass 2: pack chunks into lines of at most $width characters.
    $line = '';
    $lines = [];
    $lineLen = 0;
    foreach ($chunks as [$len, $chunk]) {
        if ($lineLen + $len > $width) {
            if ($line !== '') {
                $lines[] = $line;
                $lineLen = 0;
                $line = '';
            }
        }
        $line .= $chunk;
        $lineLen += $len;
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode($break, $lines);
}
function nextCharUtf8(&$string, &$pointer)
{
    // EOF
    if (!isset($string[$pointer])) {
        return null;
    }
    // Get the byte value at the pointer
    $char = ord($string[$pointer]);
    // ASCII
    if ($char < 128) {
        return $string[$pointer++];
    }
    // UTF-8: the lead byte determines the sequence length.
    // 5- and 6-byte forms are not valid UTF-8 under RFC 3629; they are
    // handled only so that the pointer still advances on such input.
    if ($char < 224) {
        $bytes = 2;
    } elseif ($char < 240) {
        $bytes = 3;
    } elseif ($char < 248) {
        $bytes = 4;
    } elseif ($char < 252) {
        $bytes = 5;
    } else {
        $bytes = 6;
    }
    // Get full multibyte char
    $str = substr($string, $pointer, $bytes);
    // Increment pointer according to length of char
    $pointer += $bytes;
    // Return mb char
    return $str;
}
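If listing every separator by hand in $seps gets unwieldy, one alternative is to test each character against a Unicode property class. This is only a sketch of that idea, not part of the answer above. isWordBoundary() is a hypothetical helper, and the choice of boundary characters is an assumption you would adapt to your application.

```php
<?php
/**
 * Hypothetical helper: true if $char is exactly one character that
 * should count as a word boundary.
 */
function isWordBoundary(string $char): bool
{
    // \p{Zs} = Unicode space separators, \s = whitespace such as "\n" and "\t",
    // followed by the ASCII comma and the full-width comma (U+FF0C).
    return preg_match('/\A[\p{Zs}\s,\x{FF0C}]\z/u', $char) === 1;
}

var_dump(isWordBoundary(' '));        // bool(true): ASCII space
var_dump(isWordBoundary("\u{3000}")); // bool(true): ideographic space
var_dump(isWordBoundary("\u{FF0C}")); // bool(true): full-width comma
var_dump(isWordBoundary('a'));        // bool(false)
```

A regex call per character is likely slower than the in_array() lookup used above, so this trades some of the speed the answer is after for broader coverage.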
Answer 6:
Here is the multibyte wordwrap function I have coded, taking inspiration from others found on the internet.
function mb_wordwrap($long_str, $width = 75, $break = "\n", $cut = false) {
    $long_str = html_entity_decode($long_str, ENT_COMPAT, 'UTF-8');
    $width -= mb_strlen($break);
    if ($cut) {
        $short_str = mb_substr($long_str, 0, $width);
        $short_str = trim($short_str);
    }
    else {
        // The "u" modifier makes the quantifier count characters instead of bytes.
        $short_str = preg_replace('/^(.{1,'.$width.'})(?:\s.*|$)/u', '$1', $long_str);
        if (mb_strlen($short_str) > $width) {
            $short_str = mb_substr($short_str, 0, $width);
        }
    }
    if (mb_strlen($long_str) != mb_strlen($short_str)) {
        $short_str .= $break;
    }
    return $short_str;
}
Don't forget to configure PHP to use UTF-8 with:
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
I hope this will help. Guillaume
Answer 7:
Just want to share an alternative I found on the net.
<?php
if ( !function_exists('mb_str_split') ) {
    function mb_str_split($string, $split_length = 1)
    {
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');
        $split_length = ($split_length <= 0) ? 1 : $split_length;
        $mb_strlen = mb_strlen($string, 'utf-8');
        $array = array();
        for($i = 0; $i < $mb_strlen; $i += $split_length) {
            $array[] = mb_substr($string, $i, $split_length);
        }
        return $array;
    }
}
Using mb_str_split, you can use join to combine the chunks with <br>.
<?php
$text = '<utf-8 content>';
echo join('<br>', mb_str_split($text, 20));
And finally, create your own helper, perhaps mb_textwrap:
<?php
if( !function_exists('mb_textwrap') ) {
    function mb_textwrap($text, $length = 20, $concat = '<br>')
    {
        return join($concat, mb_str_split($text, $length));
    }
}
$text = '<utf-8 content>';
// so simply call
echo mb_textwrap($text);
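Note that from PHP 7.4 on, mb_str_split() is a built-in function, so the function_exists() guard above skips the polyfill there. A minimal sketch with the built-in follows. The sample text is my own, and it assumes PHP 7.4 or later with mbstring loaded.

```php
<?php
$text = 'äöüßäöüß';

// Built-in mb_str_split(string, length, encoding), available from PHP 7.4.
$chunks = mb_str_split($text, 3, 'UTF-8');

// $chunks is ['äöü', 'ßäö', 'üß']
echo implode("\n", $chunks), PHP_EOL;
```

Keep in mind that this splits at fixed character counts and ignores spaces entirely, so it behaves like a hard cut on every line rather than a true word wrap.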
Answer 8:
Here's my own attempt at a function that passed a few of my own tests, though I can't promise it's 100% perfect, so please post a better one if you see a problem.
/**
 * Multi-byte safe version of wordwrap()
 * Seems to me like wordwrap() is only broken on UTF-8 strings when $cut = true
 * @return string
 */
function wrap($str, $len = 75, $break = " ", $cut = true) {
    $len = (int) $len;
    // Compare against '' so that the string "0" is not treated as empty.
    if ((string) $str === '')
        return "";
    $pattern = "";
    if ($cut)
        // Pass the delimiter to preg_quote() so a "/" in $break is escaped too.
        $pattern = '/([^'.preg_quote($break, '/').']{'.$len.'})/u';
    else
        return wordwrap($str, $len, $break);
    return preg_replace($pattern, "\${1}".$break, $str);
}
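One caveat on the assumption in the docblock above: with $cut = false, wordwrap() does not corrupt characters, but it still measures the width in bytes, so lines of multi-byte text break earlier than expected. This is a small sketch with sample text of my own.

```php
<?php
// Each "éé" is 2 characters but 4 bytes in UTF-8.
$text = 'éé éé éé';

// By character count, "éé éé" (5 characters) fits in a width of 5,
// but wordwrap() sees 9 bytes and breaks at the first space.
$wrapped = wordwrap($text, 5, "\n");

echo str_replace("\n", '|', $wrapped), PHP_EOL; // prints "éé|éé|éé"
```

A character-aware wrap would give "éé éé" on the first line and "éé" on the second. Whether that difference matters depends on how strict your width limit is.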
Answer 9:
This one seems to work well...
function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
    if ($charset === null) $charset = mb_internal_encoding();
    $pieces = explode($break, $str);
    $result = array();
    foreach ($pieces as $piece) {
        $current = $piece;
        // Pass $charset to every mb_* call so lengths and offsets agree.
        while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // A null length means "to the end of the string" (no 2048-character cap).
            $current = mb_substr($current, $width, null, $charset);
        }
        $result[] = $current;
    }
    return implode($break, $result);
}
Source: https://stackoverflow.com/questions/3825226/multi-byte-safe-wordwrap-function-for-utf-8