How do I find the number of bytes within UTF-8 string with PHP?

我只是一个虾纸丫 提交于 2019-12-04 05:23:28

I am asking this as I need to shorten a utf-8 string to a certain number of bytes.

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.

strlen() returns the number of bytes.

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.

The other thing you need to handle is that when you put a string into json notation, it might need more bytes to represent it as json. For example, if your string contains a double quote character. It needs to be escaped, and the backslash character will add one byte. There's other characters that need to be escaped too. Point is, it can get larger. I assume the byte limit is on the total json payload, so you do need to account for the json syntax itself, as well as any escaping that json will impose on your string.

An unoptimized, kinda hacky way to do it is to chop the string, at say 5 bytes more than your limit, using substr(). Now use mb_strlen() to get number of characters, and mb_substr() to remove the last character. Now encode it as json, and measure the bytes via strlen(). Enter a loop, which keeps chopping off the last character using mb_substr(), encodes as json, and again measure bytes using strlen(). The loop terminates when the number of bytes is acceptable.

In PHP 5, mb_strlen should return the number of characters ; and strlen should return the number of bytes.

For instance, this portion of code :

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

Should get you the following output :

196
359


As a sidenote : this is one the the things that PHP 6 will change : PHP 6 will be using Unicode by default, which means strlen should, in PHP 6, return the number of characters, and not a number of bytes anymore.

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');

Count of Bytes <> String length!

to get the count of byte you can use (php4,5) strlen. to get the unicode string (utf8 encoded) length you can use mb_strlen (take care about function overloading from that extension) or you can simply count all bytes which do not have the 8th bit set.

8th bit means for this unicodechar is coming at least one more byte from input.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!