strlen() and UTF-8 encoding

Anonymous (unverified), submitted 2019-12-03 00:46:02

Question:

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?

I'm only interested to know about strlen(), not other functions

I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.

I don't see anything in the manual for strlen, or in anything I've read on UTF-8, that would explain why some of the characters above would count for less than one byte.

PS: This question and answer (4) comes from a mock test for ZCE I bought on eBay.

PPS: Please throw me a bone and vote this up. I did my homework. Thanks in advance to all replies and votes.

Answer 1:

If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
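To make the byte count concrete, here is a minimal sketch. It assumes the string in question is the six characters $, 1, ï, ¿, ½, 2 (an assumption, inferred from the byte sequence discussed in answer 4). The non-ASCII characters are written as \u{...} escapes (PHP 7+) so the result does not depend on how the script file itself is saved. The decomposed variant is one illustration of the "multiple representations" caveat.

```php
<?php
// Assumed string: $ 1 ï ¿ ½ 2 (six characters), built from code point escapes.
$precomposed = "\$1\u{00EF}\u{00BF}\u{00BD}2";

// Same visible text, but ï written as "i" + U+0308 (combining diaeresis).
$decomposed = "\$1i\u{0308}\u{00BF}\u{00BD}2";

// strlen() counts bytes: 3 ASCII bytes + 3 two-byte characters.
echo strlen($precomposed), "\n"; // 9

// 4 ASCII bytes + 3 two-byte code points.
echo strlen($decomposed), "\n";  // 10
```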

The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.



Answer 2:

How about using mb_strlen()?

http://lt.php.net/manual/en/function.mb-strlen.php

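But if you need to use strlen, it's possible to configure PHP on your web server by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. (Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.)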



Answer 3:

You need to use the multibyte string function mb_strlen(), like:

mb_strlen($string, 'UTF-8'); 
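As a rough illustration of the difference, the sketch below compares the two functions on a hypothetical 4-character string. The ¢ character is chosen only for illustration and is not taken from the question. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical 4-character string: $ 1 ¢ 2 (¢ is U+00A2, chosen for illustration).
$string = "\$1\u{00A2}2";

echo strlen($string), "\n";             // 5: bytes (¢ takes two bytes in UTF-8)
echo mb_strlen($string, 'UTF-8'), "\n"; // 4: characters (code points)
```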


Answer 4:

It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.

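The sequence ï¿½ is what you see when the replacement character U+FFFD is encoded in UTF-8 and the result is read as latin1.

What likely happened: the original question, stored in a latin1 text file, had a 4-character string with one non-ASCII character between the 1 and the 2. A program that decoded the file as UTF-8 could not decode that byte and substituted U+FFFD. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.

Then some third program comes along, reads the file in latin1, and shows $1ï¿½2.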
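The chain of events described here can be sketched as follows. The starting byte 0xA2 (¢ in latin1) is a hypothetical stand-in for whatever non-ASCII character the original question used, and the U+FFFD substitution in step 2 is written out by hand rather than produced by a real decoder. The sketch assumes the mbstring extension is loaded.

```php
<?php
// Step 1 (hypothetical): the original 4-character string, as latin1 bytes.
// 0xA2 is ¢ in latin1; any non-ASCII character would do.
$latin1Original = "\$1\xA22";
var_dump(mb_check_encoding($latin1Original, 'UTF-8')); // bool(false): a lone 0xA2 is not valid UTF-8

// Step 2: a UTF-8 decoder substitutes U+FFFD for the undecodable byte,
// and the text is written back out as UTF-8 (bytes EF BF BD).
$rewritten = "\$1\xEF\xBF\xBD2";

// Step 3: another program reads those bytes as latin1. Converting
// "from latin1 to UTF-8" simulates that reading.
$misread = mb_convert_encoding($rewritten, 'UTF-8', 'ISO-8859-1');

echo $misread, "\n";                     // $1ï¿½2
echo mb_strlen($misread, 'UTF-8'), "\n"; // 6 characters
echo strlen($misread), "\n";             // 9 bytes once stored as UTF-8
echo strlen($rewritten), "\n";           // 6 bytes in the file after step 2
```

Under these assumptions, the file produced in step 2 is 6 bytes long, while the misread text re-saved as UTF-8 is 9 bytes, which matches the two different counts reported in this thread.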



Answer 5:

No.

I'll use a proof by contradiction.

strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.

UTF-8 encoding needs at least 1 byte per character.

We have established that:

  1. there are 4 bytes
  2. a character is represented by no less than 1 byte

...yet we have 6 characters, which is a contradiction. So, no.

However, what's not totally clear is which character set the displaying software (e.g., the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.



Answer 6:

Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (and how you can have so many characters in a single set).
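For a quick look at the variable width, the sketch below prints the UTF-8 byte count of four sample characters. The sample characters are arbitrary picks for illustration, written as code point escapes (PHP 7+) so the script's own file encoding doesn't matter.

```php
<?php
// One character each, written as code point escapes (PHP 7+).
$samples = [
    'A'         => "A",          // U+0041, ASCII
    'e-acute'   => "\u{00E9}",   // U+00E9
    'euro sign' => "\u{20AC}",   // U+20AC
    'emoji'     => "\u{1F600}",  // U+1F600
];

foreach ($samples as $name => $char) {
    // strlen() reports the number of bytes in the UTF-8 encoding.
    printf("%-10s %d byte(s)\n", $name, strlen($char));
}
// Byte counts, in order: 1, 2, 3, 4
```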

Try mb_strlen() instead.


