Declaration to make PHP script completely Unicode-friendly


Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to g


2 Answers

无人共我

2020-12-16 19:32

That full-Unicode support was precisely the idea behind PHP 6, which was canceled more than a year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
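
To make the bytes-versus-characters point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, function overloading is off, and PHP 7 or later (for the `\u{...}` escape); the sample string is only an illustration.

```php
<?php
// "héllo": the é (U+00E9) takes two bytes in UTF-8.
$s = "h\u{E9}llo";

$byteCount = strlen($s);              // counts bytes
$charCount = mb_strlen($s, 'UTF-8');  // counts code points

// substr() cuts by bytes and can split a character in half;
// mb_substr() cuts by characters.
$byteCut = substr($s, 0, 2);
$charCut = mb_substr($s, 0, 2, 'UTF-8');

echo $byteCount, "\n";          // 6
echo $charCount, "\n";          // 5
echo bin2hex($byteCut), "\n";   // 68c3 (an incomplete UTF-8 sequence)
echo bin2hex($charCut), "\n";   // 68c3a9
```

So "the right function" depends on whether you are counting or cutting storage bytes or user-visible characters.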

One thing that might help with your fourth point, though, is the function overloading feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example, mb_substr() is called instead of substr() if function overloading is enabled.
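
One caveat, as far as I know: the `mbstring.func_overload` setting was deprecated in PHP 7.2 and removed in PHP 8.0, so this only applies to older versions. Below is a rough sketch of how code could detect the setting and still get a byte count that does not depend on it. The helper name `byteLength` is made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Bit 2 of mbstring.func_overload covers the str*() functions.
// On PHP 8+ the setting no longer exists and ini_get() returns false,
// which casts to 0.
$overload = (int) ini_get('mbstring.func_overload');
$stringFunctionsOverloaded = ($overload & 2) !== 0;

// Hypothetical helper: the '8bit' encoding treats every byte as one
// character, so this returns the byte length with or without overloading.
function byteLength(string $data): int
{
    return mb_strlen($data, '8bit');
}

echo byteLength("\u{20AC}"), "\n"; // 3 (the euro sign is three bytes in UTF-8)
```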

北恋

2020-12-16 19:34

"All byte functions (e.g., strlen, strstr, strpos, and substr) work like the corresponding character functions (e.g., mb_strlen, mb_strstr, mb_strpos, and mb_substr)."

This isn't a good idea.

Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.

For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if strlen is suddenly using codepoint semantics: the header would report a character count where the client expects a byte count.
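
A small sketch of that header case, assuming the mbstring extension is loaded. The helper name `contentLengthHeader` and the four-byte stand-in for the image data are made up for illustration.

```php
<?php
// Hypothetical helper: Content-Length is a count of bytes (octets),
// so it needs the byte-oriented strlen().
function contentLengthHeader(string $body): string
{
    return 'Content-Length: ' . strlen($body);
}

// Stand-in for binary data; these four bytes happen to be valid UTF-8 ("éé").
$blob = "\xC3\xA9\xC3\xA9";

echo contentLengthHeader($blob), "\n";   // Content-Length: 4
echo mb_strlen($blob, 'UTF-8'), "\n";    // 2, which would under-report the body size
```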

You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.

This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.

(*: or UTF-16 code unit semantics if unlucky)
