multibyte | 易学教程

How does UTF-8 “variable-width encoding” work?

Question: The Unicode standard has enough code points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding". In fact, it manages to represent the first 128 characters of US-ASCII in just one byte which looks exactly like real ASCII, so you can interpret lots of ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it
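
As a rough illustration of what the question describes, the Python sketch below encodes a few sample characters and reports how many bytes UTF-8 uses for each. The sample characters and the helper name `utf8_width` are chosen here for illustration and do not come from the original question.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# Illustrative samples: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "é", "€", "😀"]
widths = {ch: utf8_width(ch) for ch in samples}

for ch, n in widths.items():
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex(' ')}")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_matches = "plain ascii".encode("utf-8") == "plain ascii".encode("ascii")
```

The byte count grows from one to four as the code point gets larger, while ASCII text stays unchanged, which is the property the question calls a "neat trick".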

str_replace() on multibyte strings dangerous?

Question: Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do? $string = str_replace('"', '\\"', $string); In particular, if the input was in a character set that might have a valid character like 0xbf5c, an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unquoted double quote ("). Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place?
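
The byte sequence in the question can be traced with a short Python sketch. It assumes GBK as the character set, which the question does not name, and uses a byte-wise `replace` as a stand-in for the PHP `str_replace()` call; it only demonstrates the concern and is not a mitigation.

```python
# Assumption: GBK stands in for the unnamed character set in the question.
payload = b"\xbf\x22"  # attacker-supplied bytes: 0xbf followed by a double quote

# Byte-wise escaping, analogous to the str_replace() call in the question.
escaped = payload.replace(b'"', b'\\"')
print(escaped.hex(" "))  # bf 5c 22

# A GBK-aware consumer reads 0xbf5c as one character, so the quote is left bare.
decoded = escaped.decode("gbk")
quote_is_unescaped = decoded.endswith('"') and "\\" not in decoded
print(len(decoded), quote_is_unescaped)
```

After decoding, the inserted backslash has been absorbed into the preceding two-byte character, so the trailing double quote is no longer escaped.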

Are the PHP preg_functions multibyte safe?

Question: There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_functions are all mb safe? Couldn't find any mention in the PHP documentation. Answer 1: PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0: The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode

Printing UTF-8 strings with printf - wide vs. multibyte string literals

Question: In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them? printf("ο Δικαιοπολις εν αγρω εστιν\n"); printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n"); And consequently is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal? EDIT:

Truncate a multibyte String to n chars

Question: I am trying to get this method in a String Filter working: public function truncate($string, $chars = 50, $terminator = ' …'); I'd expect this $in = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890"; $out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …"; and also this $in = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ"; $out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …"; That is $chars minus the chars of the $terminator string. In
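
The expected behaviour can be sketched in Python, where strings are indexed by code point rather than by byte. This is a hedged illustration of the intended result, not the PHP solution from the original thread; the function mirrors the signature in the question, and the early return for short strings is an assumption. Combining character sequences would still count as several characters here.

```python
def truncate(string: str, chars: int = 50, terminator: str = " …") -> str:
    """Cut `string` so the result, including `terminator`, has `chars` characters."""
    if len(string) <= chars:
        # Assumption: strings that already fit are returned unchanged.
        return string
    # Reserve room for the terminator, never going below zero.
    keep = max(chars - len(terminator), 0)
    return string[:keep] + terminator

ascii_in = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890"
# The accented example from the question spans U+00E2 through U+011D.
accented_in = "".join(chr(cp) for cp in range(0xE2, 0x11E))

ascii_out = truncate(ascii_in)
accented_out = truncate(accented_in)
print(ascii_out)
print(accented_out)
```

Both results are 50 characters long: 48 characters of input followed by the two-character terminator, matching the `$out` values given in the question.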