multibyte | 易学教程

Issue with utf-8 encoding using PHP + MySQL

Question: I moved data from MySQL 4 (originally set to latin2 encoding) to MySQL 5 and set the encoding to utf-8. It looks good in phpMyAdmin, and utf-8 is okay. However, there are question marks instead of some characters on the website! The website encoding is also set to utf8, so I don't understand where the problem is. PHP and HTML files are also set to utf8. I have no idea...
Answer 1: Try running the query SET NAMES utf8 before any other query in your application.
Answer 2: On my server, adding these to my php file had no

Multi-byte safe wordwrap() function for UTF-8

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8. There are a few examples of mb-safe functions in the comments, but with some different test data they all seem to have some problems. The function should take the exact same parameters as wordwrap(). Specifically, be sure it: cuts mid-word if $cut = true and doesn't cut mid-word otherwise; does not insert extra spaces in words if $break = ' '; also works for $break = "\n"; works for ASCII and all valid UTF-8. I haven't found any code that works for me. Here is what I've written. For me it is working, though it is

Multibyte trim in PHP?

Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one of my own. I recently found this regex in a comment on php.net:
    /(^\s+)|(\s+$)/u
So, I'd implement it in the following way:
    function multibyte_trim($str) {
        if (!function_exists("mb_trim") || !extension_loaded("mbstring")) {
            return preg_replace("/(^\s+)|(\s+$)/u", "", $str);
        } else {
            return mb_trim($str);
        }
    }
The regex seems correct to me, but I'm a complete beginner with regular expressions. Will this effectively remove any Unicode space at the beginning/end of a string?
deceze: The standard trim function trims a
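The same idea (strip Unicode whitespace from both ends with a regex) can be sketched in Python, used here only because the entries on this page mix several languages. `unicode_trim` is a hypothetical name, and Python's `\s` on `str` patterns follows Python's own whitespace rules, which may not match what PCRE's `\s` covers under the `/u` modifier.

```python
import re

# \A and \Z anchor to the absolute start and end of the string.
# On str patterns, \s matches Unicode whitespace such as U+00A0 and U+3000.
_EDGE_WHITESPACE = re.compile(r"\A\s+|\s+\Z")


def unicode_trim(text: str) -> str:
    """Remove leading and trailing whitespace (hypothetical helper, illustration only)."""
    return _EDGE_WHITESPACE.sub("", text)


sample = "\u3000\u00a0 héllo wörld \u2003\n"
print(repr(unicode_trim(sample)))
```

Whitespace inside the string is left alone; only the runs at the two edges are removed.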

Invalid URI with Chinese characters (Java)

Question: I'm having trouble setting up a URL connection with Chinese characters in the URL. It works with Latin characters:
    String xstr = "维也纳恩斯特哈佩尔球场";
    URI uri = new URI("http", "ajax.googleapis.com", "/ajax/services/language/detect", "v=1.0&q=" + xstr, null);
    URL url = uri.toURL();
    URLConnection connection = url.openConnection();
    InputStream is = connection.getInputStream();
The getInputStream() call results in: java.lang.IllegalArgumentException: Invalid uri 'http://ajax.googleapis.com/ajax/services
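One general technique for non-ASCII query values is to percent-encode them as UTF-8 before building the URL, so the final string is pure ASCII. The sketch below shows that in Python rather than Java; it reuses the host, path and parameters from the question and makes no network request. Whether this resolves the exception above is an assumption, since the excerpt is cut off before any diagnosis.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

text = "维也纳恩斯特哈佩尔球场"

# urlencode() percent-encodes each value as UTF-8 by default.
query = urlencode({"v": "1.0", "q": text})
url = "http://ajax.googleapis.com/ajax/services/language/detect?" + query

print(url.isascii())  # every Chinese character is now a run of %XX escapes

# Decoding the query string recovers the original text unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0] == text)
```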

How does UTF-8 “variable-width encoding” work?

The Unicode standard has enough code points in it that you need 4 bytes to store them all at a fixed width. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding". In fact, it manages to represent the 128 characters of US-ASCII in just one byte each, which looks exactly like real ASCII, so you can interpret lots of ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it work? I'm going to ask and answer my own question here because I just did a bit of reading to figure it
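A rough illustration of the variable-width layout, written in Python: the hand-rolled encoder below picks a 1-, 2-, 3- or 4-byte form depending on how many bits the code point needs, and is checked against Python's built-in encoder. `utf8_encode_code_point` is a hypothetical helper written for this sketch, not a library function.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand (illustration only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # up to 7 bits:  0xxxxxxx (byte-identical to ASCII)
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ["A", "é", "€", "😀"]:
    encoded = utf8_encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(ch, len(encoded), encoded.hex())
```

The leading byte announces the sequence length through its high bits, and every continuation byte starts with the bits 10, which is why a plain ASCII byte can never appear inside a multi-byte sequence.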

str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do? $string = str_replace('"', '\\"', $string); In particular, if the input is in a character set that has a valid character such as 0xbf5c, an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unquoted double quote ("). Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place? (In my case, the string is going into the value attribute of an HTML input tag: echo '<input type="text" value="' .
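The byte sequence described in the question can be reproduced in a few lines of Python. The sketch assumes the legacy character set is GBK (the question does not name one), where 0xBF5C decodes as a single two-byte character, and it mimics the byte-wise replacement with `bytes.replace` rather than PHP's str_replace().

```python
# Assumption: the data is later interpreted as GBK, which the question does not state.
payload = b"\xbf\x22"                     # 0xBF followed by an ASCII double quote
escaped = payload.replace(b'"', b'\\"')   # byte-wise escaping inserts 0x5C before 0x22
print(escaped.hex())                      # bf5c22

as_gbk = escaped.decode("gbk")
# GBK reads 0xBF 0x5C as one character, so the backslash is swallowed
# and the double quote that follows is left unescaped.
print(len(as_gbk), as_gbk.endswith('"'), "\\" in as_gbk)

# UTF-8 does not have this problem: every byte of a multi-byte sequence is
# >= 0x80, so an ASCII byte such as 0x22 or 0x5C never occurs inside one.
utf8_bytes = 'naïve "quote" 维也纳'.encode("utf-8")
escaped_utf8 = utf8_bytes.replace(b'"', b'\\"')
print(escaped_utf8.decode("utf-8"))
```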

multibyte strtr() -> mb_strtr()

Question: Has anyone written a multibyte variant of the strtr() function? I need one. Edit 1 (example of desired usage):
    $from = 'ľľščťžýáíŕďňäô'; // these chars are in UTF-8
    $to = 'llsctzyairdnao';
    // input - in UTF-8
    $str = 'Kŕdeľ ďatľov učí koňa žrať kôru.';
    $str = mb_strtr( $str, $from, $to );
    // output - str without diacritic
    // $str = 'Krdel datlov uci kona zrat koru.';
Answer 1: I believe strtr is multi-byte safe; either way, since str_replace is multi-byte safe, you could wrap it:
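The behaviour the question asks for (a one-to-one mapping over characters rather than bytes) can be sketched in Python with `str.maketrans` and `str.translate`. This illustrates the desired result and is not a port of any PHP answer; `translate_chars` is a hypothetical name, and the inputs are normalized to NFC on the assumption that the accented letters should be treated as single precomposed code points.

```python
import unicodedata


def translate_chars(text: str, source: str, target: str) -> str:
    """Map each character of `source` to the character at the same index in `target`."""
    # Assumption: compare precomposed (NFC) forms so each accented letter is one code point.
    text, source, target = (unicodedata.normalize("NFC", s) for s in (text, source, target))
    if len(source) != len(target):
        raise ValueError("source and target must contain the same number of characters")
    return text.translate(str.maketrans(source, target))


result = translate_chars(
    "Kŕdeľ ďatľov učí koňa žrať kôru.",
    "ľščťžýáíŕďňäô",
    "lsctzyairdnao",
)
print(result)
```

Because Python strings are sequences of code points, the table is keyed by character, so a two-byte letter like ľ is never split into its UTF-8 bytes.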

Are the PHP preg_functions multibyte safe?

There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_ functions are all mb safe? I couldn't find any mention in the PHP documentation.
outis: PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0: "The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode

Printing UTF-8 strings with printf - wide vs. multibyte string literals

In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them?
    printf("ο Δικαιοπολις εν αγρω εστιν\n");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n");
And consequently, is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal? EDIT: There are no issues with these strings printing. But I'm not using the wide string functions, because I want to