multibyte | 易学教程

PHP mb_substr() not working correctly?

Question: This code

    print mb_substr('éxxx', 0, 1);

prints an empty space. It is supposed to print the first character, é. This seems to work, however:

    print mb_substr('éxxx', 0, 2);

But that is not right, because (0, 2) means 2 characters...

Answer: Try passing the encoding parameter to mb_substr:

    print mb_substr('éxxx', 0, 1, 'utf-8');

The encoding is never detected automatically. In practice I've found that, on some systems, the multi-byte functions default to ISO-8859-1 for the internal encoding, which effectively ruins their ability to handle multi-byte text. Setting a good default will probably fix this and
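
The two options mentioned in the answer can be sketched as follows. This is a minimal illustration, assuming the input really is UTF-8 and the mbstring extension is loaded; the variable names are made up for the example, and é is written as the escape "\u{00E9}" (PHP 7+) so that the result does not depend on the encoding of the script file itself.

```php
<?php
// "\u{00E9}" is é encoded as UTF-8 (two bytes: 0xC3 0xA9).
$text = "\u{00E9}xxx";

// Option 1: pass the encoding explicitly on every call.
$first = mb_substr($text, 0, 1, 'UTF-8');

// Option 2: set the internal encoding once, then omit the argument.
mb_internal_encoding('UTF-8');
$firstByDefault = mb_substr($text, 0, 1);

// For comparison: byte-oriented substr() returns only the first byte of é,
// which is not a valid UTF-8 string on its own.
$firstByte = substr($text, 0, 1);

echo $first, "\n";
echo $firstByDefault, "\n";
echo strlen($firstByte), "\n";
```

Passing the encoding per call keeps the code independent of global configuration; setting it once with mb_internal_encoding() is more convenient when every string in the application uses the same encoding.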

Does multibyte character interfere with end-line character within a regex?

Question: With this regex:

    regex1 = /\z/

the following strings match:

    "hello" =~ regex1 # => 5
    "こんにちは" =~ regex1 # => 5

But with these regexes:

    regex2 = /#$/?\z/
    regex3 = /\n?\z/

they behave differently:

    "hello" =~ regex2 # => 5
    "hello" =~ regex3 # => 5
    "こんにちは" =~ regex2 # => nil
    "こんにちは" =~ regex3 # => nil

What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?

Answer: The problem you reported is definitely a bug in the Regexp of RUBY_VERSION #=> "2.0.0", but it already existed in the earlier 1.9 when the encoding allow multi

How to check if the word is Japanese or English using PHP

Question: I want to process English words and Japanese words differently in this function:

    function process_word($word) {
        if ($word is english) {
            /////////
        } else if ($word is japanese) {
            ////////
        }
    }

Thank you.

Answer (Alix Axel): A quick solution that doesn't need the mbstring extension:

    if (strlen($str) != strlen(utf8_decode($str))) {
        // $str uses multi-byte chars (isn't English)
    } else {
        // $str is ASCII (probably English)
    }

Or a modification of the solution provided by @Alexander Konstantinov:

    function isKanji($str) { return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0; }
    function isHiragana($str) {
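
As a rough sketch of the same idea, the check can also be written with PCRE's Unicode script properties instead of hand-written code-point ranges. The function name classifyWord and the three return values are hypothetical and do not come from the answers above. Note the assumptions: the input is valid UTF-8, \p{Han} also matches Chinese text (kanji alone cannot distinguish the two languages), and "english" here only means "printable ASCII".

```php
<?php
// Hypothetical helper: classify a word by the scripts it contains.
function classifyWord(string $word): string
{
    // Any hiragana, katakana, or Han (kanji) character counts as Japanese.
    // Assumption: Chinese input is not expected, since \p{Han} matches it too.
    if (preg_match('/[\p{Hiragana}\p{Katakana}\p{Han}]/u', $word) === 1) {
        return 'japanese';
    }

    // Only printable ASCII characters: treated as (probably) English.
    if (preg_match('/^[\x20-\x7E]+$/', $word) === 1) {
        return 'english';
    }

    // Everything else, e.g. accented Latin or Cyrillic text.
    return 'other';
}

echo classifyWord('hello'), "\n";
echo classifyWord("\u{3053}\u{3093}\u{306B}\u{3061}\u{306F}"), "\n"; // こんにちは
```

Unlike the strlen(utf8_decode(...)) comparison, this does not classify every non-ASCII word as non-English without looking at the script; utf8_decode() is also deprecated as of PHP 8.2.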

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

Question: I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using? (I use both .NET and C/C++, and I need this answer for both Unix and Windows.)

Answer: Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT 20140523: Also, watch Characters, Symbols and the

PHP mb_ereg_replace not replacing while preg_replace works as intended

Question: I am trying to replace all non-word characters in a string with an empty string, except for spaces, and then collapse multiple spaces into a single space. The following code does this:

    $cleanedString = preg_replace('/[^\w]/', ' ', $name);
    $cleanedString = preg_replace('/\s+/', ' ', $cleanedString);

But when I try to use mb_ereg_replace, nothing happens:

    $cleanedString = mb_ereg_replace('/[^\w]/', ' ', $name);
    $cleanedString = mb_ereg_replace('/\s+/', ' ', $cleanedString);

$cleanedString is the same as $name in the above case. What am I doing wrong?

Answer (Artefacto): mb_ereg_replace doesn't
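
The answer is cut off here, but one well-known difference between the two function families explains the symptom: mb_ereg_replace() takes a bare pattern, without the /.../ delimiters that preg_replace() requires, so the slashes in the question's patterns are matched as literal characters. A minimal sketch, assuming UTF-8 input and an mbstring build with multibyte regex support; cleanName is a hypothetical helper name:

```php
<?php
// Hypothetical helper; the two replacements mirror the question's code.
function cleanName(string $name): string
{
    mb_regex_encoding('UTF-8'); // assumption: the input is UTF-8

    // No delimiters: the pattern is just the expression itself.
    $cleaned = mb_ereg_replace('[^\w]', ' ', $name);
    return mb_ereg_replace('\s+', ' ', $cleaned);
}

$name = 'Hello,   World!';

// With preg-style delimiters the slashes are part of the pattern, so this
// only matches a non-word character that sits between two literal "/".
$unchanged = mb_ereg_replace('/[^\w]/', ' ', $name);

echo cleanName($name), "\n";
echo $unchanged, "\n";
```

Alternatively, preg_replace() with the u modifier handles UTF-8 input as well, which avoids the mb_ereg family altogether.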

Merging two Regular Expressions to Truncate Words in Strings

Question: I'm trying to come up with the following function that truncates a string to whole words (if possible, otherwise it should truncate to chars):

    function Text_Truncate($string, $limit, $more = '...') {
        $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
        if (strlen(utf8_decode($string)) > $limit) {
            $string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)~su', '$1', $string);
            if (strlen(utf8_decode($string)) > $limit) {
                $string = preg_replace('~^(.{' . intval($limit) . '}).*
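
The excerpt breaks off mid-function, so the following is not a reconstruction of the original code. It is a hedged sketch of one way to get the described behaviour (whole words if possible, characters otherwise) using mb_strlen() and mb_substr() instead of the strlen(utf8_decode(...)) idiom, which is deprecated as of PHP 8.2. The name truncateText is hypothetical, the HTML entity decoding step is left out, and appending $more after a cut is an assumption about the intended result.

```php
<?php
// Hypothetical variant of the question's function, not the original implementation.
function truncateText(string $string, int $limit, string $more = '...'): string
{
    $string = trim($string);

    // Nothing to do if the text already fits (counted in characters, not bytes).
    if ($limit < 1 || mb_strlen($string, 'UTF-8') <= $limit) {
        return $limit < 1 ? $more : $string;
    }

    // Longest prefix of at most $limit characters that ends right before whitespace.
    if (preg_match('~^(.{1,' . $limit . '})(?=\s)~su', $string, $m) === 1) {
        return rtrim($m[1]) . $more;
    }

    // No word boundary within the limit: fall back to cutting at a character.
    return mb_substr($string, 0, $limit, 'UTF-8') . $more;
}

echo truncateText('The quick brown fox', 10), "\n";
echo truncateText('Supercalifragilistic', 5), "\n";
```

Using a single preg_match() with a lookahead, followed by a character-level fallback, keeps the two cases separate instead of chaining two preg_replace() calls.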

Convert unicode characters above 127 to decimal [duplicate]

Question: Possible Duplicate: How to convert text to unicode code point like \u0054\u0068\u0069\u0073 using php?

I'm trying to convert all characters that can't fit into a 7-bit ANSI character into an escaped form, \uN, where N is its decimal value. Here's what I've come up with:

    private static function escape($str) {
        return preg_replace_callback('~[\\x{007F}-\\x{FFFF}]~u', function($m) { return '\\u' . ord($m[0]); }, $str);
    }

I've tried it with
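
One likely problem with the code above is that ord() returns the value of the first byte only, not the code point of a multi-byte character. A minimal sketch of an alternative, assuming PHP 7.2 or later (for mb_ord()) and valid UTF-8 input; the function name escapeNonAscii and the choice to match everything outside 0x00-0x7F (rather than the question's 0x7F-0xFFFF range) are assumptions made for this example:

```php
<?php
// Hypothetical helper: replace every non-ASCII character with \uN,
// where N is the decimal value of its Unicode code point.
function escapeNonAscii(string $str): string
{
    return preg_replace_callback(
        '~[^\x00-\x7F]~u', // any code point above 127
        function (array $m): string {
            // mb_ord() decodes the whole character; ord() would only see its first byte.
            return '\\u' . mb_ord($m[0], 'UTF-8');
        },
        $str
    );
}

echo escapeNonAscii("caf\u{00E9}"), "\n";
echo escapeNonAscii("\u{65E5}\u{672C}"), "\n"; // 日本
```

For é (U+00E9) this yields \u233, whereas ord() on the same match would report 195, the first byte of its UTF-8 encoding.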

Why are PHP string functions not multi-byte safe by default?

Question: Why are the PHP multi-byte string functions (the ones which start with mb_) not used by default in PHP?

Answer 1: Backwards compatibility. Old PHP scripts depend on non-multibyte functionality. See also: http://www.php.net/manual/en/mbstring.overload.php

Answer 2: Because non-multibyte functions were there first.

Source: https://stackoverflow.com/questions/12716064/why-are-php-string-functions-not-multi-byte-safe-by-default
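
To make the difference concrete, here is a small comparison of the byte-oriented functions and their mb_ counterparts. It is only an illustration, assuming UTF-8 text and a loaded mbstring extension. The function overloading feature behind the manual link above was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions the mb_ functions have to be called explicitly.

```php
<?php
// "naïve" in UTF-8: 5 characters, 6 bytes (ï takes two bytes).
$word = "na\u{00EF}ve";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

$byteSlice = substr($word, 0, 3);             // cuts ï in half
$charSlice = mb_substr($word, 0, 3, 'UTF-8'); // first three characters

echo $bytes, ' bytes, ', $chars, " characters\n";
echo $charSlice, "\n";
var_dump(mb_check_encoding($byteSlice, 'UTF-8'));
```

Scripts written against the byte semantics (for example, code that measures binary data with strlen()) would break if these functions silently switched to counting characters, which is the compatibility concern named in the first answer.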