multibyte

PHP mb_substr() not working correctly?

天大地大妈咪最大 submitted on 2019-12-04 00:00:00
This code

    print mb_substr('éxxx', 0, 1);

prints an empty space :( It is supposed to print the first character, é. This seems to work, however:

    print mb_substr('éxxx', 0, 2);

But it's not right, because (0, 2) means 2 characters...

Try passing the encoding parameter to mb_substr, as such:

    print mb_substr('éxxx', 0, 1, 'utf-8');

The encoding is never detected automatically: when the parameter is omitted, mb_substr uses the internal encoding. In practice I've found that, in some systems, multi-byte functions default to ISO-8859-1 for internal encoding. That effectively ruins their ability to handle multi-byte text. Setting a good default will probably fix this and
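As a minimal sketch of the two options described above (an explicit encoding per call, or a process-wide default), assuming the mbstring extension is loaded and the input really is UTF-8. The variable names are illustrative only, and the string is written with byte escapes so the example does not depend on how the source file is saved:

```php
<?php
// "éxxx" spelled with explicit UTF-8 bytes for é (U+00E9).
$text = "\xC3\xA9xxx";

// Option 1: pass the encoding on every call.
$first = mb_substr($text, 0, 1, 'UTF-8');

// Option 2: set the internal encoding once; later calls that omit
// the encoding argument fall back to it.
mb_internal_encoding('UTF-8');
$firstDefault = mb_substr($text, 0, 1);

echo $first, PHP_EOL;
echo $firstDefault, PHP_EOL;
```

Passing the encoding explicitly is the more defensive choice in library code, since it does not depend on configuration set elsewhere.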

Does multibyte character interfere with end-line character within a regex?

ⅰ亾dé卋堺 submitted on 2019-12-03 01:51:44
With this regex:

    regex1 = /\z/

the following strings match:

    "hello" =~ regex1 # => 5
    "こんにちは" =~ regex1 # => 5

but with these regexes:

    regex2 = /#$/?\z/
    regex3 = /\n?\z/

they behave differently:

    "hello" =~ regex2 # => 5
    "hello" =~ regex3 # => 5
    "こんにちは" =~ regex2 # => nil
    "こんにちは" =~ regex3 # => nil

What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?

The problem you reported is definitely a bug in the Regexp of RUBY_VERSION #=> "2.0.0", but it already existed in the earlier 1.9 when the encoding allows multi

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

六月ゝ 毕业季﹏ submitted on 2019-12-03 01:15:48
Question: I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using? (I use both .NET and C/C++, and I need this answer for both Unix and Windows.)

Answer 1: Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know

How to check if the word is Japanese or English using PHP

爷,独闯天下 submitted on 2019-12-02 15:59:49
I want to process English words and Japanese words differently in this function:

    function process_word($word) {
        if ($word is english) {
            /////////
        } else if ($word is japanese) {
            ////////
        }
    }

Thank you.

Alix Axel

A quick solution that doesn't need the mbstring extension:

    if (strlen($str) != strlen(utf8_decode($str))) {
        // $str uses multi-byte chars (isn't English)
    } else {
        // $str is ASCII (probably English)
    }

Or a modification of the solution provided by @Alexander Konstantinov:

    function isKanji($str) {
        return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0;
    }

    function isHiragana($str) {
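One way to fill in the two branches of process_word() is with PCRE Unicode script properties instead of hand-written code point ranges. This is a hedged sketch, not the answer's own method: the helper names containsJapanese() and isAsciiOnly() and the returned labels are assumptions made for illustration. It also avoids utf8_decode(), which is deprecated as of PHP 8.2. Note that \p{Han} matches Chinese text as well, so a "japanese" result really means "contains CJK ideographs or kana".

```php
<?php
// Hypothetical helper: true if the string contains kanji, hiragana or katakana.
function containsJapanese(string $str): bool
{
    return preg_match('/[\p{Han}\p{Hiragana}\p{Katakana}]/u', $str) === 1;
}

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
function isAsciiOnly(string $str): bool
{
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

// Classify a word; the labels are placeholders for the real processing.
function process_word(string $word): string
{
    if (containsJapanese($word)) {
        return 'japanese';
    }
    if (isAsciiOnly($word)) {
        return 'english';
    }
    return 'other'; // e.g. accented Latin text, Cyrillic, ...
}
```

Checking the Japanese scripts first keeps mixed strings such as "PHPの本" in the Japanese branch; whether that is the desired behavior depends on the application.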

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

偶尔善良 submitted on 2019-12-02 14:30:26
I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using? (I use both .NET and C/C++, and I need this answer for both Unix and Windows.)

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT 20140523: Also, watch Characters, Symbols and the

PHP mb_ereg_replace not replacing while preg_replace works as intended

£可爱£侵袭症+ submitted on 2019-12-01 20:11:24
Question: I am trying to replace all non-word characters in a string, except for spaces, and then collapse multiple spaces into a single space. The following code does this:

    $cleanedString = preg_replace('/[^\w]/', ' ', $name);
    $cleanedString = preg_replace('/\s+/', ' ', $cleanedString);

But when I try to use mb_ereg_replace, nothing happens:

    $cleanedString = mb_ereg_replace('/[^\w]/', ' ', $name);
    $cleanedString = mb_ereg_replace('/\s+/', ' ', $cleanedString);

$cleanedString is

PHP mb_ereg_replace not replacing while preg_replace works as intended

 ̄綄美尐妖づ submitted on 2019-12-01 18:05:16
I am trying to replace all non-word characters in a string, except for spaces, and then collapse multiple spaces into a single space. The following code does this:

    $cleanedString = preg_replace('/[^\w]/', ' ', $name);
    $cleanedString = preg_replace('/\s+/', ' ', $cleanedString);

But when I try to use mb_ereg_replace, nothing happens:

    $cleanedString = mb_ereg_replace('/[^\w]/', ' ', $name);
    $cleanedString = mb_ereg_replace('/\s+/', ' ', $cleanedString);

$cleanedString is the same as $name in the above case. What am I doing wrong?

Artefacto

mb_ereg_replace doesn't
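The answer is cut off here, but one documented difference is that mb_ereg_* patterns are written without delimiters, so the slashes in '/[^\w]/' are treated as literal characters to match. The sketch below illustrates that difference under the assumption that the mbstring extension is loaded; the sample input and variable names are invented for the example:

```php
<?php
mb_regex_encoding('UTF-8');

$name = "foo,  bar!!baz";

// With delimiters: the pattern only matches "/" + non-word char + "/",
// which never occurs in $name, so the string comes back unchanged.
$unchanged = mb_ereg_replace('/[^\w]/', ' ', $name);

// Without delimiters: behaves like the preg_replace version.
$cleaned = mb_ereg_replace('[^\w]', ' ', $name);
$cleaned = mb_ereg_replace('\s+', ' ', $cleaned);

echo $unchanged, PHP_EOL;
echo $cleaned, PHP_EOL;
```

If the input is UTF-8, preg_replace with the /u modifier is another option and avoids the separate mb_regex_encoding() setting.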

Merging two Regular Expressions to Truncate Words in Strings

混江龙づ霸主 submitted on 2019-12-01 16:59:44
Question: I'm trying to write a function that truncates a string to whole words (if possible; otherwise it should truncate to characters):

    function Text_Truncate($string, $limit, $more = '...') {
        $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
        if (strlen(utf8_decode($string)) > $limit) {
            $string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)~su', '$1', $string);
            if (strlen(utf8_decode($string)) > $limit) {
                $string = preg_replace('~^(.{' . intval($limit) . '}).*
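Since the question's code is truncated, the following is only a rough sketch of the stated goal (cut at a word boundary when possible, otherwise cut at a character boundary), not a reconstruction of the asker's final function. The name truncateToWords() is hypothetical, the HTML entity decoding step is left out, and mb_strlen()/mb_substr() stand in for the strlen(utf8_decode()) idiom:

```php
<?php
// Hypothetical helper: truncate UTF-8 text to at most $limit characters,
// preferring to stop at whitespace.
function truncateToWords(string $text, int $limit, string $more = '...'): string
{
    $text  = trim($text);
    $limit = max(1, $limit);

    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text; // already short enough
    }

    // Longest prefix of 1..$limit characters that is followed by
    // whitespace or the end of the string.
    if (preg_match('~^(.{1,' . $limit . '})(?=\s|$)~su', $text, $m) === 1) {
        return rtrim($m[1]) . $more;
    }

    // The first word alone exceeds the limit: fall back to a hard cut.
    return mb_substr($text, 0, $limit, 'UTF-8') . $more;
}

echo truncateToWords('The quick brown fox', 10), PHP_EOL;
```

A single lookahead handles both the "followed by whitespace" and "end of string" cases, which is one way to avoid running two separate replacements.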

Convert unicode characters above 127 to decimal [duplicate]

一个人想着一个人 submitted on 2019-12-01 12:44:59
Question: Possible Duplicate: How to convert text to unicode code point like \u0054\u0068\u0069\u0073 using php?

I'm trying to convert all characters that can't fit into a 7-bit ASCII character into an escaped form, \uN, where N is its decimal value. Here's what I've come up with:

    private static function escape($str) {
        return preg_replace_callback('~[\\x{007F}-\\x{FFFF}]~u', function($m) {
            return '\\u' . ord($m[0]);
        }, $str);
    }

I've tried it with
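One thing worth noting about the snippet above is that ord() returns only the value of the first byte of its argument, so it cannot yield the code point of a multi-byte character. A possible variant, offered as a sketch rather than as the accepted answer, uses mb_ord() (available from PHP 7.2 with mbstring). The function name escapeNonAscii() is an assumption, and the character class is widened to "anything outside 0x00-0x7F" so characters beyond U+FFFF are covered too:

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with \uN, where N is the decimal Unicode code point.
function escapeNonAscii(string $str): string
{
    return preg_replace_callback(
        '~[^\x00-\x7F]~u',
        static function (array $m): string {
            // mb_ord() decodes the whole multi-byte sequence.
            return '\\u' . mb_ord($m[0], 'UTF-8');
        },
        $str
    );
}

echo escapeNonAscii("caf\xC3\xA9"), PHP_EOL;
```

On PHP versions without mb_ord(), the code point would have to be decoded some other way, for example by converting the character to UCS-4 and unpacking it.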

Why are PHP string functions not multi-byte safe by default?

点点圈 submitted on 2019-12-01 07:31:27
Question: Why are the PHP multi-byte string functions (the ones which start with mb_) not used by default in PHP?

Answer 1: Backwards compatibility. Old PHP scripts depend on non-multibyte functionality. See also: http://www.php.net/manual/en/mbstring.overload.php

Answer 2: Because non-multibyte functions were there first.

Source: https://stackoverflow.com/questions/12716064/why-are-php-string-functions-not-multi-byte-safe-by-default
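To make the compatibility concern concrete, here is a small illustration of how the byte-oriented functions and their mb_ counterparts disagree on the same UTF-8 string; code that relies on the byte semantics would break if the defaults changed. The variable names are illustrative, and the mbstring extension is assumed to be loaded. The function overloading feature linked above was deprecated in PHP 7.2 and removed in PHP 8.0, so calling the mb_ functions explicitly is the remaining route.

```php
<?php
// "naïve" with ï written as explicit UTF-8 bytes (U+00EF).
$word = "na\xC3\xAFve";

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

$byteCut = substr($word, 0, 3);              // cuts ï in half
$charCut = mb_substr($word, 0, 3, 'UTF-8');  // keeps ï intact

echo $byteLength, ' bytes, ', $charLength, ' characters', PHP_EOL;
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump(mb_check_encoding($charCut, 'UTF-8'));
```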