multibyte | 易学教程

PHP Multibyte String Functions

Question: Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. This was because one parameter was encoded in UTF-8, but the other (which originates from an HTTP GET parameter) obviously was not. I have now noticed that using the mb_strpos function solved my problem. My question is: is it wise to use the PHP multibyte string functions in general to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc.,
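
The mismatch described in this question can be reproduced outside PHP. The following is a minimal Python sketch, not the asker's code: the sample word, the two encodings, and the helper name find_text are assumptions chosen for illustration. It shows a raw byte search failing when the two inputs use different encodings, and succeeding once both are decoded to text first.

```python
def find_text(haystack: bytes, haystack_enc: str, needle: bytes, needle_enc: str) -> int:
    """Decode both inputs to str, then return the character offset (or -1)."""
    return haystack.decode(haystack_enc).find(needle.decode(needle_enc))


# Hypothetical inputs: the same letter U+00E9 arrives in two different encodings.
haystack = "\u00e9cole".encode("latin-1")   # b'\xe9cole'
needle = "\u00e9".encode("utf-8")           # b'\xc3\xa9'

byte_offset = haystack.find(needle)                            # raw byte search
char_offset = find_text(haystack, "latin-1", needle, "utf-8")  # search on decoded text

print(byte_offset)  # -1: the byte sequences differ, so nothing is found
print(char_offset)  # 0: after decoding, the match is at the first character
```

The point of the sketch is that the fix lies in agreeing on one encoding before searching; which search function is used matters less.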

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

Question: So matz made the decision to keep upcase and downcase limited to /[A-Z]/i in Ruby 1.9.1. ActiveSupport::Multibyte has long had great i18n case jiggering in Ruby 1.8.x via String#mb_chars. However, when tried under Ruby 1.9.1, it doesn't seem to work. Here's a simple test script I wrote, along with the output I'm getting:

    $ cat test.rb
    # encoding: UTF-8
    puts("@ #{RUBY_VERSION} " + (__ENCODING__ rescue $KCODE).to_s)
    sd, su = "Iñtërnâtiônàlizætiøn", "IÑTËRNÂTIÔNÀLIZÆTIØN"
    def ps(u, d, k); puts
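
For comparison only, here is a small Python sketch of the two behaviours the question contrasts: case mapping restricted to ASCII letters versus Unicode-aware case mapping. It is not a Ruby fix. The helper ascii_upcase is a hypothetical stand-in for the ASCII-only behaviour, and the test strings are the ones from the question, normalized to NFC so that each accented letter is a single code point.

```python
import string
import unicodedata


def ascii_upcase(text: str) -> str:
    """Upper-case only a-z and leave every other character untouched."""
    return "".join(c.upper() if c in string.ascii_lowercase else c for c in text)


# Strings from the question, normalized so accented letters are single code points.
sd = unicodedata.normalize("NFC", "Iñtërnâtiônàlizætiøn")
su = unicodedata.normalize("NFC", "IÑTËRNÂTIÔNÀLIZÆTIØN")

print(ascii_upcase(sd))   # accented letters stay lower-case
print(sd.upper() == su)   # True: str.upper() applies Unicode case mapping
```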

What constitutes one character for regcomp? Which multibyte encoding determines this?

Question: regcomp (from glibc) is a POSIX function for compiling regular expressions.

    int regcomp(regex_t *restrict preg, const char *restrict pattern, int cflags);

Some constructs in regular expressions depend on the notion of a single character, for example [abc]. If a multibyte encoding is used and a multibyte letter appears in the expression, the interpretation differs depending on whether the expression is treated as a byte sequence or as a sequence of multibyte letters. Here I illustrate this
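
The byte-versus-character distinction raised here can be seen in miniature with Python's re module, which accepts either bytes patterns or str patterns. This sketch says nothing about how glibc's regcomp decides; it only illustrates the two possible readings of a bracket expression that contains one two-byte letter. The letter chosen (U+00E9) is an assumption for the example.

```python
import re

e_acute = "\u00e9"                   # one character, two bytes in UTF-8
encoded = e_acute.encode("utf-8")    # b'\xc3\xa9'

# Byte-oriented reading: the class has two members, the bytes 0xC3 and 0xA9.
byte_class = re.compile(b"[" + encoded + b"]")
# Character-oriented reading: the class has one member, the code point U+00E9.
char_class = re.compile("[" + e_acute + "]")

byte_full = byte_class.fullmatch(encoded)   # None: the class matches one byte only
byte_first = byte_class.match(encoded)      # matches just the first byte
char_full = char_class.fullmatch(e_acute)   # matches the whole letter

print(byte_full)
print(byte_first.group())
print(char_full.group())
```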

Check if file contains multibyte character

Question: I have some subtitle files in UTF-8. Sometimes these files contain sporadic multibyte characters that cause problems in some applications. How do I check on Linux whether a given file contains any multibyte characters (and possibly locate them)?

Answer 1: You can use the file command:

    chalet16$ echo test > a.txt
    chalet16$ echo testก > b.txt   # one of the Thai characters
    chalet16$ file *.txt
    a.txt: ASCII text
    b.txt: UTF-8 Unicode text

Answer 2: You can use the file or chardet command.

Source: https:/
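
The file command reports whether non-ASCII text is present but not where. A minimal Python sketch for locating such characters follows; the function name locate_non_ascii is hypothetical, and it assumes the input really is valid UTF-8 (decoding raises an error otherwise).

```python
def locate_non_ascii(data: bytes, encoding: str = "utf-8"):
    """Return (line, column, character) for every non-ASCII character, 1-based."""
    hits = []
    for line_no, line in enumerate(data.decode(encoding).splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:   # anything outside ASCII takes 2+ bytes in UTF-8
                hits.append((line_no, col, ch))
    return hits


print(locate_non_ascii("test\n".encode("utf-8")))         # []
print(locate_non_ascii("test\u0e01\n".encode("utf-8")))   # [(1, 5, 'ก')]
```

For a file on disk, the bytes could be read with open(path, "rb").read() and passed in unchanged.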

Converting accented characters in PostgreSQL?

Question: Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively. The closest thing I could find is the translate function, given the example in the comments section found here. Some commonly used accented characters can be searched using the following function: translate(search_terms, '\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215
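
As an illustration of the underlying idea rather than a PostgreSQL answer, here is a Python sketch that strips accents by Unicode decomposition. The function name strip_accents and the small extra mapping are assumptions. The mapping is needed because ø has no Unicode decomposition, so decomposition handles å but not ø.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
# Illustrative only, not an exhaustive table.
EXTRA = str.maketrans({"\u00f8": "o", "\u00d8": "O"})   # ø, Ø


def strip_accents(text: str) -> str:
    """Decompose characters, drop combining marks, then apply the extra mapping."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return without_marks.translate(EXTRA)


print(strip_accents("\u00e5\u00f8"))   # ao
```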

strip out multi-byte white space from a string PHP

Question: I am trying to use preg_replace to eliminate the Japanese full-width white space " 　 " from a string input, but I end up with a corrupted multi-byte string. I would prefer preg_replace over str_replace. Here is some sample code:

    $keywords = '　ラメ単色';
    $keywords = str_replace(array(' ', '　'), ' ', urldecode($keywords)); // outputs: 'ラメ単色'
    $keywords = preg_replace("@[ 　]@", ' ', urldecode($keywords)); // outputs: '�� 単色'

Does anyone have any idea why this is so and how to remedy this
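
The corruption shown above is what happens when a character class is applied byte by byte. The full-width space is the three bytes E3 80 80 in UTF-8, and the byte E3 also begins the katakana that follow. The Python sketch below reproduces the effect with a bytes pattern and avoids it with a str pattern. It is an analogy, not the asker's PHP; in PHP's PCRE functions the corresponding switch is the u pattern modifier.

```python
import re

text = "\u3000ラメ単色"        # leading U+3000 IDEOGRAPHIC SPACE, as in the question
data = text.encode("utf-8")

# Byte-wise class: every 0x20, 0xE3 and 0x80 byte is replaced, including the
# 0xE3 lead bytes of the katakana, which leaves invalid UTF-8 behind.
broken = re.sub(b"[ \xe3\x80\x80]", b" ", data)

# Character-wise class: only U+0020 and U+3000 are replaced.
clean = re.sub("[ \u3000]", " ", text)

print(broken.decode("utf-8", errors="replace"))   # contains U+FFFD replacement characters
print(clean)
```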

Does multibyte character interfere with end-line character within a regex?

Question: With this regex:

    regex1 = /\z/

the following strings match:

    "hello" =~ regex1 # => 5
    "こんにちは" =~ regex1 # => 5

but with these regexes:

    regex2 = /#$/?\z/
    regex3 = /\n?\z/

they show a difference:

    "hello" =~ regex2 # => 5
    "hello" =~ regex3 # => 5
    "こんにちは" =~ regex2 # => nil
    "こんにちは" =~ regex3 # => nil

What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?

Answer 1: The problem you reported is definitely a bug of

How to check if the word is Japanese or English using PHP

Question: I want to process English words and Japanese words differently in this function:

    function process_word($word) {
        if ($word is english) {
            /////////
        } else if ($word is japanese) {
            ////////
        }
    }

Thank you.

Answer 1: A quick solution that doesn't need the mb_string extension:

    if (strlen($str) != strlen(utf8_decode($str))) {
        // $str uses multi-byte chars (isn't English)
    } else {
        // $str is ASCII (probably English)
    }

Or a modification of the solution provided by @Alexander Konstantinov:

    function isKanji(
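
The byte-length comparison above only separates ASCII from everything else. A slightly more specific heuristic checks Unicode code point ranges. The Python sketch below is one way to do that; the function name classify_word, the choice of ranges, and the three-way result are assumptions. Because kanji share their block with Chinese characters, a word made only of ideographs is classified as Japanese here even if it is Chinese.

```python
JAPANESE_RANGES = (
    (0x3040, 0x309F),   # Hiragana
    (0x30A0, 0x30FF),   # Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs (kanji, shared with Chinese)
)


def classify_word(word: str) -> str:
    """Return 'japanese', 'english' or 'other' based on code point ranges."""
    if any(lo <= ord(ch) <= hi for ch in word for lo, hi in JAPANESE_RANGES):
        return "japanese"
    if word.isascii() and word.isalpha():
        return "english"
    return "other"


print(classify_word("hello"))        # english
print(classify_word("こんにちは"))    # japanese
```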

How to handle multibyte string in Python

Question: PHP has multibyte string functions for handling multibyte strings (e.g. CJK scripts). For example, I want to count the letters in a multibyte string using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in the string):

    japanese = "桜の花びらたち"
    print japanese
    print len(japanese)  # returns 21 instead of 7

Is there any package or function like mb_strlen in PHP?

Answer 1: Use Unicode strings:

    # Encoding: UTF-8
    japanese = u"桜の花びらたち"
    print japanese
    print len
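
The snippets above are Python 2. In Python 3 every str is already a Unicode string, so the two counts can be obtained side by side. This is a small sketch under that assumption. The NFC normalization step is an addition that guards against the voiced kana being stored as two code points.

```python
import unicodedata

# Normalize so that a voiced kana such as び is one code point, not base + mark.
japanese = unicodedata.normalize("NFC", "桜の花びらたち")

char_count = len(japanese)                    # code points
byte_count = len(japanese.encode("utf-8"))    # bytes in the UTF-8 encoding

print(char_count)   # 7
print(byte_count)   # 21
```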

How to detect and echo the last vowel in a word?

Question: $word = "Acrobat" (or Apple, Tea, etc.). How can I detect and echo the last vowel of a given word with PHP? I tried the preg_match function and googled for hours but couldn't find a proper solution. There can be multibyte letters like ü, ö in the string.

Answer 1: Here's a multibyte-safe version of catching the last vowel in a string.

    $arr = array( 'Apple','Tea','Strng','queue', 'asartä','nő','ağır','NOËL','gør','æsc' );
    /* these are the ones I found in character viewer in Mac so these vowels can be
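
One way to approach the same task in Python, as a sketch only and not a translation of the PHP answer: walk the word backwards and compare each letter's base character (after Unicode decomposition) against a vowel set. The function name last_vowel and the contents of VOWELS are assumptions; y is deliberately left out, and letters such as ø, æ and ı are listed explicitly because they do not decompose to a plain vowel.

```python
import unicodedata

VOWELS = set("aeiou\u00f8\u00e6\u0131")   # a e i o u plus ø, æ, ı


def last_vowel(word: str):
    """Return the last vowel of word as written, or None if there is none."""
    for ch in reversed(unicodedata.normalize("NFC", word)):
        # The base letter is the first code point of the decomposed character.
        base = unicodedata.normalize("NFD", ch)[0].lower()
        if base in VOWELS:
            return ch
    return None


for sample in ("Acrobat", "Apple", "Tea", "Strng"):
    print(sample, last_vowel(sample))
```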