multibyte

PHP Multibyte String Functions

做~自己de王妃 submitted on 2019-12-29 08:46:09
Question: Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. This was because one parameter was encoded in UTF-8, but the other (which originates from an HTTP GET parameter) obviously was not. I have now noticed that using the mb_strpos function solved my problem. My question is: Is it wise to use the PHP multibyte string functions generally to avoid these problems in the future? Should I avoid the traditional strpos , strlen , ereg , etc.,
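The mismatch described above can be reproduced outside PHP. The following is a minimal Python sketch, assuming the GET parameter arrived as Latin-1 (the excerpt does not say which encoding it actually used); the helper names `find_bytes` and `find_text` are hypothetical. It shows a byte-wise search failing when the two inputs use different encodings, and succeeding once each input is decoded with its own encoding.

```python
def find_bytes(haystack: bytes, needle: bytes) -> int:
    """Byte-wise search: returns -1 when the byte sequences differ."""
    return haystack.find(needle)


def find_text(haystack: bytes, haystack_encoding: str,
              needle: bytes, needle_encoding: str) -> int:
    """Decode each input with its own encoding, then search by character."""
    return haystack.decode(haystack_encoding).find(needle.decode(needle_encoding))


# Assumption for illustration: the haystack is UTF-8, the needle is Latin-1.
utf8_haystack = "über uns".encode("utf-8")
latin1_needle = "über".encode("latin-1")

if __name__ == "__main__":
    print(find_bytes(utf8_haystack, latin1_needle))
    print(find_text(utf8_haystack, "utf-8", latin1_needle, "latin-1"))
```

The design point is that the inputs are normalized to one representation before comparing; a multibyte-aware search function does not by itself reconcile two different encodings.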

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

大憨熊 submitted on 2019-12-27 17:07:55
Question: So matz made the decision to keep upcase and downcase limited to /[A-Z]/i in Ruby 1.9.1. ActiveSupport::Multibyte has long had great i18n case jiggering in Ruby 1.8.x via String#mb_chars . However, when tried under Ruby 1.9.1, it doesn't seem to work. Here's a simple test script I wrote, along with the output I'm getting: $ cat test.rb # encoding: UTF-8 puts("@ #{RUBY_VERSION} " + (__ENCODING__ rescue $KCODE).to_s) sd, su = "Iñtërnâtiônàlizætiøn", "IÑTËRNÂTIÔNÀLIZÆTIØN" def ps(u, d, k); puts

What does constitute one character for regcomp? Which multibyte encoding does determine this?

血红的双手。 submitted on 2019-12-25 07:51:02
Question: regcomp (from glibc) is a POSIX function for compiling regular expressions. int regcomp(regex_t *restrict preg, const char *restrict pattern, int cflags); Some constructions in regular expressions depend on the notion of a single character, for example [abc] . If a multibyte encoding is used and a multibyte letter appears in the expression, the interpretation differs depending on whether the pattern is treated as a byte sequence or as a sequence of multibyte letters. Here I illustrate this

Check if file contains multibyte character

旧街凉风 submitted on 2019-12-23 12:40:46
Question: I have some subtitle files in UTF-8. Sometimes these files contain sporadic multibyte characters, which cause problems in some applications. How do I check on Linux whether a given file contains any multibyte characters (and, if possible, locate them)?
Answer 1: You can use the file command:
    chalet16$ echo test > a.txt
    chalet16$ echo testก > b.txt   # one Thai character
    chalet16$ file *.txt
    a.txt: ASCII text
    b.txt: UTF-8 Unicode text
Answer 2: You can use the file or chardet command.
Source: https:/
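The file command reports whether a file contains non-ASCII text but not where. As a complement, here is a minimal Python sketch that lists the position of every character needing more than one byte in UTF-8. The function name `find_multibyte`, the `demo` helper, and the temporary `.srt` file are illustrative assumptions, not part of the original answers.

```python
import os
import tempfile


def find_multibyte(path, encoding="utf-8"):
    """Return (line_no, column, char) for each character outside ASCII."""
    hits = []
    with open(path, encoding=encoding) as handle:
        for line_no, line in enumerate(handle, start=1):
            for column, char in enumerate(line, start=1):
                # Code points above U+007F take two or more bytes in UTF-8.
                if ord(char) > 0x7F:
                    hits.append((line_no, column, char))
    return hits


def demo():
    """Write a throwaway file so the sketch runs without external input."""
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".srt", delete=False
    ) as tmp:
        tmp.write("plain line\ntestก\n")
        path = tmp.name
    try:
        return find_multibyte(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    for line_no, column, char in demo():
        print(f"line {line_no}, column {column}: {char!r} (U+{ord(char):04X})")
```

Columns here count characters, not bytes, which is usually what an editor shows.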

Converting accented characters in PostgreSQL?

谁说胖子不能爱 submitted on 2019-12-23 03:19:47
Question: Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively. The closest thing I could find is the translate function, given the example in the comments section found here. Some commonly used accented characters can be searched using the following function: translate(search_terms, '\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215

strip out multi-byte white space from a string PHP

微笑、不失礼 submitted on 2019-12-22 04:43:36
Question: I am trying to use preg_replace to eliminate the Japanese full-width white space "   " from a string input, but I end up with a corrupted multi-byte string. I would prefer preg_replace to str_replace. Here is some sample code: $keywords = ' ラメ単色'; $keywords = str_replace(array(' ', ' '), ' ', urldecode($keywords)); // outputs :'ラメ単色' $keywords = preg_replace("@[  ]@", ' ',urldecode($keywords)); // outputs :'�� ��単色' Does anyone have any idea as to why this is so and how to remedy this
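The excerpt is cut off before any answer, so the following is only one plausible explanation, sketched in Python rather than PHP. When a pattern is matched byte by byte, a character class containing the full-width space (U+3000, bytes E3 80 80 in UTF-8) becomes a class of individual bytes, and the byte E3 also begins ラ and メ. In PHP, the `u` pattern modifier makes PCRE treat pattern and subject as UTF-8, which is the usual remedy. The function names below are hypothetical.

```python
import re

FULL_WIDTH_SPACE = "\u3000"
keywords = FULL_WIDTH_SPACE + "ラメ単色"


def replace_bytewise(text):
    """Mimic a byte-oriented regex: the class holds the bytes E3, 80, 80."""
    raw = text.encode("utf-8")
    pattern = re.compile(b"[ \xe3\x80\x80]")
    return pattern.sub(b" ", raw)


def replace_charwise(text):
    """Character-oriented regex: the class holds U+0020 and U+3000."""
    return re.sub("[ \u3000]", " ", text)


def is_valid_utf8(data):
    """Report whether a byte string still decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    damaged = replace_bytewise(keywords)
    print(is_valid_utf8(damaged))
    print(damaged.decode("utf-8", errors="replace"))
    print(replace_charwise(keywords))
```

In the byte-wise version, 単 and 色 survive because neither contains the bytes E3 or 80, which is consistent with the corrupted output quoted in the question.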

Does multibyte character interfere with end-line character within a regex?

两盒软妹~` submitted on 2019-12-20 11:52:44
Question: With this regex: regex1 = /\z/ the following strings match: "hello" =~ regex1 # => 5 "こんにちは" =~ regex1 # => 5 but with these regexes: regex2 = /#$/?\z/ regex3 = /\n?\z/ they behave differently: "hello" =~ regex2 # => 5 "hello" =~ regex3 # => 5 "こんにちは" =~ regex2 # => nil "こんにちは" =~ regex3 # => nil What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n" ). Are the multibyte characters interfering with $/ ? How?
Answer 1: The problem you reported is definitely a bug of

How to check if the word is Japanese or English using PHP

孤者浪人 submitted on 2019-12-20 08:30:08
Question: I want to process English words and Japanese words differently in this function: function process_word($word) { if($word is english) { ///////// }else if($word is japanese) { //////// } } Thank you.
Answer 1: A quick solution that doesn't need the mb_string extension: if (strlen($str) != strlen(utf8_decode($str))) { // $str uses multi-byte chars (isn't English) } else { // $str is ASCII (probably English) } Or a modification of the solution provided by @Alexander Konstantinov: function isKanji(
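The second answer is cut off, so the following is not a reconstruction of it. It is a minimal Python sketch of a different, commonly used approach: testing for characters in the hiragana, katakana, and main CJK Unified Ideographs blocks. The name `classify_word` and the three labels are assumptions for illustration. Note that the ideograph block is shared with Chinese, so kanji alone cannot prove a word is Japanese.

```python
import re

# Hiragana (U+3040-309F), katakana (U+30A0-30FF),
# and the main CJK Unified Ideographs block (U+4E00-9FFF).
JAPANESE_CHARS = re.compile("[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def classify_word(word):
    """Return 'japanese', 'english', or 'other' (hypothetical labels)."""
    if JAPANESE_CHARS.search(word):
        return "japanese"
    if word.isascii():
        # Pure ASCII is only probably English, as the first answer notes.
        return "english"
    return "other"


if __name__ == "__main__":
    for sample in ("hello", "こんにちは", "漢字", "café"):
        print(sample, classify_word(sample))
```

Unlike a byte-length comparison, this keeps accented Latin words such as "café" out of the Japanese branch.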

How to handle multibyte string in Python

和自甴很熟 submitted on 2019-12-19 08:07:25
Question: PHP has multibyte string functions for handling multibyte strings (e.g. CJK scripts). For example, I want to count the letters in a multibyte string using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in the string): japanese = "桜の花びらたち" print japanese print len(japanese)#return 21 instead of 7 Is there any package or function like mb_strlen in PHP?
Answer 1: Use Unicode strings: # Encoding: UTF-8 japanese = u"桜の花びらたち" print japanese print len
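The code in this excerpt is Python 2. For comparison, here is a minimal Python 3 sketch, where str is already a Unicode string and bytes is a separate type. The helper name `mb_strlen` is hypothetical, chosen only to mirror the PHP function mentioned in the question.

```python
def mb_strlen(data: bytes, encoding: str = "utf-8") -> int:
    """Count characters (code points) in an encoded byte string."""
    return len(data.decode(encoding))


japanese = "桜の花びらたち"
utf8_bytes = japanese.encode("utf-8")

char_count = len(japanese)    # counts code points
byte_count = len(utf8_bytes)  # counts bytes

if __name__ == "__main__":
    print(char_count, byte_count, mb_strlen(utf8_bytes))
```

Here `len` on the str gives 7 and on the UTF-8 bytes gives 21, matching the two numbers in the question. Code points are not always the same as user-perceived characters (combining marks count separately), which this sketch does not address.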

How to detect and echo the last vowel in a word?

好久不见. submitted on 2019-12-19 05:10:41
Question: $word = "Acrobat" (or Apple, Tea etc.) How can I detect and echo the last vowel of a given word with PHP? I tried the preg_match function and googled for hours, but couldn't find a proper solution. The string can contain multibyte letters like ü and ö.
Answer 1: Here's a multibyte-safe version of catching the last vowel in a string. $arr = array( 'Apple','Tea','Strng','queue', 'asartä','nő','ağır','NOËL','gør','æsc' ); /* these are the ones I found in character viewer in Mac so these vowels can be