multibyte

How to handle multibyte string in Python

為{幸葍}努か posted on 2019-12-01 05:57:13
PHP has multibyte string functions for handling multibyte strings (e.g. CJK scripts). For example, I want to count the characters in a multibyte string using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in the string):

    japanese = "桜の花びらたち"
    print japanese
    print len(japanese)  # returns 21 instead of 7

Is there any package or function like mb_strlen in PHP?

Use Unicode strings:

    # Encoding: UTF-8
    japanese = u"桜の花びらたち"
    print japanese
    print len(japanese)

Note the u in front of the string. To convert a bytestring into Unicode, use decode: "桜の花びらたち".decode(
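The snippets above are Python 2. As a minimal sketch, assuming Python 3 (where every str is already a Unicode string), the character count and the UTF-8 byte count can be compared like this. The variable names are illustrative only.

```python
# Assumes Python 3: str holds Unicode code points, bytes holds raw bytes.
japanese = "桜の花びらたち"

# Encoding to UTF-8 yields the byte string whose length the question observed.
utf8_bytes = japanese.encode("utf-8")

char_count = len(japanese)    # number of code points
byte_count = len(utf8_bytes)  # number of UTF-8 bytes

# Decoding the bytes gives back the original text.
decoded = utf8_bytes.decode("utf-8")

print(char_count, byte_count)
```

Each of these characters occupies three bytes in UTF-8, which is why the byte count is three times the character count.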

How to detect and echo the last vowel in a word?

冷暖自知 posted on 2019-12-01 02:01:41
$word = "Acrobat" (or Apple, Tea, etc.). How can I detect and echo the last vowel of a given word with PHP? I tried the preg_match function and googled for hours but couldn't find a proper solution. The string can contain multibyte letters such as ü and ö.

Here's a multibyte-safe version of catching the last vowel in a string.

    $arr = array(
        'Apple','Tea','Strng','queue', 'asartä','nő','ağır','NOËL','gør','æsc'
    );
    /* these are the ones I found in character viewer in Mac so these vowels can be extended. don't forget to add both lower and upper case versions of new ones because personally I wouldn't rely
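For comparison, here is a rough Python sketch of the same idea. It is not the PHP answer's method: instead of listing every accented vowel, it strips diacritics with Unicode decomposition and checks the base letter. The function name last_vowel and the EXTRA_VOWELS set are assumptions made for illustration, and the vowel inventory would need extending for other languages.

```python
import unicodedata

BASE_VOWELS = set("aeiou")
# Assumption: vowels that have no canonical decomposition to a base letter.
EXTRA_VOWELS = set("æøıœ")


def last_vowel(word):
    """Return the last vowel in word, or None if there is none."""
    # Compose first so an accented letter is a single code point.
    word = unicodedata.normalize("NFC", word)
    for ch in reversed(word):
        # Decompose the character and keep only its base letter.
        base = unicodedata.normalize("NFD", ch)[0].lower()
        if base in BASE_VOWELS or ch.lower() in EXTRA_VOWELS:
            return ch
    return None


for w in ["Apple", "Tea", "Strng", "queue", "asartä", "nő", "ağır", "NOËL", "gør", "æsc"]:
    print(w, last_vowel(w))
```

Because the string is iterated by code point rather than by byte, no multibyte-specific functions are needed.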

How to convert some multibyte characters into its numeric html entity using PHP?

六月ゝ 毕业季﹏ posted on 2019-11-30 16:12:54
Question: Test string:

    $s = "convert this: ";
    $s .= "–, —, †, ‡, •, ≤, ≥, μ, ₪, ©, ® y ™, ⅓, ⅔, ⅛, ⅜, ⅝, ⅞, ™, Ω, ℮, ∑, ⌂, ♀, ♂ ";
    $s .= "but, not convert ordinary characters to entities";

Answer 1:

    $encoded = mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');

Assuming your input string is UTF-8, this should encode almost everything into numeric entities.

Answer 2: Well, htmlentities doesn't work correctly. Fortunately someone has posted code on the PHP website that seems to do the translation of multibyte
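As a point of comparison, a minimal Python sketch of the same conversion: encoding to ASCII with the standard xmlcharrefreplace error handler replaces every non-ASCII character with a decimal numeric entity and leaves ordinary characters alone. The function name to_numeric_entities is hypothetical. Note that ASCII characters such as & and < are not escaped by this approach.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal entity like &#8804;."""
    # 'xmlcharrefreplace' is a built-in codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


s = "convert this: ≤, ≥, © but, not convert ordinary characters to entities"
print(to_numeric_entities(s))
```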

How to convert some multibyte characters into its numeric html entity using PHP?

半城伤御伤魂 posted on 2019-11-30 16:12:27
Test string:

    $s = "convert this: ";
    $s .= "–, —, †, ‡, •, ≤, ≥, μ, ₪, ©, ® y ™, ⅓, ⅔, ⅛, ⅜, ⅝, ⅞, ™, Ω, ℮, ∑, ⌂, ♀, ♂ ";
    $s .= "but, not convert ordinary characters to entities";

    $encoded = mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');

Assuming your input string is UTF-8, this should encode almost everything into numeric entities.

Well, htmlentities doesn't work correctly. Fortunately someone has posted code on the PHP website that seems to do the translation of multibyte characters properly.

I did work on decoding ASCII into HTML-coded text (&#xxxx). https://github.com/hellonearthis

Fastest bitwise xor between two multibyte binary data variables

百般思念 posted on 2019-11-30 12:37:45
What is the fastest way to implement the following logic:

    def xor(data, key):
        l = len(key)
        buff = ""
        for i in range(0, len(data)):
            buff += chr(ord(data[i]) ^ ord(key[i % l]))
        return buff

In my case, key is a 20-byte SHA-1 digest, and data is some binary data between 20 bytes and a few (1, 2, 3) megabytes long.

UPDATE: OK guys. Here's a 3.5 times faster implementation, which splits data and key into chunks of 4, 2 or 1 bytes (in my case, most of the time it's a 4-byte long integer):

    def xor(data, key):
        index = len(data) % 4
        size = (4, 1, 2, 1)[index]
        type = ('L', 'B', 'H', 'B')[index]
        key_len = len(key)
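One alternative, sketched here for Python 3 and not taken from the post, is to stretch the key to the length of the data and XOR both as single big integers. The name xor_bytes is hypothetical, and no claim is made that this is the fastest option. It only shows the same logic without a per-byte Python loop.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with key, repeating key cyclically to cover all of data."""
    if not key:
        raise ValueError("key must not be empty")
    # Repeat the key enough times, then trim it to the data length.
    repeats = len(data) // len(key) + 1
    stretched = (key * repeats)[:len(data)]
    # Treat both byte strings as big integers and XOR them in one operation.
    a = int.from_bytes(data, "big")
    b = int.from_bytes(stretched, "big")
    return (a ^ b).to_bytes(len(data), "big")


key = bytes(range(20))  # stand-in for a 20-byte SHA-1 digest
data = b"some binary data that is longer than the key"
encrypted = xor_bytes(data, key)
print(xor_bytes(encrypted, key) == data)
```

Applying the function twice with the same key returns the original data, which gives a quick sanity check.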

What is a multibyte character set?

不想你离开。 posted on 2019-11-30 04:45:24
Does the term multibyte refer to a charset whose characters can (but don't have to) be wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are always wider than 1 byte (e.g. UTF-16)? In other words: what is meant when somebody talks about multibyte character sets?

The term is ambiguous, but in my internationalization work, we typically avoided using the term "multibyte character sets" to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that had one or more bytes to define each character (excluding encodings that require only one
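To make the distinction concrete, the following Python sketch prints how many bytes each character takes in UTF-8, UTF-16 and Shift_JIS (chosen here as an example of a legacy encoding, which is an assumption and not something the answer names). The helper name encoded_widths is hypothetical.

```python
def encoded_widths(text, encoding):
    """Return the number of bytes each character of text needs in encoding."""
    return [len(ch.encode(encoding)) for ch in text]


for encoding in ("utf-8", "utf-16-le", "shift_jis"):
    # "a" is ASCII, "桜" is a kanji.
    print(encoding, encoded_widths("a桜", encoding))

# Characters outside the Basic Multilingual Plane need two UTF-16 code units.
print("utf-16-le", encoded_widths("😀", "utf-16-le"))
```

UTF-8 and Shift_JIS use one byte for ASCII and more for other characters, while UTF-16 uses at least two bytes for every character and four for those outside the Basic Multilingual Plane.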

Split a sentence into separate words

守給你的承諾、 posted on 2019-11-30 00:33:54
I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary of Chinese words (in a database). The script will: try to find the first two characters of the sentence in the database (主楼); if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.
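A closely related technique is forward maximum matching: at each position, take the longest dictionary word that starts there. The sketch below is a variant of the idea described above and not the poster's exact procedure. It assumes the dictionary fits in an in-memory set (the post uses a database), and the function name segment and the max_len parameter are hypothetical.

```python
def segment(sentence, dictionary, max_len=4):
    """Split sentence into words by greedy longest-match against dictionary."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is accepted as a fallback even if unknown.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


dictionary = {"主楼", "怎么", "走"}
print(segment("主楼怎么走", dictionary))
```

Greedy matching can pick the wrong split when words overlap, so real segmenters usually add statistics on top of the dictionary.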

Fastest bitwise xor between two multibyte binary data variables

我与影子孤独终老i posted on 2019-11-29 17:35:04
Question: What is the fastest way to implement the following logic:

    def xor(data, key):
        l = len(key)
        buff = ""
        for i in range(0, len(data)):
            buff += chr(ord(data[i]) ^ ord(key[i % l]))
        return buff

In my case, key is a 20-byte SHA-1 digest, and data is some binary data between 20 bytes and a few (1, 2, 3) megabytes long.

UPDATE: OK guys. Here's a 3.5 times faster implementation, which splits data and key into chunks of 4, 2 or 1 bytes (in my case, most of the time it's a 4-byte long integer):

    def xor(data, key):

check if is multibyte string in PHP

梦想的初衷 posted on 2019-11-29 15:33:03
Question: I want to check whether a string is multibyte in PHP. Any idea how to accomplish this? Example:

    <?php
    $string = "I dont have idea that is what i am...";
    if( is_multibyte( $string ) ) {
        echo 'yes!!';
    } else {
        echo 'ups!';
    }
    ?>

Maybe (rule 8 bytes):

    <?php
    // A string has multibyte characters when its byte length exceeds its character count.
    if( strlen( $string ) > mb_strlen( $string ) ) {
        return true;
    } else {
        return false;
    }
    ?>

I read: Variable width encoding - WIKI and UTF-8 - WIKI

Answer 1: There are two interpretations. The first is that every character is multibyte. The second is
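The same byte-length comparison can be sketched in Python. The first function checks whether every character is multibyte, matching the first interpretation in the answer. The second checks whether at least one character is multibyte, which is an assumption about a looser reading because the answer text is cut off before it describes the second interpretation. Both function names are hypothetical, and UTF-8 is assumed as the encoding.

```python
def all_multibyte(text, encoding="utf-8"):
    """True if text is non-empty and every character needs more than one byte."""
    return bool(text) and all(len(ch.encode(encoding)) > 1 for ch in text)


def has_multibyte(text, encoding="utf-8"):
    """True if at least one character needs more than one byte."""
    # The byte length exceeds the character count only when some
    # character is encoded with more than one byte.
    return len(text.encode(encoding)) > len(text)


for sample in ("abc", "naïve", "桜の花"):
    print(sample, has_multibyte(sample), all_multibyte(sample))
```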

mb_detect_encoding detects ASCII as UTF-8?

試著忘記壹切 posted on 2019-11-29 07:19:52
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database, based on the PHP mb_ functions. Currently it looks like this:

    $val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));

However, when mb_detect_encoding() is supplied an ASCII string (with special characters in the Latin1 range 192-255), it detects it as UTF-8, so in the subsequent attempt to convert everything to proper UTF-8 all special characters are removed. I tried writing my own method by looking for Latin1 values, and if none occurred I would go on to letting mb_detect
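A common workaround, shown here as a Python sketch and not as the poster's solution, is to skip detection and attempt a strict UTF-8 decode first, falling back to Latin-1 only when that fails. The function name to_utf8 and the choice of Latin-1 as the only fallback are assumptions. Some Latin-1 byte sequences are also valid UTF-8, so the result can still be wrong in rare cases.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, assuming it is UTF-8 or Latin-1."""
    try:
        # Strict decoding raises if the bytes are not valid UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        text = raw.decode("latin-1")
    return text.encode("utf-8")


latin1_value = "é".encode("latin-1")  # the single byte 0xE9
print(to_utf8(latin1_value) == "é".encode("utf-8"))
```

Pure ASCII input passes through unchanged, since ASCII is a subset of both encodings.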