cjk

Encoding error in Python with Chinese characters

六眼飞鱼酱① submitted on 2019-11-27 21:41:54
I'm a beginner having trouble decoding several dozen CSV files with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7. I do not know the encoding of the input files, so I have tried all the possible encodings I am aware of: GB18030, UTF-7, UTF-8, UTF-16 and UTF-32 (LE & BE). Also, for good measure, GBK and GB2312, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line, except GB18030. I thought this would be the solution because it read through the first few files and
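A brute-force "try each codec in turn" helper is a reasonable first step; the sketch below (Python 3, function name and candidate list are my own) also explains the asker's observation: GB18030 assigns a meaning to nearly every byte sequence, so it almost never raises an error, which is why it should be tried last among the likely candidates. For serious work, a statistical detector such as the third-party chardet or charset-normalizer packages is more reliable.

```python
def sniff_decode(raw, candidates=('utf-8', 'utf-16', 'gb18030')):
    """Return (text, encoding) for the first candidate codec that decodes cleanly.

    GB18030 is deliberately last: it decodes almost any byte stream without
    error, so an earlier position would mask the other candidates.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')
```

Note that a successful decode only proves the bytes were *valid* in that codec, not that the codec was the one actually used; eyeballing the result (or counting how many characters fall in the CJK blocks) is still needed.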

Simplified Chinese Unicode table

依然范特西╮ submitted on 2019-11-27 20:52:43
Where can I find a Unicode table showing only the simplified Chinese characters? I have searched everywhere but cannot find anything. UPDATE: I have found that there is another encoding called GB 2312 - http://en.wikipedia.org/wiki/GB_2312 - which contains only simplified characters. Surely I can use this to get what I need? I have also found this file which maps GB2312 to Unicode - http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt - but I'm not sure whether it's accurate. If that table isn't correct, maybe someone could point me to one that is, or maybe just a table of
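Instead of hunting for a downloadable mapping table, the GB2312 codec itself can serve as the membership test: a character is in the GB2312 (simplified-era) repertoire exactly when it encodes without error. A sketch, assuming Python's `gb2312` codec strictly implements the GB2312 repertoire (the function name is mine):

```python
def in_gb2312(ch):
    """True if ch is encodable in GB2312, i.e. in the simplified-era repertoire."""
    try:
        ch.encode('gb2312')
        return True
    except UnicodeEncodeError:
        return False
```

For example, the simplified 汉 encodes, while its traditional form 漢 (which only entered the GBK/GB18030 extensions) does not. Iterating this test over the CJK Unified Ideographs block would generate the table the asker is looking for.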

How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

↘锁芯ラ submitted on 2019-11-27 20:13:15
I want to split a sentence into a list of words. For English and European languages this is easy; just use split(): >>> "This is a sentence.".split() ['This', 'is', 'a', 'sentence.'] But I also need to deal with sentences in languages such as Chinese that don't use whitespace as a word separator. >>> u"这是一个句子".split() [u'\u8fd9\u662f\u4e00\u4e2a\u53e5\u5b50'] Obviously that doesn't work. How do I split such a sentence into a list of words? UPDATE: So far the answers seem to suggest that this requires natural language processing techniques and that the word boundaries in Chinese are ambiguous. I'm
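As the update notes, real word segmentation needs a dictionary-backed or statistical tool (the third-party jieba package, with `jieba.cut(...)`, is the usual choice for Chinese). What the standard library *can* do without any NLP is show why `split()` fails and fall back to a character-level split, since each ideograph is a single code point:

```python
sentence = "这是一个句子"

# str.split() finds no whitespace, so the whole sentence comes back as one item:
assert sentence.split() == ["这是一个句子"]

# Character-level fallback: one list element per ideograph.
chars = list(sentence)
```

Character-level splitting is only an approximation (most Chinese words are two or more characters), but it is often good enough for indexing or display purposes.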

How to keep the Chinese or other foreign language as they are instead of converting them into codes?

耗尽温柔 submitted on 2019-11-27 16:14:29
DOMDocument seems to convert Chinese characters into codes; for instance, 你的乱发 will become ä½ çš„ä¹±å‘ How can I keep Chinese or other foreign-language text as it is instead of converting it into codes? Below is my simple test: $dom = new DOMDocument(); $dom->loadHTML($html); If I add this before loadHTML(), $html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); I get 你的乱发 Even though the converted codes will be displayed as Chinese characters, 你的乱发 still is not the 你的乱发 I am after.... DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the
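Although the question is about PHP, the garbage shown (`ä½ çš„ä¹±å‘`) is the classic symptom of UTF-8 bytes being reinterpreted one byte at a time as Windows-1252, which is what loadHTML() does when it is not told the input is UTF-8. The byte mechanics are easy to demonstrate in Python (only a substring is used here, because a few UTF-8 continuation bytes such as 0x8F have no Windows-1252 mapping at all):

```python
# 你 = U+4F60 → UTF-8 bytes e4 bd a0; 的 = U+7684 → e7 9a 84
utf8_bytes = '你的'.encode('utf-8')

# Misreading each byte as a Windows-1252 character reproduces the mojibake:
mojibake = utf8_bytes.decode('cp1252')   # 'ä½\xa0çš„'
```

The commonly cited PHP-side fix is to declare the encoding before parsing, e.g. `$dom->loadHTML('<?xml encoding="utf-8"?>' . $html);`, rather than converting the text to HTML entities.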

Convert numbered to accentuated Pinyin?

我的未来我决定 submitted on 2019-11-27 14:10:25
Question: Given a source text like nin2 hao3 ma (which is a typical way to write ASCII Pinyin, without proper accented characters) and given a (UTF-8) conversion table like a1;ā e1;ē i1;ī o1;ō u1;ū ü1;ǖ A1;Ā E1;Ē ... how would I convert the source text into nín hǎo ma ? For what it's worth, I'm using PHP, and this might be a regex I'm looking into? Answer 1: <?php $in = 'nin2 hao3 ma'; $out = 'nín hǎo ma'; function replacer($match) { static $trTable = array( 1 => array( 'a' => 'ā', 'e' => 'ē', 'i' => 'ī',
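The same regex-plus-table approach can be sketched in Python, extended with the standard tone-mark placement rule (mark 'a' or 'e' if present, the 'o' of 'ou', otherwise the last vowel). The function names are mine; neutral tone (5) and uppercase input are deliberately not handled in this sketch:

```python
import re

# Tone marks per vowel, indexed by tone number 1-4 (index 0 is the bare vowel).
TONES = {
    'a': 'aāáǎà', 'e': 'eēéěè', 'i': 'iīíǐì',
    'o': 'oōóǒò', 'u': 'uūúǔù', 'ü': 'üǖǘǚǜ',
}

def accent(syllable, tone):
    # Standard placement rule: 'a'/'e' take the mark if present; in 'ou'
    # the 'o' takes it; otherwise the last vowel does.
    for target in ('a', 'e'):
        if target in syllable:
            return syllable.replace(target, TONES[target][tone], 1)
    if 'ou' in syllable:
        return syllable.replace('o', TONES['o'][tone], 1)
    for ch in reversed(syllable):
        if ch in TONES:
            return syllable.replace(ch, TONES[ch][tone], 1)
    return syllable

def convert(text):
    # A syllable is a run of letters followed by a tone digit 1-4.
    return re.sub(r'([a-zü]+)([1-4])',
                  lambda m: accent(m.group(1), int(m.group(2))), text)
```

With this, `convert('nin2 hao3 ma')` yields `'nín hǎo ma'`; syllables without a digit, like the neutral-tone `ma`, pass through unchanged.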

Conversion from Simplified to Traditional Chinese

夙愿已清 submitted on 2019-11-27 13:42:49
Question: If a website is localized/internationalized with a Simplified Chinese translation... Is it possible to reliably and automatically convert the text to Traditional Chinese in a high-quality way? If so, will the result be extremely high quality, or just a good starting point for a translator to tweak? Are there open-source tools (ideally in PHP) to do such a conversion? Is the conversion better in one direction than the other (simplified -> traditional, or vice versa)? Answer 1: Short answer: No, not reliably+high
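The reason "no, not reliably" is the right short answer is that the mapping is not one-to-one: a character-for-character table works for most characters but fails wherever one simplified character merged several traditional ones (发 corresponds to both 發 "emit" and 髮 "hair", depending on the word). A tiny illustrative sketch with a hand-picked table (a real tool such as the open-source OpenCC does word-aware conversion and is the usual choice):

```python
# Deliberately tiny demo table; a real table has thousands of entries and
# cannot stay character-for-character: e.g. 发 maps to 發 or 髮 by context.
S2T = {'简': '簡', '体': '體', '汉': '漢', '发': '發'}

def s2t(text):
    """Naive character-for-character simplified-to-traditional conversion."""
    return ''.join(S2T.get(ch, ch) for ch in text)
```

This is exactly the "good starting point for a translator to tweak" level of quality the question asks about, not a publishable conversion.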

How to determine if a character is a Chinese character

强颜欢笑 submitted on 2019-11-27 13:20:19
How do I determine whether a character is a Chinese character using Ruby? An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series - check the table of contents at the start of the article too). I haven't used Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Also note that it's a unified system including Japanese and Korean characters (some characters are shared between them) - not sure if you can distinguish which are
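The usual approach, regardless of language, is a code-point range check against the CJK blocks; shown here as a Python sketch (the same ranges work in a Ruby regex such as `/[\u{4E00}-\u{9FFF}]/`). As the excerpt says, Han unification means this detects CJK ideographs generally, it cannot tell Chinese usage apart from Japanese kanji:

```python
def is_cjk(ch):
    """True if ch is a CJK ideograph (covers the most common blocks only)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF   # Extension A
            or 0xF900 <= cp <= 0xFAFF)  # CJK Compatibility Ideographs
```

Further extension blocks exist above U+20000; add them if the text may contain rare characters.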

Detect Windows font size (100%, 125%, and 150%)

假装没事ソ submitted on 2019-11-27 12:18:37
I created an application that works perfectly until the user selects 125% or 150% font scaling, which breaks my application. I later found a way to find the font size by detecting the DPI. This was working great until people with Chinese versions of Windows 7 started using my application. The entire application breaks on Chinese Windows 7. From what I can tell (I can't really test it, as I only have the English version and installing the language packs does not cause the problem), Chinese characters are causing a weird DPI value that breaks my application. My current code works like this: if (dpi.DpiX == 120)
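The fragility here is the exact-match comparison: localized Windows builds can report DPI values outside the canonical 96/120/144 set, and `if (dpi.DpiX == 120)` then matches nothing. A more robust approach is to compute the scale factor from the 96-DPI baseline, sketched here in Python (the arithmetic is identical in C#):

```python
def scale_percent(dpi):
    """Map a reported DPI to a display scale percentage.

    Windows reports 96 DPI at 100% scaling, and scaling is proportional,
    so unusual values (e.g. on localized builds) still produce a usable
    percentage instead of falling through an exact-match ladder.
    """
    return round(dpi / 96 * 100)
```

The canonical values map as expected: 96 → 100, 120 → 125, 144 → 150, and anything in between degrades gracefully.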

PHP and C++ for UTF-8 code unit in reverse order in Chinese character

本秂侑毒 submitted on 2019-11-27 08:46:14
Question: The Unicode code points for the Chinese word 你好 are 4F60 and 597D respectively, which I got from this tool: http://rishida.net/tools/conversion/ The console application below will print out the hexadecimal byte sequence of 你好 as 60:4F:7D:59 . As you can see, it's in reverse order of the Unicode code point for each character: 60 first, then 4F, instead of 4F then 60. Why is that? Which is correct - the tool or the console app? Or both? void printHex (char * buf, char *filename) { FILE *fp; fp
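Both are correct: the tool shows the abstract code points, while the file is evidently UTF-16 little-endian, where the low-order byte of each 16-bit code unit is stored first. The two byte orders are easy to compare in Python:

```python
text = '你好'                      # code points U+4F60, U+597D

le = text.encode('utf-16-le')     # little-endian: low byte of each unit first
be = text.encode('utf-16-be')     # big-endian: bytes match the code points

print(le.hex())                   # 604f7d59  <- what the console app printed
print(be.hex())                   # 4f60597d  <- what the conversion tool shows
```

Files often carry a byte-order mark (FF FE for LE, FE FF for BE) at the start so readers can tell which variant they are looking at.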

Detect Chinese characters using Perl?

那年仲夏 submitted on 2019-11-27 07:21:40
Question: Is there any way to detect Chinese characters using Perl? And is there any way to split Chinese text on the dot symbol '.' cleanly? Answer 1: It depends on your particular notion of what a Chinese character is. Perhaps you're looking for /\p{Script=Hani}/ , but if we want to cast our net wide, the following regex pattern will match stuff that occurs in Chinese writing. Restrict if necessary. use 5.014; / (?: \p{Block=CJK_Compatibility} | \p{Block=CJK_Compatibility_Forms} | \p{Block=CJK
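The stdlib `re` module has no `\p{Script=Hani}` (Perl and the third-party Python `regex` package do), but the same effect can be approximated with explicit block ranges; a sketch covering the common blocks, plus the sentence-splitting half of the question, which usually means splitting on the ideographic full stop 。 rather than the ASCII '.':

```python
import re

# Blocks commonly seen in Chinese text; restrict or extend as needed,
# mirroring the Perl answer's \p{Block=...} alternation.
HAN = re.compile(r'[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]')

def has_chinese(s):
    """True if s contains at least one character from the listed CJK blocks."""
    return bool(HAN.search(s))

# Chinese prose ends sentences with the ideographic full stop U+3002:
sentences = '你好。再见。'.split('。')   # trailing '' because the text ends in 。
```

Note the trailing empty string when the text itself ends with 。; filter empties out if that matters for the application.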