cjk

Encoding error in Python with Chinese characters

送分小仙女□ submitted on 2019-11-26 20:47:20
Question: I'm a beginner having trouble decoding several dozen CSV files containing numbers and (Simplified) Chinese characters to UTF-8 in Python 2.7. I do not know the encoding of the input files, so I have tried all the possible encodings I am aware of: GB18030, UTF-7, UTF-8, UTF-16 and UTF-32 (LE and BE). Also, for good measure, GBK and GB2312, though these should be subsets of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line
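
The trial-and-error approach described in the question can be sketched as a small helper. This is a minimal sketch written for Python 3 rather than the asker's Python 2.7; the function name `decodable_encodings` and the sample bytes are hypothetical. A decode that succeeds does not prove the encoding is the right one, since several legacy encodings accept the same bytes and produce different text.

```python
def decodable_encodings(data, candidates):
    """Return the candidate encodings that decode `data` without raising."""
    working = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this encoding.
            # LookupError: Python does not know a codec by this name.
            continue
        working.append(name)
    return working


# Hypothetical sample; for a real file, use open(path, "rb").read().
sample = "编号,123\n".encode("gb18030")
candidates = ["utf-8", "gb2312", "gbk", "gb18030"]
print(decodable_encodings(sample, candidates))
```

Reading the file in binary mode and decoding explicitly keeps the decoding step separate from CSV parsing, which makes it easier to see where a failure comes from.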

Simplified Chinese Unicode table

 ̄綄美尐妖づ submitted on 2019-11-26 20:28:57
Question: Where can I find a Unicode table showing only the simplified Chinese characters? I have searched everywhere but cannot find anything. UPDATE: I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312) which contains only simplified characters. Surely I can use this to get what I need? I have also found this file which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt) but I'm not sure if it's accurate or
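
One way to explore the GB 2312 idea from the update is to ask Python's built-in `gb2312` codec whether it can encode a character. This is a minimal Python 3 sketch, not an authoritative list of simplified characters. The helper names `in_gb2312` and `gb2312_han` are hypothetical, and the check is limited to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) because GB 2312 also covers non-Han symbols such as Latin, Greek, and kana.

```python
def in_gb2312(ch):
    """Return True if the single character `ch` is encodable in GB 2312."""
    try:
        ch.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


def gb2312_han(text):
    """Return the Han characters in `text` that GB 2312 can represent."""
    return [ch for ch in text
            if "\u4e00" <= ch <= "\u9fff" and in_gb2312(ch)]


print(in_gb2312("汉"))   # simplified form
print(in_gb2312("漢"))   # traditional form
print(gb2312_han("汉漢a"))
```

Many characters are identical in simplified and traditional writing, so membership in GB 2312 shows that a character is used in simplified text, not that it is exclusive to it.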

Find all Chinese text in a string using Python and Regex

£可爱£侵袭症+ submitted on 2019-11-26 19:46:57
I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?

cryo

The short but relatively comprehensive answer for narrow Unicode builds of Python (excluding ordinals > 65535, which can only be represented in narrow Unicode builds via surrogate pairs):

    RE = re.compile(u'[⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
    nochinese = RE.sub('', mystring)

The code for building the RE, which you also need if you want to detect Chinese characters in the supplementary plane for wide builds:

    # -*- coding: utf-8 -*-
    import re
    LHan = [[0x2E80, 0x2E99], # Han
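
For comparison, here is a minimal Python 3 sketch of the same idea. Python 3.3 and later have no narrow/wide build distinction, so supplementary-plane ranges can go straight into the character class. The ranges below are a deliberately small, assumed subset (CJK Unified Ideographs plus Extensions A and B), not the answer's full list, and the names `HAN` and `strip_han` are hypothetical.

```python
import re

# Assumed subset of Han ranges: Extension A, the basic block, Extension B.
HAN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002a6df]")


def strip_han(text):
    """Remove every character matched by the HAN pattern."""
    return HAN.sub("", text)


print(strip_han("abc中文def"))
```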

What are the most common non-BMP Unicode characters in actual use? [closed]

若如初见. submitted on 2019-11-26 19:40:51
In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far. UPDATE I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to
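
A scan of the kind the asker describes can be sketched in a few lines. This is a minimal Python 3 sketch under the assumption that the text is already decoded into a `str`; it is not the asker's actual tool, and the name `non_bmp_counts` and the sample string are hypothetical.

```python
from collections import Counter


def non_bmp_counts(text):
    """Count characters whose code point lies above U+FFFF (outside the BMP)."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)


# Hypothetical sample: two Gothic letters (U+10330), one BMP Han character,
# and one emoji (U+1F600).
sample = "a\U00010330b\U00010330中\U0001F600"
for ch, n in non_bmp_counts(sample).most_common():
    print("U+%05X" % ord(ch), n, len(ch.encode("utf-8")), "bytes in UTF-8")
```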

How to determine if a character is a Chinese character

耗尽温柔 submitted on 2019-11-26 18:14:16
Question: How to determine if a character is a Chinese character using Ruby? Answer 1: An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series; also check the table of contents at the start of the article). I haven't used Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Also take note that it's a unified system including Japanese and
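
The question is about Ruby, but the check itself is language-independent. As an illustration only, here is a minimal sketch in Python 3 that relies on Unicode character names from the standard `unicodedata` module. The helper name `is_cjk_ideograph` is hypothetical. Because the ideographs are unified, as the answer notes, this tells you a character is a Han ideograph, not whether it is being used as Chinese or Japanese.

```python
import unicodedata


def is_cjk_ideograph(ch):
    """Return True if the Unicode name of `ch` marks it as a unified Han ideograph."""
    # The second argument is the default returned for unnamed code points.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


for ch in "中aあ":
    print(ch, is_cjk_ideograph(ch))
```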

Detect Windows font size (100%, 125%, and 150%)

回眸只為那壹抹淺笑 submitted on 2019-11-26 18:10:28
Question: I created an application that works perfectly until the user selects 125% or 150%, which breaks my application. I later found a way to find the font size by detecting the DPI. This was working great until people with Chinese versions of Windows 7 started using my application. The entire application breaks on Chinese Windows 7. From what I can tell (I can't really test it, for I only have the English version and installing the language packs does not cause the problem) Chinese characters

UTF-8 file output in R

我的梦境 submitted on 2019-11-26 17:54:34
I'm using R 2.15.0 on Windows 7 64-bit. I would like to output Unicode (CJK) text to a file. The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected:

    rty <- file("test.txt",encoding="UTF-8")
    write("在", file=rty)
    close(rty)
    rty <- file("test.txt",encoding="UTF-8")
    scan(rty,what=character())
    close(rty)

As shown by the output of scan:

    Read 1 item
    [1] "<U+5728>"

The file was not written with the UTF character itself, but some kind of ANSI-compliant fallback. Can I make it work right the first time (i.e. with a text file that has "在"

Java regex for support Unicode?

时光怂恿深爱的人放手 submitted on 2019-11-26 12:54:10
To match A to Z, we will use the regex [A-Za-z]. How can a regex match UTF-8 characters entered by the user, for example Chinese words like 环保部?

stema

What you are looking for are Unicode properties, e.g. \p{L} is any kind of letter from any language. So a regex to match such a Chinese word could be something like

    \p{L}+

There are many such properties; for more details see regular-expressions.info.

Another option is to use the modifier Pattern.UNICODE_CHARACTER_CLASS

In Java 7 there is a new property Pattern.UNICODE_CHARACTER_CLASS that enables the Unicode version of the predefined character classes

What are the most common non-BMP Unicode characters in actual use? [closed]

强颜欢笑 submitted on 2019-11-26 12:16:54
Question: In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be Chinese and Japanese characters

how to print chinese word in my code.. using python

拥有回忆 submitted on 2019-11-26 11:03:39
Question: This is my code:

    print '哈哈'.decode('gb2312').encode('utf-8')

...and it prints:

    SyntaxError: Non-ASCII character '\xe5' in file D:\zjm_code\a.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How do I print '哈哈'?

Update: When I use the following code:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    print '哈哈'

... it prints 鍝堝搱. That isn't what I wanted to get. My IDE is Ulipad, is this a bug with the IDE?

Second Update: This code will
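
A likely explanation for the garbled output is that the source file holds UTF-8 bytes (the '\xe5' in the error message is the first UTF-8 byte of 哈) while the console or IDE interprets the printed bytes as GBK. The following minimal Python 3 sketch reproduces that kind of mismatch and reverses it. It assumes a GBK consumer, which the question does not confirm, and the variable names are illustrative.

```python
text = "哈哈"

# The bytes a UTF-8 source file (or a UTF-8 encode) produces.
utf8_bytes = text.encode("utf-8")

# What a reader sees if those bytes are wrongly interpreted as GBK.
garbled = utf8_bytes.decode("gbk")
print(garbled)

# Reversing the wrong step recovers the original text.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered)
```

If that is the cause, the fix is to make the output encoding match what the console expects, not to change the string itself.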