cjk

Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

徘徊边缘 submitted on 2019-12-02 18:53:34
I have strings that are multilingual, consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean). Given such a string, I want to separate the English/French/etc. part into words using whitespace as the separator, and to separate the Chinese/Japanese/Korean part into individual characters, and then put all of those separated components into a list. Some examples should make this clear.

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only
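The hybrid behaviour the question asks for can be sketched with a single regular expression: match either one CJK character, or one run of non-space, non-CJK characters. (The character ranges below are an assumption covering only common Han, kana, and hangul blocks; extend them for your data.)

```python
import re

# Rough CJK ranges -- assumption: common Han, kana, and hangul blocks
# only; not exhaustive.
CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"

def hybrid_split(text):
    """Whitespace-delimited words, but each CJK character on its own."""
    return re.findall(rf"[{CJK}]|[^\s{CJK}]+", text)

print(hybrid_split("I love Python"))  # ['I', 'love', 'Python']
print(hybrid_split("我爱Python"))      # ['我', '爱', 'Python']
```

The alternation order matters: the single-CJK-character branch is tried first, so ideographs never get swallowed into a word run.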

Convert or extract TTC font to TTF - how to?

ぃ、小莉子 submitted on 2019-12-02 18:53:02
I have already spent more than 8 hours trying to make the STHeiti Medium.ttc.zip font work on Windows, but I can't make it work. Is anybody able to make it work on Windows?

Assuming that Windows doesn't really know how to deal with TTC files (which I honestly find strange), you can "split" the combined fonts in an easy way if you use fontforge. The steps are:

1. Download the file.
2. Unzip it (e.g., unzip "STHeiti Medium.ttc.zip").
3. Load Fontforge.
4. Open the file with Fontforge (e.g., File > Open).

Fontforge will tell you that there are two fonts "packed" in this particular TTC file (at least as of 2014-01-29)
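For scripting rather than the Fontforge GUI, note that the number of member fonts Fontforge reports is stored right at the start of the file: per the OpenType specification, a TrueType Collection begins with the tag b'ttcf', a version, and a big-endian count of member fonts. A minimal sketch of reading that count (for actually extracting the members, the fontTools library's TTCollection class is one scripted alternative):

```python
import struct

def ttc_font_count(data: bytes) -> int:
    """Return the number of fonts packed in a TTC blob.

    The TTC header is: 4-byte tag b'ttcf', two uint16 version fields,
    then a big-endian uint32 font count (per the OpenType spec).
    """
    tag, major, minor, num_fonts = struct.unpack(">4sHHI", data[:12])
    if tag != b"ttcf":
        raise ValueError("not a TTC file")
    return num_fonts

# Synthetic header: a v1.0 collection holding 2 fonts.
header = struct.pack(">4sHHI", b"ttcf", 1, 0, 2)
print(ttc_font_count(header))  # 2
```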

Converting chinese to pinyin

旧巷老猫 submitted on 2019-12-02 18:39:15
I've found places on the web such as http://www.chinesetopinyin.com/ that convert Chinese characters to pinyin (romanization). Does anyone know how to do this, or have a database that can be parsed? EDIT: I'm using C#, but would actually prefer a database/flat file.

A possible solution using Python: I think the Unicode database contains pinyin romanizations for Chinese characters, but these are not included in the unicodedata module's data. However, you can use some external libraries, like cjklib. Example:

# coding: UTF-8
import cjklib
from cjklib.characterlookup import CharacterLookup
c = u'好'
cjk
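Since the asker would prefer a database/flat file, one option is a plain character-to-reading lookup table; the Unihan database's kMandarin field provides this mapping for the whole CJK repertoire. A toy sketch (the four-entry table below is a placeholder standing in for the full file, and real text needs context to choose readings for polyphonic characters):

```python
# Toy lookup table -- in practice, load the full mapping from a flat
# file such as the Unihan database's kMandarin field. Assumption: one
# reading per character, which polyphonic characters violate.
PINYIN = {
    "好": "hǎo",
    "你": "nǐ",
    "中": "zhōng",
    "文": "wén",
}

def to_pinyin(text):
    """Replace each known character with its reading; pass others through."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

print(to_pinyin("你好"))  # nǐ hǎo
```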

Word break in languages without spaces between words (e.g., Asian)?

时光毁灭记忆、已成空白 submitted on 2019-12-02 18:11:09
I'd like to make MySQL full-text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages, and probably others, do not normally have whitespace between words. Search is not useful when you must type exactly the same sentence as appears in the text. I cannot just put a space between every character, because English must work too. I would like to solve this problem with PHP or MySQL. Can I configure MySQL to recognize characters which should be their own indexing units? Is there a PHP module that can recognize these characters so I could just throw spaces
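The "throw spaces" idea the excerpt ends on can be sketched as a regex substitution (shown here in Python; the same pattern ports to PHP's preg_replace): pad each Han character with spaces before indexing, so a default whitespace tokenizer treats each one as its own word while English words are left intact. (Assumption: Han range only; add kana and hangul ranges as needed.)

```python
import re

def space_cjk(text):
    """Surround each CJK ideograph with spaces, then collapse runs.

    English words are untouched, so a whitespace-based full-text
    indexer can handle both scripts in the same column.
    """
    spaced = re.sub(r"([\u4e00-\u9fff])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(space_cjk("我喜欢MySQL full text search"))
# 我 喜 欢 MySQL full text search
```

The same transformation must be applied to search queries so that query terms line up with the indexed tokens.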

Programming tips with Japanese Language/Characters [closed]

本小妞迷上赌 submitted on 2019-12-02 17:47:19
I have an idea for a few web apps to write to help me, and maybe others, learn Japanese better, since I am studying the language. My problem is that the site will be mostly in English, so it needs to fluently mix Japanese characters, usually hiragana and katakana, but later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, with UTF-8 content types. However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is けす: I need to take that verb and convert it to the te-form, けして.
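The けす → けして conversion follows the regular godan te-form rules, which can be sketched as a lookup on the verb's final kana. (A toy mapping only: it assumes the input is a regular godan verb, and ignores ichidan verbs and irregulars such as 行く, する, and 来る.)

```python
# Regular godan te-form endings, keyed by the verb's final kana.
# Assumption: input is a regular godan verb -- ichidan verbs and
# irregulars (行く, する, 来る) are not handled by this toy table.
TE_ENDINGS = {
    "す": "して",
    "く": "いて",
    "ぐ": "いで",
    "う": "って", "つ": "って", "る": "って",
    "ぬ": "んで", "ぶ": "んで", "む": "んで",
}

def te_form(verb):
    """Replace the final kana with its te-form ending."""
    return verb[:-1] + TE_ENDINGS[verb[-1]]

print(te_form("けす"))   # けして
print(te_form("飲む"))   # 飲んで
```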

Django: How to add Chinese support to the application

泪湿孤枕 submitted on 2019-12-02 17:45:46
I am trying to add Chinese to my application written in Django, and I am having a really hard time with it. I have spent half a day trying different approaches, with no success. My application supports a few languages; this is part of the settings.py file:

TIME_ZONE = 'Europe/Dublin'
LANGUAGE_CODE = 'en'
LOCALES = (
    # English
    ('en', u'English'),
    # Norwegian
    ('no', u'Norsk'),
    # Finnish
    ('fi', u'Suomi'),
    # Simplified Chinese
    ('zh-CN', u'简体中文'),
    # Traditional Chinese
    ('zh-TW', u'繁體中文'),
    # Japanese
    ('ja', u'日本語'),
)

At the moment all languages but Chinese work perfectly. This is the content of the locale
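For reference, a settings fragment that works in recent Django versions is sketched below. Assumptions worth checking against the actual Django release in use: the built-in setting is named LANGUAGES (not LOCALES), language codes are lowercase, Chinese is 'zh-hans'/'zh-hant' on Django 1.7+ ('zh-cn'/'zh-tw' on older releases), and the matching locale/ directories use underscores, e.g. locale/zh_Hans/LC_MESSAGES/django.po.

```python
# settings.py -- sketch, assuming Django 1.7+ where the Chinese codes
# are 'zh-hans'/'zh-hant'; older releases used 'zh-cn'/'zh-tw'.
from django.utils.translation import gettext_lazy as _

LANGUAGE_CODE = 'en'
USE_I18N = True

LANGUAGES = [
    ('en', _('English')),
    ('no', _('Norsk')),
    ('fi', _('Suomi')),
    ('zh-hans', _('简体中文')),   # locale/zh_Hans/
    ('zh-hant', _('繁體中文')),   # locale/zh_Hant/
    ('ja', _('日本語')),
]
```

A case mismatch between the code in LANGUAGES and the locale directory name is a common reason one language silently fails while the rest work.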

Display japanese Text with furigana in UILabel

こ雲淡風輕ζ submitted on 2019-12-02 15:57:56
Question: For my app, a few months ago, I took the code from this site to use CTRubyAnnotation. This code, with a few changes to make it work with Swift 4, works perfectly. From this work I've created a class in which I wrote a function to use that code. This is the class in Swift 4:

import UIKit
extension String {
    func find(pattern: String) -> NSTextCheckingResult? {
        do {
            let re = try NSRegularExpression(pattern: pattern, options: [])
            return re.firstMatch(in: self, options: [], range: NSMakeRange(0, self

Unicode printing on a PRINTER in VB6

家住魔仙堡 submitted on 2019-12-02 14:51:22
Question: I'm trying to print a Unicode (Chinese) string on a printer (well, actually PDFCreator), but all I get is a VERTICAL print of the characters. I use the TextOutW function imported from gdi32.dll:

TextOutW dest.hDC, x, y, StrConv(szText, vbUnicode), Len(szText)

And if I try to print "0.12" (if I print Chinese characters, I get the same result anyway), I get:

0
.
1
2

If I use the dest.Print function, I am not able to print Unicode. Anyway, TextOutW works WONDERFULLY on the screen. Can anyone help

Unicode characters necessary for Japanese, Korean, and Chinese

与世无争的帅哥 submitted on 2019-12-02 12:43:38
Question: I'm trying to answer these basic questions without getting a degree in linguistics and early human history, which seems to be where every Google search has led.

Which Unicode characters are necessary to include in a font in order to support rendering of Japanese-language text?
Which Unicode characters are necessary to include in a font in order to support rendering of Chinese-language text?
Which Unicode characters are necessary to include in a font in order to support rendering of Korean
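As a rough starting point (an assumption, not a complete answer: full coverage needs further blocks such as CJK Unified Ideographs Extension A, CJK punctuation, and half-width katakana), the core Unicode blocks per script can be tabulated and checked directly:

```python
# Core Unicode blocks per CJK script. Assumption: common blocks only;
# complete font coverage needs additional blocks (extensions,
# punctuation, half-width forms).
BLOCKS = {
    "Hiragana":               (0x3040, 0x309F),
    "Katakana":               (0x30A0, 0x30FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hangul Jamo":            (0x1100, 0x11FF),
    "Hangul Syllables":       (0xAC00, 0xD7AF),
}

def script_blocks(ch):
    """Return the names of the blocks above that contain ch."""
    cp = ord(ch)
    return [name for name, (lo, hi) in BLOCKS.items() if lo <= cp <= hi]

print(script_blocks("好"))  # ['CJK Unified Ideographs']
print(script_blocks("あ"))  # ['Hiragana']
```

Japanese text draws on hiragana, katakana, and the Han ideograph block; Chinese mainly on the Han block; Korean mainly on the hangul blocks, which is why the three questions have overlapping but different answers.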

How do browsers deal with “Tofu” characters

喜欢而已 submitted on 2019-12-02 12:42:11
Question: I am using the Orbitron font in a hybrid Cordova/Android app that I am creating, quite simply because it is compact and has the clean, futuristic look that I am after. However, I realized not so long ago that Orbitron is a rather limited font, with support for little more than the basic Latin character set. I was about to embark on a switch to the Noto family of fonts that have been created by Google so there is "No more Tofu", tofu being the term used by typographers to describe