Differentiating CJK languages (Chinese, Japanese, Korean) in Android

傲寒 2020-12-18 04:04

I want to be able to recognize Chinese, Japanese, and Korean written characters, both as a general group and as subdivided languages. These are the reasons:

1 Answer
  • 2020-12-18 04:27

    Unicode

    CJK (and CJKV) in Unicode refers to Han ideographs, that is, the Chinese characters (汉字) used in Chinese, Japanese, Korean, and Vietnamese. In Unicode script naming, it does not cover the phonetic scripts such as Japanese Katakana and Hiragana or Korean Hangul. The Han ideographs are said to be unified, meaning that each ideograph has only one Unicode codepoint no matter which language it is used in.

    This means that Unicode (and consequently Android/Java) provides no way to determine the language from a single ideograph alone. Even Simplified and Traditional Chinese characters are not readily distinguished by their encoding. This is the same idea as not being able to know whether the character "a" belongs to English, French, or Spanish. More context is needed to determine that.
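
    To make the point concrete, here is a minimal sketch (the class and method names are mine, not from any library) that looks up the Unicode block of a Simplified character, a Traditional character, and a character shared by Chinese and Japanese. It relies on Character.UnicodeBlock, which is covered in the Android section of this answer.

```java
class HanUnificationDemo {
    // Hypothetical helper: Unicode block of the first codepoint in the string.
    static Character.UnicodeBlock blockOf(String s) {
        return Character.UnicodeBlock.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String simplified = "\u6C49";  // Simplified form of the character "Han"
        String traditional = "\u6F22"; // Traditional form of the same character
        String shared = "\u4E2D";      // "middle", written identically in Chinese and Japanese

        // All three report the same block, so the block says nothing about the language.
        System.out.println(blockOf(simplified));
        System.out.println(blockOf(traditional));
        System.out.println(blockOf(shared));
    }
}
```

    All three lookups land in CJK_UNIFIED_IDEOGRAPHS, which is why the surrounding text has to be inspected.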

    However, you can use the Unicode encoding to determine Japanese Hiragana/Katakana and Korean Hangul. And the presence of such characters would be a good indication that nearby Han Ideographs belong to the same language.

    Android

    You can find the codepoint at some index with

    int codepoint = Character.codePointAt(myString, offset);
    

    And if you wanted to iterate through the codepoints in a string:

    final int length = myString.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = Character.codePointAt(myString, offset);
    
        // use codepoint here
    
        offset += Character.charCount(codepoint);
    }
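
    The reason for advancing by Character.charCount rather than by 1 is that some ideographs (for example those in CJK Extension B) lie outside the Basic Multilingual Plane and occupy two Java chars. A small sketch of the same loop, wrapped in a helper whose name I made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIteration {
    // Hypothetical helper: collects every codepoint of the string in order.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = Character.codePointAt(s, offset);
            result.add(codepoint);
            // Advances by 2 for a surrogate pair, by 1 otherwise.
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", then U+20000 (a CJK Extension B ideograph, stored as a surrogate pair), then "b".
        String s = "a\uD840\uDC00b";
        System.out.println(s.length());             // number of chars
        System.out.println(codePointsOf(s).size()); // number of codepoints
    }
}
```

    For this input the string holds four chars but only three codepoints, so a plain char-by-char loop would split the ideograph in half.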
    

    Once you have the codepoint you can look up which code block it is in with

    Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
    

    And then you can use the codeblock to test for the ideograph or language.

    CJK

    Scanning the Unicode code blocks, I think these cover all the CJK ideographs, along with the related radical, symbol, and punctuation blocks. If I missed any, feel free to edit my answer or leave a comment.

    private boolean isCJK(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return (
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS.equals(block) ||
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A.equals(block) ||
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B.equals(block) ||
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C.equals(block) || // api 19
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D.equals(block) || // api 19
                Character.UnicodeBlock.CJK_COMPATIBILITY.equals(block) ||
                Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS.equals(block) ||
                Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS.equals(block) ||
                Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT.equals(block) ||
                Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT.equals(block) ||
                Character.UnicodeBlock.CJK_STROKES.equals(block) ||                        // api 19
                Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION.equals(block) ||
                Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS.equals(block) ||
                Character.UnicodeBlock.ENCLOSED_IDEOGRAPHIC_SUPPLEMENT.equals(block) ||    // api 19
                Character.UnicodeBlock.KANGXI_RADICALS.equals(block) ||
                Character.UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS.equals(block));
    }
    

    The blocks commented "api 19" are only available from API level 19. However, they can probably be removed safely if you need to support earlier versions, since they are rarely used. Also, Unicode defines a CJK Extension E, but at the time of this writing it is not supported in Android/Java. If you definitely need to include everything, you can compare the codepoints to the Unicode block ranges directly. This site is a convenient place to browse them. You can also see them at the Unicode site.
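
    As a rough sketch of the direct range comparison, the check below tests for CJK Unified Ideographs Extension E using the range published in the Unicode block list (U+2B820 to U+2CEAF). The class, constant, and method names are hypothetical.

```java
class CjkRangeCheck {
    // Range of the "CJK Unified Ideographs Extension E" block in the Unicode block list.
    static final int EXTENSION_E_START = 0x2B820;
    static final int EXTENSION_E_END = 0x2CEAF;

    // True if the codepoint falls inside the Extension E block range.
    static boolean isCjkExtensionE(int codepoint) {
        return codepoint >= EXTENSION_E_START && codepoint <= EXTENSION_E_END;
    }

    public static void main(String[] args) {
        System.out.println(isCjkExtensionE(0x2B820)); // first codepoint of the block
        System.out.println(isCjkExtensionE(0x4E2D));  // an ordinary CJK Unified Ideograph
    }
}
```

    The same pattern works for any other block that the platform's Character.UnicodeBlock does not yet know about.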

    If you don't need to support below API 19, then isIdeographic makes the test very easy (though I don't know if it returns exactly the same matches as the method above).

    private boolean isCJK(int codepoint) {
        return Character.isIdeographic(codepoint);
    }
    

    Or this one for API 24+:

    private boolean isCJK(int codepoint) {
        return (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN);
    }
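
    The three approaches do not always agree, because the block list also contains punctuation and symbol blocks while isIdeographic and UnicodeScript.HAN look at per-character properties. A small comparison sketch (helper names are mine) using an ordinary ideograph and the ideographic full stop U+3002:

```java
class CjkTestComparison {
    static Character.UnicodeBlock blockOf(int codepoint) {
        return Character.UnicodeBlock.of(codepoint);
    }

    static boolean isIdeographic(int codepoint) {
        return Character.isIdeographic(codepoint);
    }

    static boolean isHanScript(int codepoint) {
        return Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        // U+4E2D is an ordinary ideograph; U+3002 is the ideographic full stop.
        int[] samples = {0x4E2D, 0x3002};
        for (int cp : samples) {
            System.out.printf("U+%04X block=%s ideographic=%b han=%b%n",
                    cp, blockOf(cp), isIdeographic(cp), isHanScript(cp));
        }
    }
}
```

    U+3002 sits in CJK_SYMBOLS_AND_PUNCTUATION, so the block-based isCJK accepts it, while the other two tests reject it. Which behavior you want depends on whether punctuation should count as CJK for your purposes.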
    

    Japanese

    For testing Hiragana or Katakana this should work fine:

    private boolean isJapaneseKana(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return (
                Character.UnicodeBlock.HIRAGANA.equals(block) ||
                Character.UnicodeBlock.KATAKANA.equals(block) ||
                Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS.equals(block));
    }
    

    Or this if you are supporting API 24+:

    (This needs more testing.)

    private boolean isJapaneseKana(int codepoint) {
        // Look the script up once and compare it against both kana scripts.
        Character.UnicodeScript script = Character.UnicodeScript.of(codepoint);
        return (script == Character.UnicodeScript.HIRAGANA ||
                script == Character.UnicodeScript.KATAKANA);
    }
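
    One thing worth checking when choosing between the two kana tests is that blocks and scripts do not line up exactly. The sketch below (class and method names are hypothetical) runs both tests on the prolonged sound mark U+30FC and on a halfwidth katakana letter, U+FF71:

```java
class KanaTestComparison {
    // Block-based test, same blocks as the lower-API version above.
    static boolean isKanaByBlock(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return Character.UnicodeBlock.HIRAGANA.equals(block)
                || Character.UnicodeBlock.KATAKANA.equals(block)
                || Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS.equals(block);
    }

    // Script-based test, same as the API 24+ version above.
    static boolean isKanaByScript(int codepoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codepoint);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    public static void main(String[] args) {
        // U+3042 hiragana A, U+30FC prolonged sound mark, U+FF71 halfwidth katakana A.
        int[] samples = {0x3042, 0x30FC, 0xFF71};
        for (int cp : samples) {
            System.out.printf("U+%04X byBlock=%b byScript=%b%n",
                    cp, isKanaByBlock(cp), isKanaByScript(cp));
        }
    }
}
```

    The prolonged sound mark is in the KATAKANA block but its script is COMMON, so only the block test accepts it. The halfwidth letter has script KATAKANA but lives in the Halfwidth and Fullwidth Forms block, so only the script test accepts it.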
    

    Korean

    To test for Hangul on lower APIs you can use

    private boolean isKoreanHangul(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return (Character.UnicodeBlock.HANGUL_JAMO.equals(block) ||
                Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_A.equals(block) || // api 19
                Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_B.equals(block) || // api 19
                Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO.equals(block) ||
                Character.UnicodeBlock.HANGUL_SYLLABLES.equals(block));
    }
    

    Remove the lines marked API 19 if necessary.

    Or for API 24+:

    private boolean isKoreanHangul(int codepoint) {
        return (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HANGUL);
    }
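
    Putting the pieces together, here is a rough heuristic along the lines suggested in the Unicode section: treat any kana as a sign of Japanese, any Hangul as a sign of Korean, and report Han-only text as undetermined. This is only a sketch under my own assumptions. The class, enum, and method names are hypothetical, it returns on the first kana or Hangul it meets, and it uses isIdeographic, so it needs API 19+.

```java
class CjkLanguageGuess {
    enum Guess { JAPANESE, KOREAN, HAN_ONLY, NONE }

    static boolean isKana(Character.UnicodeBlock block) {
        return block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS;
    }

    static boolean isHangul(Character.UnicodeBlock block) {
        return block == Character.UnicodeBlock.HANGUL_JAMO
                || block == Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_A
                || block == Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_B
                || block == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // Scans the text once; the first kana or Hangul codepoint decides the result.
    static Guess guess(String text) {
        boolean sawHan = false;
        for (int offset = 0; offset < text.length(); ) {
            int cp = Character.codePointAt(text, offset);
            // of() returns null for unassigned codepoints; == handles that safely.
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (isKana(block)) return Guess.JAPANESE;
            if (isHangul(block)) return Guess.KOREAN;
            if (Character.isIdeographic(cp)) sawHan = true;
            offset += Character.charCount(cp);
        }
        // Han ideographs alone could be Chinese, Japanese, or Korean.
        return sawHan ? Guess.HAN_ONLY : Guess.NONE;
    }

    public static void main(String[] args) {
        System.out.println(guess("\u65E5\u672C\u8A9E\u3092")); // Han ideographs followed by a hiragana
        System.out.println(guess("\uD55C\uAD6D\uC5B4"));       // Hangul syllables
        System.out.println(guess("\u4E2D\u6587"));             // Han ideographs only
    }
}
```

    A HAN_ONLY result is where locale settings, surrounding text, or a proper language detector would have to take over.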
    

    Further study

    • Unicode East Asian scripts
    • Unicode CJK FAQs
    • Unicode Korean FAQs
    • Some source code that shows how Character.UnicodeScript works
    • CJK Unified Ideographs