How can I detect Japanese text in a Java string?

Submitted by 放肆的年华 on 2020-01-02 05:21:50

Question


I need to be able to detect Japanese characters in a Java string.

Currently I'm getting the UnicodeBlock of each character and checking whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything.

Any suggestions?


Answer 1:


I use the following Java method. It might not completely address your requirement, though.

<!-- language: lang-java -->
/**
 * Returns whether a character is a Chinese-Japanese-Korean (CJK) character.
 *
 * @param c
 *            the character to be tested
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final char c) {
    // Look the block up once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    // Note: Extension B lies outside the Basic Multilingual Plane, so a single
    // char can never fall in it; that check only becomes useful if the method
    // is changed to take an int code point.
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}

Furthermore, these should work for Hiragana and Katakana characters:

private boolean isHiragana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.HIRAGANA);
}

private boolean isKatakana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.KATAKANA);
}
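
To check a whole string rather than a single char, the block tests above can be combined and applied per code point. The following is only a sketch under some assumptions: the class name JapaneseBlockCheck, the method name containsJapanese, and the choice of blocks are illustrative and not part of the original answer, and it needs Java 8 or later for String.codePoints(). Because the Han ideograph blocks are shared with Chinese, a string of pure Chinese text will also return true.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class JapaneseBlockCheck {

    // Assumed selection of blocks: kana plus the main Han ideograph blocks.
    // HashSet is used because UnicodeBlock.of() may return null for
    // unassigned code points, and HashSet.contains(null) is safe.
    private static final Set<UnicodeBlock> JAPANESE_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.HIRAGANA,
            UnicodeBlock.KATAKANA,
            UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B));

    /** Returns true if at least one code point falls in one of the blocks above. */
    static boolean containsJapanese(String text) {
        // codePoints() joins surrogate pairs, so Extension B characters are seen whole.
        return text.codePoints()
                .anyMatch(cp -> JAPANESE_BLOCKS.contains(UnicodeBlock.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(containsJapanese("こんにちは")); // true
        System.out.println(containsJapanese("hello"));      // false
    }
}
```

Iterating over code points instead of chars is what makes the Extension B check meaningful, since those characters are stored as surrogate pairs in a Java string.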



Answer 2:


According to regular-expressions.info, Japanese isn't written in a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."

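In which case, a regex like this should do the trick. Note that Java's regex engine (Java 7 and later) expects the Is prefix on script names, so \p{Hiragana} is written \p{IsHiragana}. Also bear in mind that this pattern accepts the empty string and pure Latin text, and rejects any string containing spaces, digits, or punctuation:

yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")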
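
If the goal is to detect whether a string contains Japanese rather than whether it consists only of these scripts, searching with find() may be a better fit than matches(). The sketch below is a rough heuristic, not a definitive detector: the class and method names are hypothetical, it assumes Java 7 or later (where Pattern accepts script names with the Is prefix), and it treats the presence of kana as the signal for Japanese because Han characters alone are shared with Chinese.

```java
import java.util.regex.Pattern;

final class KanaRegexCheck {

    // Hiragana or Katakana: scripts used by Japanese but not by Chinese.
    private static final Pattern KANA =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}]");

    // Kana or Han: broader, but also matches Chinese text.
    private static final Pattern KANA_OR_HAN =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    /** True if the text contains at least one Hiragana or Katakana character. */
    static boolean containsKana(String text) {
        return KANA.matcher(text).find();
    }

    /** True if the text contains at least one kana or Han character. */
    static boolean containsJapaneseScript(String text) {
        return KANA_OR_HAN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsKana("これは日本語です"));    // true
        System.out.println(containsKana("中文"));                // false
        System.out.println(containsJapaneseScript("中文"));      // true
    }
}
```

The Latin script is left out of both patterns on purpose, since including it would make plain English text count as a match.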


Source: https://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string
