Regular expressions (regex) in Japanese

岁酱吖の submitted on 2019-11-30 04:55:47
slevithan

Python's built-in re module offers limited support for Unicode features (it has no \p{...} property syntax). Java is better, particularly Java 7 and later.

Java supports Unicode categories. E.g., \p{L} (and its shorthand, \pL) matches any letter in any language. This includes Japanese ideographic characters.
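As a rough illustration of the category syntax, the sketch below checks whether a string consists only of letters. The class and method names are made up for this example, and the Japanese sample text is written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class LetterCategoryDemo {
    // \p{L} is the Unicode general category "Letter"; \pL is its shorthand.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    static final Pattern LETTERS_SHORT = Pattern.compile("\\pL+");

    // True if the whole string consists of letters from any script.
    static boolean allLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Kanji, hiragana, katakana and Latin letters mixed together.
        String mixed = "\u6F22\u5B57\u3072\u3089\u304C\u306A\u30AB\u30BF\u30AB\u30CAabc";
        System.out.println(allLetters(mixed));                      // true
        System.out.println(allLetters("\u6F22\u5B57123"));          // false: digits are not letters
        System.out.println(LETTERS_SHORT.matcher(mixed).matches()); // true
    }
}
```

Note that \p{L} says nothing about which language the letters belong to, so it is a coarse filter rather than a Japanese-specific one.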

Java 7 supports Unicode scripts, including the Hiragana, Katakana, Han, and Latin scripts that Japanese text is typically composed of. You can match any character in one of these scripts using \p{IsHan}, \p{IsHiragana}, \p{IsKatakana}, and \p{IsLatin}. (Some other engines, such as Perl, accept the bare forms \p{Han}, \p{Hiragana}, etc., but java.util.regex requires the Is prefix or the script= keyword.) You can combine them in a character class such as [\p{IsHan}\p{IsHiragana}\p{IsKatakana}]. You can use an uppercase P (as in \P{IsHan}) to match any character except those in the Han script.
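Here is a small sketch of script-based matching that pulls runs of each script out of a mixed string. It spells the classes with the Is prefix (\p{IsHan} and so on), which is the form the java.util.regex.Pattern documentation describes. The class name, the runs helper, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptRunsDemo {
    // Sample text: two kanji, four hiragana, four katakana, then "abc".
    static final String TEXT =
        "\u6F22\u5B57\u3072\u3089\u304C\u306A\u30AB\u30BF\u30AB\u30CAabc";

    // Collects every run of characters matched by the given regex.
    static List<String> runs(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Length of the kanji run.
        System.out.println(runs("\\p{IsHan}+", TEXT).get(0).length());                        // 2
        // Hiragana and katakana combined in one character class.
        System.out.println(runs("[\\p{IsHiragana}\\p{IsKatakana}]+", TEXT).get(0).length());  // 8
        System.out.println(runs("\\p{IsLatin}+", TEXT));                                      // [abc]
        // Uppercase P negates: everything that is not Han.
        System.out.println(runs("\\P{IsHan}+", TEXT).get(0).length());                        // 11
    }
}
```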

Java also supports Unicode blocks (this predates Java 7). Unless you are running your code on Android (where scripts are not available), you should generally avoid blocks, since they are less useful and less accurate than Unicode scripts. There are a variety of blocks related to Japanese text, including \p{InHiragana}, \p{InKatakana}, \p{InCJK_Unified_Ideographs}, \p{InCJK_Symbols_and_Punctuation}, etc.
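To make the block syntax concrete, here is a minimal sketch that strips Japanese punctuation by block and then shows one case where a block gives a surprising answer. The class and method names are hypothetical. The \p{IsHan} check at the end uses the Is-prefixed script form from the Pattern documentation.

```java
import java.util.regex.Pattern;

class BlockDemo {
    // Removes every character in the CJK Symbols and Punctuation block.
    static String stripCjkPunctuation(String text) {
        return text.replaceAll("\\p{InCJK_Symbols_and_Punctuation}", "");
    }

    // True if the whole string s matches the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        // Three kanji, an ideographic comma, four hiragana, an ideographic full stop.
        String text = "\u65E5\u672C\u8A9E\u3001\u3072\u3089\u304C\u306A\u3002";
        System.out.println(stripCjkPunctuation(text).length()); // 7

        // U+3005, the ideographic iteration mark, is used like a kanji
        // but lives in the punctuation block, not the ideograph block.
        String iterationMark = "\u3005";
        System.out.println(matches("\\p{InCJK_Unified_Ideographs}", iterationMark));      // false
        System.out.println(matches("\\p{InCJK_Symbols_and_Punctuation}", iterationMark)); // true
        System.out.println(matches("\\p{IsHan}", iterationMark));                         // true
    }
}
```

The iteration mark is one reason a block-based kanji filter can drop characters that a script-based one keeps.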

Both Java and Python can refer to individual code points using \uFFFF, where FFFF is any four-digit hexadecimal number. Java 7 can refer to any Unicode code point, including those beyond the Basic Multilingual Plane, using e.g. \x{10FFFF}. Python's regex engine historically did not understand escapes for code points beyond the BMP (since Python 3.3 the re module accepts \u and \U escapes itself), but Python strings do, so you can embed a code point in a regex using e.g. \U0010FFFF (uppercase U followed by eight hex digits).
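On the Java side, the two escape forms can be sketched as follows. U+20BB7 is picked here only as an example of a kanji outside the Basic Multilingual Plane, and the class name is invented for the demonstration.

```java
import java.util.regex.Pattern;

class CodePointDemo {
    // U+20BB7 is a kanji outside the Basic Multilingual Plane (CJK Extension B).
    static final String SUPPLEMENTARY = new String(Character.toChars(0x20BB7));

    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Four-digit escape interpreted by the regex engine (note the doubled backslash).
        System.out.println(matches("\\u6F22", "\u6F22"));                      // true
        // Brace syntax reaches code points above U+FFFF (Java 7 and later).
        System.out.println(matches("\\x{20BB7}", SUPPLEMENTARY));              // true
        // It also works as the endpoint of a character class range.
        System.out.println(matches("[\\x{20000}-\\x{2A6DF}]", SUPPLEMENTARY)); // true
        // The string holds two UTF-16 code units but only one code point.
        System.out.println(SUPPLEMENTARY.length());                                  // 2
        System.out.println(SUPPLEMENTARY.codePointCount(0, SUPPLEMENTARY.length())); // 1
    }
}
```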

The Java 7 (?U) or UNICODE_CHARACTER_CLASS flag makes character class shorthands like \w and \d Unicode aware, so they will match Japanese ideographic characters, etc. (but note that \d will still not match kanji for numbers like 一二三四). Python 3 makes shorthand classes Unicode aware by default. In Python 2, shorthand classes are Unicode aware when you use the re.UNICODE or re.U flag.
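The effect of the flag on Java's shorthand classes can be checked with a short sketch like this one. The class name and constants are assumptions for the example. Fullwidth digits are included because they are what \d gains under (?U).

```java
import java.util.regex.Pattern;

class UnicodeShorthandDemo {
    static final String KANJI = "\u6F22\u5B57";                  // two kanji
    static final String FULLWIDTH_DIGITS = "\uFF11\uFF12\uFF13"; // fullwidth 1, 2, 3
    static final String KANJI_NUMERALS = "\u4E00\u4E8C\u4E09";   // kanji for one, two, three

    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\w+", KANJI));     // false: ASCII-only by default
        System.out.println(matches("(?U)\\w+", KANJI)); // true
        // The flag constant has the same effect as the embedded (?U).
        Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println(p.matcher(KANJI).matches()); // true
        System.out.println(matches("\\d+", FULLWIDTH_DIGITS));     // false
        System.out.println(matches("(?U)\\d+", FULLWIDTH_DIGITS)); // true
        System.out.println(matches("(?U)\\d+", KANJI_NUMERALS));   // false: these are letters
    }
}
```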

You're right that not all regex ideas carry over equally well to all scripts. Some things (such as letter casing) just don't make sense with Japanese text.

akazah

For Python

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

kanji = '漢字'
hiragana = 'ひらがな'
katakana = 'カタカナ'
text = kanji + hiragana + katakana  # not named str, which would shadow the built-in

# Match kanji (CJK Unified Ideographs block)
regex = '[\u4E00-\u9FFF]+'  # roughly '[一-龠]+'; 々 (U+3005) lies outside this range
match = re.search(regex, text)
print(match.group())  #=> 漢字

# Match hiragana (plus the prolonged sound mark ー)
regex = '[\u3040-\u309Fー]+'  # roughly '[ぁ-んー]+'
match = re.search(regex, text)
print(match.group())  #=> ひらがな

# Match katakana
regex = '[\u30A0-\u30FF]+'  # roughly '[ァ-ヾ]+'
match = re.search(regex, text)
print(match.group())  #=> カタカナ

The Java character classes do something like what you are looking for. They are the ones that start with \p here.

In Unicode there are two ways to classify characters from different writing systems. They are

  • Unicode Script (all characters used in a script, regardless of Unicode code points - may come from different blocks)
  • Unicode Block (code point ranges used for a specific purpose/script - may span across scripts and scripts may span across blocks)

The differences between these are explained rather more clearly on this web page from the official Unicode website.
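A concrete case where the two classifications disagree is the prolonged sound mark (U+30FC), which sits in the Katakana block but is assigned to the Common script. The sketch below shows this with the java.lang.Character API and with the corresponding regex classes. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class ScriptVsBlockDemo {
    // U+30FC, the prolonged sound mark used in katakana words.
    static final int PROLONGED_SOUND_MARK = 0x30FC;

    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(blockOf(PROLONGED_SOUND_MARK));  // KATAKANA
        System.out.println(scriptOf(PROLONGED_SOUND_MARK)); // COMMON
        String mark = new String(Character.toChars(PROLONGED_SOUND_MARK));
        System.out.println(Pattern.matches("\\p{InKatakana}", mark)); // true: block
        System.out.println(Pattern.matches("\\p{IsKatakana}", mark)); // false: script
    }
}
```

So a script-based katakana class needs the mark added explicitly if words containing it should match as a whole.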

In terms of matching characters in regular expressions in Java, you can use either classification mechanism since Java 7.

This is the syntax, as indicated in this tutorial from the Oracle website:

Script:

either \p{IsHiragana} or \p{script=Hiragana}

Block:

either \p{InHiragana} or \p{block=Hiragana}

Note that in one case it's "Is", in the other it's "In".

The bare syntax \p{Hiragana}, with no Is or In prefix, does not seem to be a valid option in Java. I tried it just in case and can confirm that it did not work for me.
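If you want to check which spellings your own JDK accepts, a throwaway sketch like the following will do. The compiles helper is invented for this example. As far as I know, on Java 7 and later the four prefixed or keyword forms compile and the bare form is rejected with a PatternSyntaxException, but it is worth running against your own runtime.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PropertySyntaxDemo {
    // True if java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "\\p{IsHiragana}", "\\p{script=Hiragana}", // script forms
            "\\p{InHiragana}", "\\p{block=Hiragana}",  // block forms
            "\\p{Hiragana}"                            // bare form
        };
        for (String regex : candidates) {
            System.out.println(regex + " -> " + compiles(regex));
        }
    }
}
```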
