Can sorting Japanese kanji words be done programmatically?

春和景丽 2020-12-14 09:19

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.

I work

4 answers
  • 2020-12-14 09:19

    Just a quick follow-up to explain the solution we eventually used. Thanks to all who recommended MeCab; it appears to have done the trick.

    We have a mostly-Rails backend, but in our case we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app: mainly hospital, company, and place names.

    So, what we did is:

    1. We converted all the source data (a list of 4000 hospitals with name, address, etc.) into .csv format (encoded as UTF-8, of course).
    2. Then, for developer use, we wrote a Ruby script that:
      1. Uses mecab to convert the contents of that file into Japanese phonetic readings (the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
      2. Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
      3. Uses the very convenient new CSV library in Ruby 1.9.2 to combine the input file with the mecab-converted file, so that the resulting file has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
    3. We used the data from the resulting .csv file to seed our Rails app with its built-in values.
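
    The merge in step 2 can be sketched roughly as follows. This is only an illustration, not the script we actually ran: the method names to_hiragana and merge_with_yomigana are made up for this example, the CSV data is passed in as strings rather than read from the seed files, there is no header row, and String#tr is swapped in for the NKF call so the sketch depends on nothing but the CSV library.

```ruby
require 'csv'

# Map full-width katakana onto hiragana. The two ranges have the same
# length and the same internal order, so tr shifts each character across.
# (Stand-in for NKF.nkf("-h1 -w", str); the long-vowel mark is left alone.)
def to_hiragana(text)
  text.tr('ァ-ヶ', 'ぁ-ゖ')
end

# Interleave each original column with its reading:
# NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, ...
# Assumes both inputs have the same number of rows and columns.
def merge_with_yomigana(original_csv, converted_csv)
  originals = CSV.parse(original_csv)
  readings  = CSV.parse(converted_csv)

  CSV.generate do |out|
    originals.zip(readings).each do |orig_row, yomi_row|
      pairs = orig_row.zip(yomi_row || [])
      out << pairs.flat_map { |value, yomi| [value, to_hiragana(yomi.to_s)] }
    end
  end
end

# Hypothetical one-row sample; the second string stands in for mecab output.
puts merge_with_yomigana("東京病院,東京都\n", "トウキョウビョウイン,トウキョウト\n")
```

    For real files, the two strings would come from File.read on the original and the mecab-converted .csv.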

    From time to time the client updates the source data, so we will need to do this whenever that happens.

    As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.

    UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.

    Many thanks to all who helped me!

  • 2020-12-14 09:34

    Nice to hear people are working with Japanese.

    I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:

    1. Take a list of Kanji
    2. Infer (guess) the yomigana
    3. Sort yomigana by gojuon.

    The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
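
    Step three is the easy part once the readings exist. A minimal sketch, assuming each name already comes with a hiragana reading (the sample data and the method name sort_by_reading are invented for illustration). Plain string comparison on hiragana follows Unicode code point order, which is close to gojuon order but is not a full Japanese collation; voiced kana, small kana, and long vowels may need extra handling.

```ruby
# Hypothetical sample data: [kanji, hiragana reading] pairs.
PEOPLE = [
  ['高橋', 'たかはし'],
  ['佐藤', 'さとう'],
  ['青木', 'あおき']
].freeze

# Sort on the reading rather than on the kanji itself.
def sort_by_reading(pairs)
  pairs.sort_by { |_kanji, reading| reading }
end

puts sort_by_reading(PEOPLE).map(&:first).join(', ')
# prints: 青木, 佐藤, 高橋
```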

    EDIT

    If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/

    It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).

  • 2020-12-14 09:35

    For data, look through the data files of Google's Japanese IME (Mozc) here.

    • http://mozc.googlecode.com/svn/trunk/src/data/

    There is lots of interesting data there, including IPA dictionaries.

    Edit:

    You may also try MeCab. It can use the IPA dictionary and can convert kanji to katakana for most words.

    • http://mecab.sourceforge.net/#format

    There are Ruby bindings for it too.

    • http://mecab.sourceforge.net/bindings.html

    And here is someone who tested Ruby with MeCab, using a tagger with -Oyomi:

    • http://hirai2.blog129.fc2.com/blog-entry-4.html
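
    As a rough sketch of the same idea without the bindings, the mecab command line tool can be called from Ruby through Open3. The only MeCab option used is -Oyomi; the method name katakana_reading and the injectable runner are assumptions made for this example so the wrapper can be exercised on a machine without MeCab installed. The readings returned depend on the dictionary in use.

```ruby
require 'open3'

# Default runner: pipes the text through `mecab -Oyomi` and returns stdout.
# Raises Errno::ENOENT if the mecab executable is not on the PATH.
MECAB_RUNNER = lambda do |text|
  stdout, status = Open3.capture2('mecab', '-Oyomi', stdin_data: text)
  raise "mecab exited with status #{status.exitstatus}" unless status.success?

  stdout
end

# Returns the katakana reading for each input line, joined by newlines.
# `runner` is a hypothetical hook for swapping in a stub.
def katakana_reading(text, runner: MECAB_RUNNER)
  runner.call(text).lines.map(&:chomp).join("\n")
end

# Usage with MeCab installed:
#   puts katakana_reading('東京都')
```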
  • 2020-12-14 09:46

    I'm not familiar with MeCab, but I think using MeCab is a good idea.

    Here is another method. If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.

    see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx


    Sorting prefectures by their pronunciation is not common. Most Japanese people are used to prefectures sorted by 「都道府県コード」 (prefecture code), e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県. These codes are defined in "JIS X 0401" or "ISO 3166-2:JP". See (Japanese Wikipedia): http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
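
    A small sketch of sorting by prefecture code instead of by reading. The table holds only the five codes quoted above (a real one would list all 47), and the names PREFECTURE_CODES and sort_by_prefecture_code are made up for this example.

```ruby
# Partial table: only the codes quoted in this answer.
PREFECTURE_CODES = {
  '北海道' => '01',
  '青森県' => '02',
  '東京都' => '13',
  '大阪府' => '27',
  '沖縄県' => '47'
}.freeze

# Sort prefecture names by their two-digit code.
# fetch raises KeyError for a name missing from the table.
def sort_by_prefecture_code(names)
  names.sort_by { |name| PREFECTURE_CODES.fetch(name) }
end

puts sort_by_prefecture_code(%w[沖縄県 東京都 北海道 大阪府 青森県]).join(', ')
# prints: 北海道, 青森県, 東京都, 大阪府, 沖縄県
```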
