BreakIterator not working correctly with Chinese text

Posted by 筅森魡賤 on 2019-12-23 20:14:44

Question


I used BreakIterator.getWordInstance to split Chinese text into words. Here is my example:

import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";

        //print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);

        printEachForward(boundary, stringToExamine);
    }

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}

My example text is taken from https://stackoverflow.com/a/42219474/954439

The output that I get is:

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我喜欢吃苹果
28: 。

whereas the expected output is:

0 I
1  
2 like
6  
7 to
9  
10 eat
13  
14 apples
20 .
21  
22 我
23 喜欢
25 吃
26 苹果
28 。

I even tried pure Chinese text, but it is still split only at whitespace and punctuation characters.

I am programming for a server, so the JAR file size is not a big concern. I am trying to count the words that differ between a given text and a sample text using Longest Common Subsequence (on words rather than characters).
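For context, the word-level comparison I have in mind could be sketched roughly as follows. This is only a minimal sketch: the class name WordLcs and its helper methods are made up for illustration, and "number of differing words" is assumed to mean the words of the given text that are not matched by the longest common subsequence with the sample text. The tokenizer keeps punctuation as separate tokens and drops whitespace-only segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class WordLcs {

    // Collects the segments reported by the iterator, skipping those that are
    // empty after trim() (plain spaces, tabs, newlines).
    static List<String> tokenize(String text, BreakIterator boundary) {
        List<String> words = new ArrayList<>();
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                words.add(segment);
            }
        }
        return words;
    }

    // Dynamic-programming length of the longest common subsequence of two word lists.
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed metric: words of `given` that are not part of the LCS with `sample`.
    static int differingWords(List<String> given, List<String> sample) {
        return given.size() - lcsLength(given, sample);
    }

    public static void main(String[] args) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.ENGLISH);
        List<String> sample = tokenize("I like to eat apples.", boundary);
        List<String> given = tokenize("I like to eat green apples.", boundary);
        System.out.println(differingWords(given, sample));
    }
}
```

The comparison itself does not care where the tokens come from, so it only gives meaningful results for Chinese once the tokenizer actually splits the text into words.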

What am I doing wrong?


Answer 1:


The standard BreakIterator does not support detection of "word" boundaries within unbroken strings of CJK ideographs. There is a bug report on this subject, but it was closed in 2006 as "Won't Fix".

Instead, you'll need to use the ICU implementation. If you're developing on Android, you already have this as android.icu.text.BreakIterator. Otherwise, you'll need to download the ICU4J library from http://site.icu-project.org/download, which has it as com.ibm.icu.text.BreakIterator.
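As a rough sketch of what the switch might look like, the example below repeats the question's loop with the ICU4J class in place of java.text.BreakIterator. It assumes the ICU4J jar is on the classpath; the class name IcuSample and the segments helper are made up for illustration. With ICU's dictionary-based segmentation the Chinese run should come back as several segments rather than one, although the exact boundaries can depend on the ICU version.

```java
import com.ibm.icu.text.BreakIterator; // ICU4J, not java.text
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sample class; requires the ICU4J library on the classpath.
final class IcuSample {

    // Returns every segment between consecutive word boundaries,
    // including whitespace and punctuation segments.
    static List<String> segments(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        boundary.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String segment : segments("I like to eat apples. 我喜欢吃苹果。")) {
            System.out.println(segment);
        }
    }
}
```

Apart from the import, the code is the same as in the question, because the ICU class offers the same first(), next() and DONE iteration pattern.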



Source: https://stackoverflow.com/questions/44507838/breakiterator-not-working-correctly-with-chinese-text
