Is Apache Tika able to extract foreign languages like Chinese, Japanese?

◇◆丶佛笑我妖孽 提交于 2019-12-07 06:42:55

问题


Is Apache Tika able to extract foreign languages like Chinese, Japanese?

I have the following code:

    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    InputStream stream = new ByteArrayInputStream(bytes);
    OutputStream outputstream = new ByteArrayOutputStream();
    ContentHandler textHandler = new BodyContentHandler(outputstream);
    Metadata metadata = new Metadata();
    // Set<String> langs = LanguageIdentifier.getSupportedLanguages();
    // metadata.set(Metadata.CONTENT_LANGUAGE, lang);
    // metadata.set(Metadata.FORMAT, hint);
    ParseContext context = new ParseContext();
    try {
        parser.parse(stream, textHandler, metadata, context);
        String extractedText = outputstream.toString();
        return extractedText;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }

If the input is a doc file that contains Chinese characters, each Chinese characters will be extracted as "?".

Thanks a lot!


回答1:


Apache Tika is able to extract unicode text from its supported file formats. As long as the file format can store unicode text (eg Chinese or Japanese characters), Apache Tika can extract it

Tika also includes a number of unit tests for this, which verify it works. One such test uses this sample chinese email. If with use the command line Tika app, and grab the first few lines, we see it working:

$ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
Alfresco MSG format testing ( MSG 格式測試 )
    From
    Tests Chang@FT (張毓倫)
    To
    Tests Chang@FT (張毓倫)
    Recipients
    tests.chang@fengttt.com

Or with this Japanese document:

$ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
ゾルゲの処刑記録、
ゾルゲと尾崎、淡々と最期 

You'll just need to ensure that any text output you generate gets stored in a suitable encoding (eg utf8), and the font you use to display it supports those glyphs!




回答2:


I have not seen anywhere written that Apache Tika does not support foreign languages like Chinese and Japanese. But when looking at following Apache Tika source file, I could not find both of the languages.

http://svn.apache.org/repos/asf/tika/branches/1.4/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties

However you can still try out the implementing in same way as discussed in five min user guide to test with your Chinese Doc file

https://tika.apache.org/1.4/parser_guide.html



来源:https://stackoverflow.com/questions/15638944/is-apache-tika-able-to-extract-foreign-languages-like-chinese-japanese

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!