Extract text from XML tags in an XML file using the Apache Tika parser

Posted by て烟熏妆下的殇ゞ on 2020-02-22 07:28:45

Question


I am trying to extract all the text from various documents, and I am using Apache Tika 1.4 for that.

RecursiveTikaParser parser = new RecursiveTikaParser(new AutoDetectParser());
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, parser);

RecursiveTikaParser here is just a wrapper around AutoDetectParser.

Its parse method looks something like this:

ContentHandler content = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
super.parse(stream, content, metadata, context);
System.out.println("Parsed text is " + content.toString());

This code has to handle multiple file types, which is why I am using AutoDetectParser.

In my testing I noticed that, given an XML file, I can only extract the text between the tags, not the comments or the tags themselves.

Is it possible to extract everything from the file with my current approach?


Answer 1:


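Try it like this: detect the MIME type first, and if the file is XML, parse it with TXTParser so that it is treated as plain text and the tags and comments are kept.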

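    // DETECTOR is not defined in the original answer; it is assumed here to be
    // a shared org.apache.tika.detect.Detector, e.g. new DefaultDetector().
    Metadata metadata = new Metadata();
    // TikaInputStream supports mark/reset, which the detector needs to peek at the stream.
    stream = TikaInputStream.get(stream, null);
    String mimeType = DETECTOR.detect(stream, metadata).toString();
    Parser parser;
    if (mimeType.equalsIgnoreCase("application/xml")) {
        // Treat XML as plain text so that tags and comments are kept.
        parser = new TXTParser();
    } else {
        parser = new AutoDetectParser();
    }

    // -1 disables the default 100,000-character write limit.
    ContentHandler content = new BodyContentHandler(-1);
    parser.parse(stream, content, metadata, new ParseContext());
    System.out.println(content.toString());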
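
If the tags and comments are wanted as separate items rather than as one raw string, a streaming XML reader can report them individually. The following is only a rough sketch under that assumption: it uses the JDK's StAX API instead of Tika, and `XmlEverything` and `extractAll` are hypothetical names that do not appear in the original answer.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlEverything {

    // Hypothetical helper (not part of Tika): returns element names, comments,
    // and non-blank character data, one item per line, in document order.
    public static String extractAll(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Turn off DTDs and external entities when the input is untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        out.append("tag: ").append(reader.getLocalName()).append('\n');
                        break;
                    case XMLStreamConstants.COMMENT:
                        out.append("comment: ").append(reader.getText().trim()).append('\n');
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Skip whitespace-only runs between elements.
                        if (!reader.isWhiteSpace()) {
                            out.append("text: ").append(reader.getText().trim()).append('\n');
                        }
                        break;
                    default:
                        break; // end tags, processing instructions, etc. are ignored
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<note><!-- reviewed --><to>Alice</to></note>";
        System.out.print(extractAll(xml));
    }
}
```

For the sample string in `main`, the sketch lists the element names, the comment, and the text in document order. Attributes and processing instructions are ignored here and would need extra handling.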


Source: https://stackoverflow.com/questions/21175172/extract-text-from-xml-tags-in-an-xml-file-using-apach-tika-parser
