get embedded resourses in doc files using apache tika

核能气质少年 提交于 2019-12-05 16:01:28

You need to define your own class which implements Parser and attach that to the ParseContext you supply when parsing the outer document. Your Parser will then be called for all embedded resources, allowing you to save them out if you want to

The best example I can think of for this is in the Tika CLI, as used by the -z (extract) flag. If you look in the source code for TikaCLI, you're looking for the FileEmbeddedDocumentExtractor as your example.

The simplest code would be something like:

final AutoDetectParser parser = new AutoDetectParser();

public class ExtractParser extends AbstractParser {
   private int att = 0;
   public Set<MediaType> getSupportedTypes(ParseContext context) {
     // Everything AutoDetect parser does
     return parser.getSupportedTypes(context);
   }
   public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
      // Stream to a new file
      File f = new File("out-" + (++att) + ".bin");
      FileOutputStream fout = new FileOutputStream(f);
      IOUtils.copy(strea, fout);
      fout.closee();
   }
}

InputStream input = new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, extractParser);
parser.parse(input, handler, metadata, context);

You can also use the EmbeddedDocumentExtractor interface if you'd rather, depends on what you want to do if it's better to use Parser directly

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!