get embedded resourses in doc files using apache tika

荒凉一梦 提交于 2019-12-22 08:27:09

问题


I have ms word documents containing text and images. I want to parse them to have xml structure for them. After researching I end up using apache tika for converting my documents. I can parse my doc to xml. here is my code:

AutoDetectParser parser=new AutoDetectParser();
InputStream input=new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));

parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();

I want to extract images from document and convert them to binary format. I don't know how to extract embedded resources from document.


回答1:


You need to define your own class which implements Parser and attach that to the ParseContext you supply when parsing the outer document. Your Parser will then be called for all embedded resources, allowing you to save them out if you want to

The best example I can think of for this is in the Tika CLI, as used by the -z (extract) flag. If you look in the source code for TikaCLI, you're looking for the FileEmbeddedDocumentExtractor as your example.

The simplest code would be something like:

final AutoDetectParser parser = new AutoDetectParser();

public class ExtractParser extends AbstractParser {
   private int att = 0;
   public Set<MediaType> getSupportedTypes(ParseContext context) {
     // Everything AutoDetect parser does
     return parser.getSupportedTypes(context);
   }
   public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
      // Stream to a new file
      File f = new File("out-" + (++att) + ".bin");
      FileOutputStream fout = new FileOutputStream(f);
      IOUtils.copy(strea, fout);
      fout.closee();
   }
}

InputStream input = new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, extractParser);
parser.parse(input, handler, metadata, context);

You can also use the EmbeddedDocumentExtractor interface if you'd rather, depends on what you want to do if it's better to use Parser directly



来源:https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!