Get embedded resources in doc files using Apache Tika
I have MS Word documents containing text and images, and I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse my document to XML; here is my code:

    // AutoDetectParser detects the document type and delegates to the matching parser
    AutoDetectParser parser = new AutoDetectParser();
    InputStream input = new FileInputStream(new File("1.docx"));
    Metadata metadata = new Metadata();
    StringWriter sw = new StringWriter();
    // TransformerHandler receives SAX events and passes them to a Transformer
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer()