Convert .docx to HTML using JAVA

拟墨画扇 提交于 2019-11-30 15:16:40

问题


I tried converting .doc to HTML by using WordToHtmlConverter and it worked perfectly.

But when i tried to convert .docx to HTML, i got stuck with it.

What i tried:

I used the below code to convert .docx to HTML:

The code which i tried from : How to use Tika's XWPFWordExtractorDecorator class?

        InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx"));


        Parser parser = new AutoDetectParser();


        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory = (SAXTransformerFactory)
                 SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.setResult(new StreamResult(sw));


        try {
            Metadata metadata = new Metadata();
            parser.parse(input, handler, metadata, new ParseContext());
            String xml = sw.toString();
            System.out.print("tika : "+xml); 
        } finally {
            input.close();
        }

The output what i got is,

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
  • Please explain where i gone wrong?
  • Is there any better way to convert .docx to html string

Appreciate your help, Thanks


回答1:


This code worked for me to convert .docx to html:

You can also look at the link : Link to code

       //convert .docx to HTML string
        InputStream in= new FileInputStream(new File(path));
        XWPFDocument document = new XWPFDocument(in);


        XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("word/media")));

        OutputStream out = new ByteArrayOutputStream();


        XHTMLConverter.getInstance().convert(document, out, options);
        String html=out.toString();
        System.out.println(html);



回答2:


You may want to make use of Mammoth docx to HTML library.Its a library for displaying doc, docx documents by converting them to html on the browser side as well as can be handled on the backend.

  • Library Supports - JavaScript, both the browser and node.js. Available on npm. Python. Available on PyPI. WordPress. Java/JVM. Available on Maven Central. .NET. Available on NuGet.
  • Link: https://mike.zwobble.org/projects/mammoth/ (Demo and Article)
  • Github: https://github.com/mwilliamson/mammoth.js


来源:https://stackoverflow.com/questions/24652953/convert-docx-to-html-using-java

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!