How to handle special characters when converting from HTML to DocX

ε祈祈猫儿з submitted on 2019-12-01 10:20:57

Question


I have an application that converts HTML files to DocX using DocX4J. I'm having problems with special characters such as ç, á, é, í and ã. The text font in my HTML files is Arial, but after conversion to DocX the special characters mentioned above are set in Calibri. So within a single word (e.g. Cláudio), "Cl" is written in Arial, the "á" character in Calibri, and "udio" in Arial.

I saw that I may have to set the font property on w:r, but I'm having difficulty seeing how to do that for every run of the converted text. I also can't see where to do it in my conversion code, which is listed below (with a sample HTML file).

Any tip or suggestion about how to do this conversion and handle those special characters would be really great.

Cheers.

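import java.util.List;

import org.docx4j.convert.in.xhtml.XHTMLImporter;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public WordprocessingMLPackage export(String xhtml) {
    WordprocessingMLPackage wordMLPackage = null;
    try {
        // Start from an empty document package.
        wordMLPackage = WordprocessingMLPackage.createPackage();
        XHTMLImporter importer = new XHTMLImporterImpl(wordMLPackage);
        // Convert the XHTML string (no base URL) into WordprocessingML content.
        List<Object> content = importer.convert(xhtml, null);
        wordMLPackage.getMainDocumentPart().getContent().addAll(content);
    } catch (Docx4JException e) {
        // ...
    }
    return wordMLPackage;
}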

<html>
<head>
<meta charset="ISO-8859-1" />
<style type="text/css">
h1 {
    page-break-before: always;
}

p, h1 {
    font-family: Arial;
    font-size: 12pt;
}

p {
    line-height: 150%;
}

h1 {
    font-weight: bold;
    line-height: 130%
}
</style>
</head>
<body>
    <h1>RESUMO<br /></h1>
    <p>
        <span>Um resumo para o relatório.</span><br />
    </p>
</body>
</html>

Answer 1:


Following the tip given by JasonPlutext, I found an example of how to register a font mapping with the XHTMLImporter on the DocX4J forum (http://www.docx4java.org/forums/docx-java-f6/docx-to-html-and-back-to-docx-t1913.html).

Now my code is working! See the final version below.


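import java.util.List;

import org.docx4j.convert.in.xhtml.XHTMLImporter;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.RFonts;

public WordprocessingMLPackage export(String xhtml) {
    WordprocessingMLPackage wordMLPackage = null;
    try {
        // Map the CSS font-family "Arial" to Arial for both the ASCII
        // and the "high ANSI" (e.g. accented Latin) character ranges.
        RFonts arialRFonts = Context.getWmlObjectFactory().createRFonts();
        arialRFonts.setAscii("Arial");
        arialRFonts.setHAnsi("Arial");
        XHTMLImporterImpl.addFontMapping("Arial", arialRFonts);

        wordMLPackage = WordprocessingMLPackage.createPackage();
        XHTMLImporter importer = new XHTMLImporterImpl(wordMLPackage);
        List<Object> content = importer.convert(xhtml, null);
        wordMLPackage.getMainDocumentPart().getContent().addAll(content);
    } catch (Docx4JException e) {
        // ...
    }
    return wordMLPackage;
}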
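
A likely explanation for why setting both setAscii and setHAnsi matters: in WordprocessingML, a run's w:rFonts element carries separate font slots, and characters in U+0000 to U+007F take the w:ascii font while other Latin characters such as "á" take the w:hAnsi font. If only the ASCII slot resolves to Arial, the accented letters would fall back to the document default, which matches the Calibri behaviour described in the question. The sketch below is only an illustration of that split using plain Java; it does not call docx4j, it ignores the w:eastAsia and w:cs slots, and the class and method names (FontSlotDemo, slotFor, segments) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class FontSlotDemo {

    // Simplified rule (assumption for illustration): U+0000..U+007F uses the
    // w:ascii slot, everything else uses w:hAnsi. The real rules also involve
    // the w:eastAsia and w:cs slots, which are ignored here.
    static String slotFor(int codePoint) {
        return codePoint <= 0x7F ? "ascii" : "hAnsi";
    }

    // Splits the text into consecutive segments whose characters share a slot.
    // Each entry has the form "<slot>:<text>".
    static List<String> segments(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentSlot = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String slot = slotFor(cp);
            // A slot change closes the segment collected so far.
            if (currentSlot != null && !slot.equals(currentSlot)) {
                result.add(currentSlot + ":" + current);
                current.setLength(0);
            }
            currentSlot = slot;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (currentSlot != null) {
            result.add(currentSlot + ":" + current);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Cl\u00E1udio" is the word "Cláudio" from the question, written
        // with an escape so the source file encoding does not matter.
        System.out.println(segments("Cl\u00E1udio"));
    }
}
```

For the word from the question this yields three segments: "Cl" in the ascii slot, "á" in the hAnsi slot, and "udio" in the ascii slot, which mirrors the Arial / Calibri / Arial pattern reported above.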


Source: https://stackoverflow.com/questions/29607496/how-to-handle-special-characters-when-converting-from-html-to-docx
