Encoding issue with Apache POI converter

Question

I have an MS Word .doc file that I'm converting to an HTML document using Apache POI.

This is the code I'm running:

    InputStream input = new FileInputStream (path);
    HWPFDocument wordDocument = new HWPFDocument (input);            
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter (DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument() );

    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
    if (pics != null) 
    {
        for (int i = 0; i <pics.size(); i++) 
        {
            Picture pic = (Picture) pics.get (i);
            try 
            {
                pic.writeImageContent (new FileOutputStream (path + pic.hashCode() + '.' + pic.suggestFileExtension()) );
            }
            catch (FileNotFoundException e) 
            {
                e.printStackTrace();
            }
        }
    }

    wordToHtmlConverter.setPicturesManager (new PicturesManager() 
    {               
        public String savePicture (byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) 
        {
            for(Picture picName:pics)
            {
                return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
            }

            return null;
        }
    });

    wordToHtmlConverter.processDocument(wordDocument);                       
    Document htmlDocument = wordToHtmlConverter.getDocument();                        
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult (outStream);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty (OutputKeys.ENCODING, "gbk");
    serializer.setOutputProperty (OutputKeys.INDENT, "yes");
    serializer.setOutputProperty (OutputKeys.METHOD, "html");
    serializer.transform (domSource, streamResult);
    outStream.close();

    String html = new String (outStream.toByteArray());

The code works fine: it preserves images and styles. However, some characters in the HTML are not rendered properly. For instance, some of the bullet point styles in the original .doc file do not come out correctly. I've tried multiple character sets (ASCII, UTF-8, GBK, ...), and none of them produces the bullet points correctly.

I'm 99 percent sure the bullets show up as gibberish because of the encoding. Has anyone come across a problem like this with Apache POI?

Answer 1:

This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. For example, the first-level bullet is a bullet from the font "Symbol", the second-level bullet is a circle (the letter "o") from the font "Courier New", and the third-level bullet is a square from the font "Wingdings".
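To see which characters a given bullet actually contains, it can help to print its code points. The following is a minimal sketch using only the JDK; the class and method names are made up for illustration. It assumes, as the replacement codes used in this answer suggest, that characters from symbol fonts arrive in the private use area at 0xF000 plus the ANSI code (for example 0xB7 in "Symbol" becomes U+F0B7).

```java
public class BulletCodePoints {

    // Format every code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Hypothetical helper: the private-use character assumed for a
    // symbol-font ANSI code (0xF000 + code).
    static char symbolFontChar(int ansiCode) {
        return (char) (0xF000 + ansiCode);
    }

    public static void main(String[] args) {
        // A "Symbol" bullet followed by a non-breaking space.
        System.out.println(describe("\uF0B7\u00A0"));             // U+F0B7 U+00A0
        // The "Courier New" bullet is just the letter o.
        System.out.println(describe("o"));                        // U+006F
        System.out.println(symbolFontChar(0xB7) == '\uF0B7');     // true
    }
}
```

Calling `describe(bulletText)` inside an overridden `processParagraph` should show which code points need a Unicode replacement for bullet styles other than the three defaults.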

So the easiest approach is to replace the ANSI codes in the bullet text with their Unicode equivalents. Once that is done, we can use UTF-8 for the HTML.

Example:

Word WordBulletList.doc:

Java:

import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.FontReplacer;
import org.apache.poi.hwpf.converter.FontReplacer.Triplet;

import org.w3c.dom.Document;

import java.awt.Desktop;

public class TestWordToHtmlConverter {

 public static void main(String[] args) throws Exception {

  Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {

   protected void processParagraph(HWPFDocumentCore hwpfDocument, 
                                   org.w3c.dom.Element parentElement, 
                                   int currentTableLevel, 
                                   Paragraph paragraph, 
                                   java.lang.String bulletText) {
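    // Compare string content, not references; skip paragraphs without bullet text.
    if (bulletText != null && !bulletText.isEmpty()) {
     //System.out.println((int)bulletText.charAt(0));
     // "Symbol" bullet -> U+2022 BULLET
     bulletText = bulletText.replace("\uF0B7", "\u2022");
     // "Courier New" letter o -> indented U+26AA circle
     bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
     // "Wingdings" square -> indented U+25AA small square
     bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
    }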

    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
   }

  };

  wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

  StringWriter stringWriter = new StringWriter();
  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
  transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
  transformer.setOutputProperty( OutputKeys.METHOD, "html" );
  transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

  String html = stringWriter.toString();

  // Write the file as UTF-8 so it matches the encoding given to the transformer.
  try (PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
    out.println(html);
  }

  File htmlFile = new File("WordBulletList.html");
  Desktop.getDesktop().browse(htmlFile.toURI());

 }
}

HTML:

...
<body class="b1 b2">
<p class="p1">
<span>Word bullet list:</span>
</p>
<p class="p2">
<span class="s1">&bull;&nbsp;</span><span>Bullet1</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪&nbsp;</span><span>Bullet3</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&bull;&nbsp;</span><span>Bullet1</span>
</p>
<p class="p1">
<span>End</span>
</p>
</body>
...

Answer 2:

Problem solved.

I finally found a way to resolve this particular problem. The answer was inspired by @pawelini1 and his own question, "Encoding issue with Apache POI".

The solution is simple: all I did was use URLEncoder/URLDecoder on my HTML string:

String html = URLEncoder.encode(new String(outStream.toByteArray(), "UTF-8"), "UTF-8");
String decoded = URLDecoder.decode(html, "UTF-8");

Now my web page displays properly.
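A note on why this may work: as far as I can tell, URL-encoding a string and immediately decoding it with the same charset returns the string unchanged, so the part that probably matters is the explicit "UTF-8" passed to `new String(...)` instead of relying on the platform default. The sketch below uses only the JDK to illustrate that; the class and method names are made up, and it assumes the serializer was also set to UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Decode bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // URL-encode and immediately decode again, as in the snippet above.
    static String urlRoundTrip(String s) {
        try {
            return URLDecoder.decode(URLEncoder.encode(s, "UTF-8"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u2022\u00A0Bullet1";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Matching charsets on both sides restore the text.
        System.out.println(decodeUtf8(bytes).equals(original));      // true
        // The URL round trip leaves the string as it was.
        System.out.println(urlRoundTrip(original).equals(original)); // true
        // A mismatched charset is what produces gibberish.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original));                  // false
    }
}
```

If that reading is right, `new String(outStream.toByteArray(), "UTF-8")` on its own, with the transformer's output encoding also set to UTF-8, should give the same result.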

Source: https://stackoverflow.com/questions/41829890/encoding-issue-with-apache-poi-converter

Tags

encoding

apache-poi

converter