How to write HTML text with Marathi text to PDF document using docx4j?

Question

I am using docx4j to create PDF documents from HTML text. The HTML contains both English and Marathi text. The English text renders correctly in the PDF, but the Marathi text is not displayed.

In place of the Marathi text, the PDF shows square boxes.

Below is the code I am using.

import java.io.FileOutputStream;

import org.docx4j.Docx4J;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class ConvertInXHTMLFragment {

    static String DEST_PDF = "/home/Downloads/Sample.pdf";

    public static void main(String[] args) throws Exception {

        // String content = "<html>Hello</html>";
        String content = "<html>पासवर्ड</html>";

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

        XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);

        wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert(content, null));

        // Close the output stream once the PDF has been written.
        try (FileOutputStream out = new FileOutputStream(DEST_PDF)) {
            Docx4J.toPDF(wordMLPackage, out);
        }
    }

}

EDIT 1:

This is from one of the docx4j samples for PDF output via XSL FO:

import java.io.OutputStream;

import org.docx4j.Docx4J;
import org.docx4j.convert.out.FOSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.model.fields.FieldUpdater;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.samples.AbstractSample;

public class ConvertOutPDFviaXSLFO extends AbstractSample {

    static {
        inputfilepath = "/home/Downloads/100.docx";
        saveFO = true;
    }

    static boolean saveFO;

    public static void main(String[] args) 
            throws Exception {

        try {
            getInputFilePath(args);
        } catch (IllegalArgumentException e) {
            // No path given on the command line; keep the inputfilepath set above.
        }

        String regex = null;
        PhysicalFonts.setRegex(regex);

        WordprocessingMLPackage wordMLPackage;
        System.out.println("Loading file from " + inputfilepath);
        wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        FieldUpdater updater = null;

        Mapper fontMapper = new IdentityPlusMapper();
        wordMLPackage.setFontMapper(fontMapper);

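        // Use an installed font in place of "Mangal". PhysicalFonts.get(...) gives
        // null when the named font was not found, so check before mapping it.
        PhysicalFont font = PhysicalFonts.get("Arial Unicode MS");
        if (font != null) {
            fontMapper.put("Mangal", font);
        } else {
            System.err.println("Arial Unicode MS not found; no mapping added for Mangal");
        }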

        FOSettings foSettings = Docx4J.createFOSettings();
        if (saveFO) {
            foSettings.setFoDumpFile(new java.io.File(inputfilepath + ".fo"));
        }
        foSettings.setWmlPackage(wordMLPackage);

        String outputfilepath;
        if (inputfilepath==null) {
            outputfilepath = System.getProperty("user.dir") + "/OUT_FontContent.pdf";           
        } else {
            outputfilepath = inputfilepath + ".pdf";
        }
        OutputStream os = new java.io.FileOutputStream(outputfilepath);

        Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);

        System.out.println("Saved: " + outputfilepath);

        if (wordMLPackage.getMainDocumentPart().getFontTablePart()!=null) {
            wordMLPackage.getMainDocumentPart().getFontTablePart().deleteEmbeddedFontTempFiles();
        }

        // This would also do it, via finalize() methods
        updater = null;
        foSettings = null;
        wordMLPackage = null;
    }
}

Now I get #### in place of the Marathi text in the output PDF.

Answer 1:

docx4j v3.3 supports PDF output in two completely different ways.

The default is to use Plutext's PDF Converter. This works if the mangal font you linked to is installed in the Converter and specified in the docx:

  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="mangal" w:eastAsia="mangal" w:hAnsi="mangal" w:cs="mangal"/>
    </w:rPr>
    <w:t>पासवर्ड</w:t>
  </w:r>

The same applies to Arial Unicode MS.

The other way is PDF via XSL FO; see https://github.com/plutext/docx4j-export-FO

If you have the relevant font installed it should just work. If you don't, then you need to tell it which font to use.

For example, suppose the docx specifies the mangal font, which I do not have. But I have Arial Unicode MS. So I tell the XSL FO process to use that instead:

fontMapper.put("mangal", PhysicalFonts.get("Arial Unicode MS"));

Note that you need to know which font your docx specifies, and how to make it specify the font you want. For how to do that in XHTML Import, here is the relevant part of my answer to your earlier question:

Fonts are handled by https://github.com/plutext/docx4j-ImportXHTML/blob/master/src/main/java/org/docx4j/convert/in/xhtml/FontHandler.java#L58

Marathi might be relying on one of the other attributes in the RFonts object. You'll need to look at a working docx to see. You can use https://github.com/plutext/docx4j-ImportXHTML/blob/master/src/main/java/org/docx4j/convert/in/xhtml/FontHandler.java#L54 to inject a suitable font mapping.
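
Marathi is written in the Devanagari script, so before adjusting any font mapping it can help to confirm that the input really contains Devanagari code points. The following is a minimal sketch using only the Java standard library. It does not touch docx4j, and the class and method names are hypothetical, chosen for illustration:

```java
final class DevanagariCheck {

    /** Returns true if any code point in the text belongs to the Devanagari script. */
    static boolean containsDevanagari(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.DEVANAGARI);
    }

    public static void main(String[] args) {
        // The sample word from the question, written with Unicode escapes so the
        // source compiles regardless of the file encoding.
        String marathi = "\u092A\u093E\u0938\u0935\u0930\u094D\u0921";

        System.out.println(containsDevanagari("<html>" + marathi + "</html>"));
        System.out.println(containsDevanagari("<html>Hello</html>"));
    }
}
```

If the check reports Devanagari text, the font that ends up on those runs needs Devanagari glyphs (for example mangal or Arial Unicode MS, as discussed above); otherwise the PDF shows boxes or #### instead.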

Source: https://stackoverflow.com/questions/44262279/how-to-write-html-text-with-marathi-text-to-pdf-document-using-docx4j

Tags

java

pdf

docx4j