Question
I have been getting correct text from almost all PDF files written in Japanese with Apache Tika (1.7) and Apache PDFBox (1.8.8). Now I have trouble with one PDF file, which I cannot upload here for business reasons.
Problem
All Japanese characters in one paragraph become "?", while the Japanese characters in the other paragraphs are correct. ASCII characters are correct in every case.
PDF file
All Japanese characters in the PDF document appear to be correct in Adobe Acrobat on my Windows 7 desktop. According to the Adobe Acrobat properties dialog, the PDF document contains several Japanese fonts. I don't know who made this file or how.
- MS-Mincho Type:TrueType(CID) <- several
- HeiseiMin-W3 Type:Type 1(CID) Encoding:UniJIS-UCS2-HW-H Actual Font:KozMinPr6N-Regular Actual Font Type:Type 1(CID)
- MSMincho Type:TrueType(CID) Encoding:UniJIS-UCS2-H Actual Font:MS明朝 Actual Font Type:TrueType
PDF Converter:Acrobat Distiller 7.0(Windows) PDF Version:1.6(Acrobat 7.x)
Findings
The "?"s are produced in PDFStreamEngine (line 492) and are caused by a lookup failure in PDType0Font (line 202). The cmapName of the cmap (a field of the PDFont class) in this situation is "UniJIS-UCS2-HW-H". Looking carefully at the CMap implementation, the isInCodeSpaceRanges method returns true, as it should. In the end, lookupCID fails because char2CIDMappings has no entry and range.map fails in CMap (around line 174). The char[] argument has values such as [48, -120, 48, -118, ...], which look like correct Unicode code points to me.
Is there any workaround? Thanks.
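As a quick sanity check on the values quoted in the findings, here is a minimal sketch that decodes them. It assumes they are signed bytes forming big-endian two-byte code units, which is a guess based on the negative values and the "UCS2" in the CMap name. The class name Ucs2Check and the method decode are made up for illustration. It does not touch PDFBox at all.

```java
import java.nio.charset.StandardCharsets;

public class Ucs2Check {
    // Interpret the raw bytes as big-endian two-byte code units (assumption).
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // First four values quoted in the question: 0x30 0x88 0x30 0x8A.
        byte[] raw = {48, -120, 48, -118};
        for (char c : decode(raw).toCharArray()) {
            // Print each decoded code unit as a Unicode code point.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Under that assumption the four values decode to U+3088 and U+308A, both hiragana. This supports the observation that the input codes are plausible and that the failure lies in the code-to-CID lookup, not in the input.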
Answer 1:
I solved font issues (Chinese, Japanese, Korean, and others) in PDFBox by rendering the text to an image, like this:
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Renders the text with Java2D and draws it on the page as a PNG image (PDFBox 2.x API).
void writeLine(String text, int x, int y, int width, int height,
        Font font, Color color, PDPageContentStream contentStream, PDDocument document) throws IOException {
    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        // Render at 2x resolution so the text stays sharp when drawn at width x height.
        int scale = 2;
        BufferedImage img = new BufferedImage(width * scale, height * scale, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g2d = img.createGraphics();
        g2d.setRenderingHint(RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING, RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_DITHERING, RenderingHints.VALUE_DITHER_ENABLE);
        g2d.setRenderingHint(RenderingHints.KEY_FRACTIONALMETRICS, RenderingHints.VALUE_FRACTIONALMETRICS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g2d.setRenderingHint(RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_SPEED);
        g2d.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL, RenderingHints.VALUE_STROKE_PURE);
        g2d.setFont(font);
        g2d.setColor(color);
        g2d.scale(scale, scale);
        // The baseline sits one ascent below the top edge of the image.
        g2d.drawString(text, 0, g2d.getFontMetrics().getAscent());
        g2d.dispose();
        // No explicit flush or close is needed: try-with-resources closes the stream.
        ImageIO.write(img, "png", baos);
        contentStream.drawImage(PDImageXObject.createFromByteArray(
                document, baos.toByteArray(), ""), x, y, width, height);
    }
}
Source: https://stackoverflow.com/questions/29203976/pdfbox-outputs-question-marks-instead-of-some-japanese-characters