Issue with reading some unicode characters out of a PDF using PDFBox

坚强是说给别人听的谎言 submitted on 2019-12-20 02:34:05

Question


I am new to PDFBox. I am reading a PDF file that is in Hindi.
I am having trouble reading some Unicode characters out of the PDF using PDFBox.
I want to copy the text into Java objects so that I can work with it.

There are a couple of things I tried for reading the file.
1. I tried to use PDFTextStripper to read the text from the document, but it prints garbage values and warnings about missing Unicode mappings.

    PDDocument document = PDDocument.load(pathToFile);
    PDFTextStripper s = new PDFTextStripper();
    System.out.println(s.getText(document)); // prints garbage values
    System.out.println(document.getNumberOfPages()); // prints the correct page count
    PDPageTree pages = document.getPages();
    // prints [COSName{TT1}, COSName{TT3}, COSName{TT8}]
    System.out.println(pages.get(0).getResources().getFontNames());
2. I tried to simply extract the contents of the file and write them to another file. To my surprise, it does read some characters (e.g. the text that is selected in the image), but I am not able to read the values that are written in bold.

    private static void extractTextUse(String pdfFile) throws IOException
    {
        ExtractText.main(new String[]{pdfFile, "E:\\try-1.txt"}); 
    }
    

Basically, I want to copy the text into Java objects.

Below are the warnings I get while reading the PDF file in both cases:

Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal
Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold
Sep 05, 2016 10:00:38 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold
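
To see at a glance which fonts are affected, warnings like the ones above can be grouped by font name. The following is only a minimal sketch using the standard library: the class name `UnmappedCidSummary` and the method `summarize` are hypothetical helpers, not part of PDFBox, and the regular expression assumes the warning text has exactly the shape shown in the log above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): groups "No Unicode mapping" warnings by font.
public class UnmappedCidSummary {

    // Assumes the message format seen in the log: "... CID+231 (231) in font JCBMGH+Mangal"
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns, for each font name, the sorted set of CIDs that had no Unicode mapping.
    public static Map<String, SortedSet<Integer>> summarize(List<String> logLines) {
        Map<String, SortedSet<Integer>> byFont = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                int cid = Integer.parseInt(m.group(1));
                byFont.computeIfAbsent(m.group(3), k -> new TreeSet<>()).add(cid);
            }
            // Lines that do not match (e.g. the timestamp lines) are skipped.
        }
        return byFont;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal",
                "WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold",
                "WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold");
        // Print one line per font with the CIDs that could not be mapped.
        summarize(lines).forEach((font, cids) -> System.out.println(font + " -> " + cids));
    }
}
```

Both the regular and the bold Mangal subsets show up in the summary, which matches the observation that the warnings appear with both approaches.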

Source: https://stackoverflow.com/questions/39324398/issue-with-reading-some-unicode-characters-out-of-a-pdf-using-pdfbox
