PDFBox 2.0.7 ExtractText not working but 1.8.13 does and PDFReader as well

The information about the font in question in your PDF is contradictory and partially broken. Depending on how a given piece of software reacts to that, it may or may not extract the text correctly.


On the one hand, the font has an Encoding value of WinAnsiEncoding. This is fine and matches what we see in the content stream: a one-byte encoding covering many of the ANSI codes.

On the other hand, we have a ToUnicode map which implies that the underlying encoding is a two-byte encoding (its code space range is <0000> <ffff>). Even if one ignores the two-byte nature, it contains mappings which, among other things, map digit ANSI codes to uppercase letters, uppercase letter ANSI codes to different lowercase letters, and the lowercase 'l' ANSI code to the Unicode value of 'ä'.

When extracting text, PDFBox 2.0.x seems to follow the broken ToUnicode map (interpreting the two-byte codes in the table as one-byte codes, ignoring the leading zero byte) where possible, resulting in garbage, and otherwise to interpret the character code as ANSI, resulting in proper text. PDFBox 1.8.x appears to have ignored the ToUnicode map, and so does Adobe Reader.


Actually it looks like the ToUnicode map has been made for a font using Identity-H encoding.
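
If you want to check a PDF of your own for this kind of mismatch, a small inspection sketch along the following lines can help; it is my own addition (not part of the original test code), reuses the SOURCE placeholder from the snippets below, and dumps the Encoding entry and the presence of a ToUnicode entry for each page-level font:

try (PDDocument document = PDDocument.load(SOURCE))
{
    for (PDPage page : document.getPages())
    {
        PDResources resources = page.getResources();
        if (resources == null)
            continue;
        for (COSName fontName : resources.getFontNames())
        {
            // Look at the raw COS dictionary of the font
            COSDictionary font = resources.getFont(fontName).getCOSObject();
            COSBase encoding = font.getDictionaryObject(COSName.ENCODING);
            COSBase toUnicode = font.getDictionaryObject(COSName.TO_UNICODE);
            System.out.printf("%s: Encoding=%s, ToUnicode present=%s%n",
                    fontName.getName(), encoding, toUnicode != null);
        }
    }
}

For the file at hand this should report WinAnsiEncoding together with a present ToUnicode entry for the font in question, matching the contradiction described above.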


If you are confronted with such a PDF and need to extract its text, you can pre-process it and remove the ToUnicode entries; thereafter, text extraction should return proper text. E.g.:

PDDocument document = PDDocument.load(SOURCE);

// Strip the broken ToUnicode maps from the fonts of every page
for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
{
    PDPage page = document.getPage(pageNr);
    PDResources resources = page.getResources();
    removeToUnicodeMaps(resources);
}

// Extraction now falls back to the WinAnsiEncoding information of the fonts
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

(ExtractText test method testNoToUnicodeTest2)

using helper methods

void removeToUnicodeMaps(PDResources pdResources) throws IOException
{
    // Pages and form XObjects without a resource dictionary have nothing to fix
    if (pdResources == null)
        return;

    COSDictionary resources = pdResources.getCOSObject();

    // Drop the ToUnicode entry from every font of this resource dictionary
    COSDictionary fonts = asDictionary(resources, COSName.FONT);
    if (fonts != null)
    {
        for (COSBase object : fonts.getValues())
        {
            // Resolve indirect references to reach the actual font dictionary
            while (object instanceof COSObject)
                object = ((COSObject) object).getObject();
            if (object instanceof COSDictionary)
            {
                COSDictionary font = (COSDictionary) object;
                font.removeItem(COSName.TO_UNICODE);
            }
        }
    }

    // Recurse into form XObjects, which bring along their own resources
    for (COSName name : pdResources.getXObjectNames())
    {
        PDXObject xobject = pdResources.getXObject(name);
        if (xobject instanceof PDFormXObject)
        {
            PDResources xobjectPdResources = ((PDFormXObject) xobject).getResources();
            removeToUnicodeMaps(xobjectPdResources);
        }
    }
}

// Returns the dictionary value for the given name, or null if it is absent or of another type
COSDictionary asDictionary(COSDictionary dictionary, COSName name)
{
    COSBase object = dictionary.getDictionaryObject(name);
    return object instanceof COSDictionary ? (COSDictionary) object : null;
}

(from ExtractText)
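
To tie the pieces together, here is a sketch of my own (the file names are placeholders, not part of the original test) that removes the ToUnicode maps right after loading, extracts the text, and additionally saves the repaired document so that other tools can profit from the fix as well:

try (PDDocument document = PDDocument.load(new File("broken-tounicode.pdf")))
{
    // Remove the broken ToUnicode maps before any text extraction takes place
    for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
    {
        removeToUnicodeMaps(document.getPage(pageNr).getResources());
    }

    // Text extraction on the repaired in-memory document
    String text = new PDFTextStripper().getText(document);
    System.out.println(text);

    // Optionally persist the repaired copy for downstream consumers
    document.save(new File("repaired.pdf"));
}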

You should execute this pre-processing as early as possible after loading the document, to prevent fonts with the wrong ToUnicode mappings from being read into the document font cache.
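
To illustrate the point (a hedged sketch of what likely happens, not code from the original test): if you extract text first and only remove the ToUnicode entries afterwards, the already cached font objects presumably keep their broken maps for the lifetime of that document instance:

try (PDDocument document = PDDocument.load(SOURCE))
{
    // First extraction pass: the fonts, including their broken ToUnicode maps,
    // are parsed and put into the document resource cache here.
    String garbled = new PDFTextStripper().getText(document);

    // Removing the entries now only changes the COS dictionaries; the cached
    // font objects most likely still carry the old mappings.
    for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
        removeToUnicodeMaps(document.getPage(pageNr).getResources());

    // A second pass may therefore return the same garbage as the first one.
    String probablyStillGarbled = new PDFTextStripper().getText(document);
}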
