Extract unselectable content from PDF

与世无争的帅哥 提交于 2019-12-21 20:04:33

问题


I'm using Apache PDFBox to extract pages from PDF files and I can't find a way to extract content that is unselectable (either text or images). With content that is selectable from within the PDF files there is no problem.

Note that the PDFs in question dont have any restrictions regarding copying content, at least from what I saw on the files's "Document Restrictions Summary": they all have "Content Copying" and "Content Copying for Accessbility" allowed! On the same PDF file there is content that is selectable and other parts that aren't. What happens is that, the extracted pages come with "holes", i.e., they only have the selectable parts of the PDF. On MS Word though, if I add the PDFs as objects, the whole content of the PDF pages appear! So I was hoping to do the same with PDFBox lib or any other Java lib for that matter!

Here is the code I'm using to convert PDF pages to images:

private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
   PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
   List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
   for (PDPage pdPage : pdPages) { 
       BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
       ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
   }
   document.close();
}

Is there a way to extract unselectable content from an PDF with this Apache PDFBox library (or with any of the other similar libraries)? Or this is not possible at all? And if indeed it's not, why?

Much appreciated for any help!

EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf


回答1:


The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.

Adobe Reader does not allow to select contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.

PDFBox 1.8.10

You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.size(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }

    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
    image.write2OutputStream(new FileOutputStream(String.format(imageFormat, "", image.getSuffix())));
}

(ExtractPatternImages.java)

I applied it to your sample PDF like this

public void testtestDrJorge() throws IOException
{
    try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
    {
        PDDocument document = PDDocument.load(resource);
        extractPatternImages(document, "testDrJorge%s.%s");;
    }
}

(ExtractPatternImages.java)

and got two images:

  • `testDrJorge-0-R15-R14.png

  • testDrJorge-0-R38-R37.png

The images have lost their red parts. This most likely is dues to the fact that PDFBox version 1.x.x do not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.

PDFBox 2.0.0 release candidate

I updated the code to PDFBox 2.0.0 (currently available as release candidate only):

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    PDPageTree pages = document.getDocumentCatalog().getPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.getCount(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }

    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
    String filename = String.format(imageFormat, "", image.getSuffix());
    ImageIOUtil.writeImage(image.getOpaqueImage(), "png", new FileOutputStream(filename));
}

and get

  • testDrJorge-0-COSName{R15}-COSName{R14}.png

  • testDrJorge-0-COSName{R38}-COSName{R37}.png

Looks like an improvement... ;)



来源:https://stackoverflow.com/questions/34667268/extract-unselectable-content-from-pdf

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!