How can I remove all images/drawings from a PDF file and leave text only in Java?

萌比男神i 2020-12-08 12:15

I have a PDF file that's the output of an OCR processor. The OCR processor recognizes the image and adds the text to the PDF, but at the end it places a low-quality image instead

2 Answers
  • 2020-12-08 12:31

    You need to parse the document as follows:

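    // Imports for the PDFBox 1.8.x API used below.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdfparser.PDFStreamParser;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFOperator;

    public static void strip(String pdfFile, String pdfFileOut) throws Exception {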
    
        PDDocument doc = PDDocument.load(pdfFile);
    
        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ ) {
            PDPage page = (PDPage)pages.get( i );
    
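            // Copy the page's resource dictionary (not the page dictionary itself),
            // so the cleaned copy can be installed as the page resources below.
            COSDictionary newDictionary = new COSDictionary(page.findResources().getCOSDictionary());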
    
            PDFStreamParser parser = new PDFStreamParser(page.getContents());
            parser.parse();
            List tokens = parser.getTokens();
            List newTokens = new ArrayList();
            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );
    
                if( token instanceof PDFOperator ) {
                    PDFOperator op = (PDFOperator)token;
                    if( op.getOperation().equals( "Do") ) {
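                        // Do takes exactly one operand (the XObject name), which was
                        // already copied to newTokens; take it back out.
                        COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                        // Also drop the XObject entry with that name from the copied dictionary.
                        deleteObject(newDictionary, name);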
                        continue;
                    }
                }
                newTokens.add( token );
            }
            PDStream newContents = new PDStream( doc );
            ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
            writer.writeTokens( newTokens );
            newContents.addCompression();
    
            page.setContents( newContents );
    
            // Install the cleaned dictionary as the page's resources.
            PDResources newResources = new PDResources(newDictionary);
            page.setResources(newResources);
        }
    
        doc.save(pdfFileOut);
        doc.close();
    }
    
    
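    // Removes the first entry keyed by name from d or from any nested dictionary.
    // Returns true once an entry has been removed.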
    public static boolean deleteObject(COSDictionary d, COSName name) {
        for(COSName key : d.keySet()) {
            if( name.equals(key) ) {
                d.removeItem(key);
                return true;
            }
            COSBase object = d.getDictionaryObject(key); 
            if(object instanceof COSDictionary) {
                if( deleteObject((COSDictionary)object, name) ) {
                    return true;
                }
            }
        }
        return false;
    }
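
    The filtering step in the loop above does not depend on PDFBox itself: whenever a Do operator shows up, drop it together with the one operand that precedes it. Below is a minimal, library-free sketch of just that step. The class DoTokenFilter and its Op type are hypothetical stand-ins (Op plays the role of PDFOperator, plain strings play the role of operands); they are not part of any PDFBox API.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical stand-in for a parsed content stream: operands are plain
    // Strings, operators are wrapped in Op (playing the role of PDFOperator).
    final class DoTokenFilter {

        /** Minimal operator token; name is the PDF operator keyword, e.g. "Do" or "Tj". */
        static final class Op {
            final String name;

            Op(String name) {
                this.name = name;
            }

            @Override
            public String toString() {
                return name;
            }
        }

        /** Returns a copy of tokens with every "Do" operator and its single operand removed. */
        static List<Object> stripDo(List<Object> tokens) {
            List<Object> out = new ArrayList<>();
            for (Object token : tokens) {
                if (token instanceof Op && ((Op) token).name.equals("Do")) {
                    // The operand (the XObject name) was already copied; take it back out.
                    if (!out.isEmpty()) {
                        out.remove(out.size() - 1);
                    }
                    continue; // skip the operator itself
                }
                out.add(token);
            }
            return out;
        }

        public static void main(String[] args) {
            // Token form of: BT /F1 12 Tf (Hello) Tj ET /Im0 Do
            List<Object> tokens = Arrays.asList(
                    new Op("BT"), "/F1", "12", new Op("Tf"),
                    "(Hello)", new Op("Tj"), new Op("ET"),
                    "/Im0", new Op("Do"));
            System.out.println(stripDo(tokens)); // prints [BT, /F1, 12, Tf, (Hello), Tj, ET]
        }
    }
    ```

    Two things to keep in mind with this approach: any graphics-state operators that surrounded the image (q, cm, Q) are left in the stream, which is harmless, and Do also paints form XObjects, so a page that draws text through a form XObject would lose that text as well.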
    
  • 2020-12-08 12:38

    I used Apache PDFBox in a similar situation.

    To be a little more specific, try something like this:

    import org.apache.pdfbox.exceptions.COSVisitorException;
    import org.apache.pdfbox.exceptions.CryptographyException;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import java.io.IOException;
    
    public class Main {
        public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
            PDDocument document = PDDocument.load("input.pdf");
    
            if (document.isEncrypted()) {
                document.decrypt("");
            }
    
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj :  catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                resources.getImages().clear();
            }
    
            document.save("strippedOfImages.pdf");
            document.close(); // release the resources held by the loaded document
        }
    }
    

    It's supposed to remove all types of images (PNG, JPEG, ...). It should work like this:

    [image]
