How to get font color using pdfbox

迷失自我 2020-12-09 22:10

I am trying to extract text with all its information from a PDF using pdfbox. I got all the information I want, except color. I tried different ways to get the font color (incl

4 Answers
  • 2020-12-09 22:48

    I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with; maybe I could store it somewhere outside SO? My PDF used a pale blue color, and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.

    A couple of suggestions:

    1. Recall that the value returned is a float between 0 and 1. If a value is accidentally cast to int, nearly all of the values will end up as 0. The linked code multiplies by 255 to get a range of 0 to 255.
    2. As the commenter said, the most common color in a PDF file is black, which is 0 0 0.
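
    The casting pitfall in suggestion 1 can be illustrated with a small sketch in plain Java. The names `ColorScale` and `toByte` are hypothetical and not part of pdfbox, and the sample components are simply assumed values close to the results above divided by 255:

```java
// Hypothetical helper, not part of pdfbox: scales a PDF color component
// from the [0, 1] range to the [0, 255] range.
class ColorScale {

    // Clamp first so slightly out-of-range inputs still give a valid value,
    // then scale and round to the nearest integer.
    static int toByte(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    public static void main(String[] args) {
        float[] paleBlue = {0.584f, 0.702f, 0.843f}; // assumed sample values
        for (float c : paleBlue) {
            // Casting before scaling truncates every value below 1 to 0.
            System.out.println("cast first: " + (int) c + ", scaled: " + toByte(c));
        }
    }
}
```

    Rounding instead of truncating after the multiplication is a design choice here; the linked code may simply print the unrounded product.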

    That is all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox and, like I said, I pretty much followed the link you gave.

    EDIT

    Based upon my comments, here is perhaps a minimally invasive way of doing it for PDF files like color.pdf.

    In PDFStreamEngine.java, inside the try block of the processOperator method, one can add:

    if (operation.equals("RG")) {
        // RG: sets the stroking color (DeviceRGB)
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("rg")) {
        // rg: sets the non-stroking (fill) color (DeviceRGB)
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("BT")) {
        // BT: begin text object
        System.out.println(operation);
    } else if (operation.equals("ET")) {
        // ET: end text object
        System.out.println(operation);
    }

    This will show you the information; it is then up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf:

    BT
    rg
    [COSInt(1), COSInt(0), COSInt(0)]
    RG
    [COSInt(1), COSInt(0), COSInt(0)]
    ET
    BT
    ET
    BT
    rg
    [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
    RG
    [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
    ET
    ......

    In the above output you can see an empty BT ET section; this is a section that is marked DEVICEGRAY. All the others give you values in [0, 1] for the R, G and B components.
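
    One way to process such a stream of operators is to remember the most recent color and read it whenever text is shown. The sketch below is plain Java and does not use pdfbox; `ColorTracker` and its methods are hypothetical names. It assumes text is drawn in the default fill rendering mode, so the non-stroking color applies, and it also handles the gray operators `g` and `G`, which the snippet above does not print:

```java
import java.util.Arrays;

// Hypothetical helper, not part of pdfbox: remembers the latest fill and
// stroke colors as RGB triples in the [0, 1] range.
class ColorTracker {
    private float[] nonStroking = {0f, 0f, 0f}; // PDF initial color is black
    private float[] stroking = {0f, 0f, 0f};

    void apply(String operator, float... operands) {
        switch (operator) {
            case "rg": nonStroking = rgb(operands); break;
            case "RG": stroking = rgb(operands); break;
            case "g":  nonStroking = gray(operands); break;
            case "G":  stroking = gray(operands); break;
            default:   break; // BT, ET and everything else keep the colors
        }
    }

    // Color that filled text would be painted with at this point.
    float[] textColor() {
        return nonStroking.clone();
    }

    float[] strokeColor() {
        return stroking.clone();
    }

    private static float[] rgb(float[] v) {
        if (v.length != 3) {
            throw new IllegalArgumentException("expected 3 operands");
        }
        return v.clone();
    }

    private static float[] gray(float[] v) {
        if (v.length != 1) {
            throw new IllegalArgumentException("expected 1 operand");
        }
        return new float[] {v[0], v[0], v[0]}; // gray level as equal R, G, B
    }

    public static void main(String[] args) {
        ColorTracker tracker = new ColorTracker();
        tracker.apply("BT");
        tracker.apply("rg", 1f, 0f, 0f);
        tracker.apply("RG", 1f, 0f, 0f);
        System.out.println(Arrays.toString(tracker.textColor()));
        tracker.apply("ET");
    }
}
```

    Because the color is part of the graphics state, it carries over from one BT ET section to the next until another color operator replaces it.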

  • 2020-12-09 22:50

    I also ended up doing something like this. Pasting code below, hope it helps someone.

    import java.awt.Color;
    import java.io.IOException;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.ResourceLoader;
    import org.apache.pdfbox.util.TextPosition;

    // Uses the pdfbox 1.x API (org.apache.pdfbox.util.PDFTextStripper).
    public class Parser extends PDFTextStripper {

        public Parser() throws IOException {
            // Load the PageDrawer operator set so that color operators are processed.
            super(ResourceLoader.loadProperties(
                    "org/apache/pdfbox/resources/PageDrawer.properties", true));
            super.setSortByPosition(true);
        }

        public void parse(String path) throws IOException {
            PDDocument doc = PDDocument.load(path);
            try {
                List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
                for (PDPage page : pages) {
                    this.processStream(page, page.getResources(), page.getContents().getStream());
                }
            } finally {
                doc.close(); // release the file even if processing fails
            }
        }

        // Called for each piece of text; prints the current non-stroking (fill) color.
        @Override
        protected void processTextPosition(TextPosition text) {
            try {
                PDGraphicsState graphicsState = getGraphicsState();
                Color color = graphicsState.getNonStrokingColor().getJavaColor();
                System.out.println("R = " + color.getRed());
                System.out.println("G = " + color.getGreen());
                System.out.println("B = " + color.getBlue());
            } catch (IOException ioe) {
                // Report the problem instead of silently swallowing it.
                System.err.println("Could not read color: " + ioe.getMessage());
            }
        }

        public static void main(String[] args) throws IOException {
            Parser p = new Parser();
            p.parse("/Users/apple/Desktop/123.pdf");
        }
    }

    
  • 2020-12-09 23:03

    I found some code in one of my maintenance programs.
    I do not know whether it will work for you, but please try it. Also check out this link: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/common/class-use/PDStream.html

    It may help you.

    PDDocument doc = null;
    try {
        doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
        // PageDrawer.properties registers the color operators.
        PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
        PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        engine.processStream(page, page.findResources(), page.getContents().getStream());
        // Graphics state as it stands after the first page has been processed.
        PDGraphicsState graphicState = engine.getGraphicsState();
        System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
        float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
        for (float c : colorSpaceValues) {
            System.out.println(c * 255); // scale from [0, 1] to [0, 255]
        }
    }
    finally {
        if (doc != null) {
            doc.close();
        }
    }

  • 2020-12-09 23:03

    With pdfbox version 2.0+ it is necessary to add these operators in the constructor of your overridden PDFTextStripper:

    addOperator(new SetStrokingColorSpace());
    addOperator(new SetNonStrokingColorSpace());
    addOperator(new SetStrokingDeviceCMYKColor());
    addOperator(new SetNonStrokingDeviceCMYKColor());
    addOperator(new SetNonStrokingDeviceRGBColor());
    addOperator(new SetStrokingDeviceRGBColor());
    addOperator(new SetNonStrokingDeviceGrayColor());
    addOperator(new SetStrokingDeviceGrayColor());
    addOperator(new SetStrokingColor());
    addOperator(new SetStrokingColorN());
    addOperator(new SetNonStrokingColor());
    addOperator(new SetNonStrokingColorN());
    

    Only then will getGraphicsState() return proper information.

    See https://pdfbox.apache.org/2.0/migration.html
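
    Once the graphics state carries color information, the color still has to be turned into something usable. As far as I know, the color object in pdfbox 2.x can be converted to a packed RGB integer through `toRGB()`, but treat that as an assumption and check it against your version. The sketch below is plain Java without pdfbox; `PackedRgb`, `unpack` and `toHex` are hypothetical names, and the sample value is assumed:

```java
// Hypothetical helper, not part of pdfbox: splits a packed 0xRRGGBB integer
// into its red, green and blue components (each 0 to 255).
class PackedRgb {

    static int[] unpack(int rgb) {
        return new int[] {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
    }

    // Formats the color the way CSS and most color pickers show it.
    static String toHex(int rgb) {
        int[] c = unpack(rgb);
        return String.format("#%02X%02X%02X", c[0], c[1], c[2]);
    }

    public static void main(String[] args) {
        // Assumption: in real code this value would come from something like
        // getGraphicsState().getNonStrokingColor().toRGB() inside the stripper.
        int packed = 0x92D050; // assumed sample value
        int[] c = unpack(packed);
        System.out.println("R = " + c[0] + ", G = " + c[1] + ", B = " + c[2]);
        System.out.println(toHex(packed));
    }
}
```

    Masking each component with 0xFF keeps the result correct even if the upper byte of the integer happens to be set.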
