Superscript and subscript differentiation using PDFBox

Question

I am new to PDFBox. Is there any way to differentiate superscript and subscript text from normal text, either while extracting text from a PDF with the PDFBox library or afterwards? Thanks.

Answer 1:

See if the PrintTextLocations example helps; it shows how to read the position and size of each piece of extracted text:

https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java

Answer 2:

I was able to identify most superscripts by looking for changes in Y position and height. Try this:

Write your own implementation of PDFTextStripper.

Add this to writePage() to convert superscripts into separate words:

    // getY() grows downward: smaller and higher starts a superscript, larger and lower ends one.
    TextPosition last = lastPosition.getTextPosition();
    boolean entersSuperscript = position.getY() < last.getY()
            && position.getHeight() < last.getHeight();
    boolean leavesSuperscript = position.getY() > last.getY()
            && position.getHeight() > last.getHeight();
    if (entersSuperscript || leavesSuperscript) {
        line.add(WordSeparator.getSeparator());
    }
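The same Y and height comparison can be tried outside PDFBox. The following is only a sketch: Glyph is a hypothetical stand-in for the two values read from TextPosition (getY() and getHeight()), the EPS tolerance is an assumed value that would need tuning per document, and the subscript branch is an extension that the answer above does not cover. It assumes, as with TextPosition.getY(), that Y grows downward.

```java
/** Sketch: classify a glyph against a baseline glyph using Y and height only. */
final class ScriptClassifier {

    /** Hypothetical stand-in for the values read from a PDFBox TextPosition. */
    static final class Glyph {
        final String text;
        final float y;      // distance from the top of the page (grows downward)
        final float height; // glyph height
        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    enum Script { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    // Assumed tolerance so that rounding noise is not mistaken for a shift.
    private static final float EPS = 0.5f;

    /** Returns how cur sits relative to base: raised and smaller, lowered and smaller, or neither. */
    static Script classify(Glyph base, Glyph cur) {
        boolean smaller = cur.height < base.height - EPS;
        if (!smaller) {
            return Script.NORMAL;
        }
        if (cur.y < base.y - EPS) {
            return Script.SUPERSCRIPT; // smaller Y means higher on the page
        }
        if (cur.y > base.y + EPS) {
            return Script.SUBSCRIPT;   // larger Y means lower on the page
        }
        return Script.NORMAL;          // smaller but on the same line, e.g. small caps
    }
}
```

Requiring both a size change and a vertical shift mirrors the condition in the answer; checking either one alone tends to misfire on font changes or on ordinary line breaks.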

Then add this to writeLine() to write a tag before and after superscripts:

    // prevY and prevHeight are assumed to be float variables declared outside the loop over words.
    if (word.getTextPositions().size() > 0)
    {
        TextPosition firstChar = word.getTextPositions().get(0);

        // The first word of the line serves as the baseline reference.
        if (i == 0)
        {
            prevY = firstChar.getY();
            prevHeight = firstChar.getHeight();
        }

        if (prevY != 0)
        {
            if (firstChar.getY() < prevY && firstChar.getHeight() < prevHeight)
            {
                // Higher and smaller than the baseline: a superscript starts here.
                output.write("<sup>");
            }
            else if (firstChar.getY() > prevY && firstChar.getHeight() > prevHeight)
            {
                // Lower and larger than the reference: the superscript has ended.
                output.write("</sup>");
            }
        }

        // The word itself is written in every case.
        writeString(word.getText(), word.getTextPositions());
    }
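For experimenting with the tagging logic without a PDF at hand, here is a self-contained sketch. It is not the answer's code: Glyph is a hypothetical stand-in for the text, getY() and getHeight() of a TextPosition, EPS is an assumed tolerance, and the baseline is updated on every normal glyph rather than fixed at the first word. It also emits <sub> tags, which the snippet above does not attempt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Sketch: wrap raised or lowered small glyphs in <sup> or <sub> tags. */
final class ScriptTagger {

    /** Hypothetical stand-in for the values read from a PDFBox TextPosition. */
    static final class Glyph {
        final String text;
        final float y;      // grows downward, as TextPosition.getY() does
        final float height;
        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    // Assumed tolerance; tune for the fonts in the document.
    private static final float EPS = 0.5f;

    static String tag(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;  // most recent glyph treated as normal text
        String open = null; // currently open tag name, or null
        for (Glyph g : glyphs) {
            String tag = null;
            if (base != null && g.height < base.height - EPS) {
                if (g.y < base.y - EPS) {
                    tag = "sup";
                } else if (g.y > base.y + EPS) {
                    tag = "sub";
                }
            }
            // Close the old tag and open the new one only when the state changes.
            if (!Objects.equals(tag, open)) {
                if (open != null) {
                    out.append("</").append(open).append(">");
                }
                if (tag != null) {
                    out.append("<").append(tag).append(">");
                }
                open = tag;
            }
            if (tag == null) {
                base = g; // normal glyphs move the baseline along
            }
            out.append(g.text);
        }
        if (open != null) {
            out.append("</").append(open).append(">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("x", 100f, 10f), new Glyph("2", 96f, 6f),
                new Glyph(" ", 100f, 10f), new Glyph("H", 100f, 10f),
                new Glyph("2", 103f, 6f), new Glyph("O", 100f, 10f));
        System.out.println(tag(glyphs)); // prints x<sup>2</sup> H<sub>2</sub>O
    }
}
```

Tracking the open tag explicitly guarantees that every <sup> or <sub> gets a matching close, even when a line ends inside a superscript.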

Source: https://stackoverflow.com/questions/27700500/superscript-and-subscript-differentiation-using-pdf-box

Tags

java

pdfbox