Superscript and subscript differentiation using PDFBox

Posted by 一个人想着一个人 on 2019-12-07 11:48:06

Question


I am new to PDFBox. Is there any way to distinguish superscript and subscript text from normal text, either while extracting text from a PDF with the PDFBox library or afterwards? Thanks.


Answer 1:


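See whether the PrintTextLocations example from the PDFBox source tree helps. It prints the position and size of each extracted character: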

https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java




Answer 2:


I was able to identify most superscripts by looking for changes in Y and height between consecutive characters. Try this:

Write your own implementation of PDFTextStripper.

Add this to writePage() to split superscripts off into separate words:

if ((position.getY() < lastPosition.getTextPosition().getY()
        && position.getHeight() < lastPosition.getTextPosition().getHeight())
    || (position.getY() > lastPosition.getTextPosition().getY()
        && position.getHeight() > lastPosition.getTextPosition().getHeight()))
{
    // Higher and smaller (a superscript starts) or lower and larger (it ends):
    // force a word break at the transition
    line.add(WordSeparator.getSeparator());
}
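To see what the condition above does without a PDF at hand, here is a minimal stand-alone sketch. The `Glyph` class is a hypothetical stand-in for PDFBox's `TextPosition` and carries only the two values the heuristic reads. The class and method names are made up for illustration, and Y is assumed to grow downward, as the comparisons in the answer imply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ScriptBoundarySketch {

    /** Hypothetical stand-in for TextPosition: text, Y and height only. */
    static final class Glyph {
        final String text;
        final float y;      // assumed to grow downward on the page
        final float height;

        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Same test as the answer: raised and smaller, or lowered and larger. */
    static boolean isScriptBoundary(Glyph previous, Glyph current) {
        boolean raisedAndSmaller = current.y < previous.y && current.height < previous.height;
        boolean loweredAndLarger = current.y > previous.y && current.height > previous.height;
        return raisedAndSmaller || loweredAndLarger;
    }

    /** Concatenates glyphs into words, starting a new word at every boundary. */
    static List<String> splitAtBoundaries(List<Glyph> glyphs) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null && isScriptBoundary(previous, glyph) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // "x", a raised and smaller "2", then "+y" back on the baseline
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("x", 100f, 10f),
                new Glyph("2", 96f, 6f),
                new Glyph("+", 100f, 10f),
                new Glyph("y", 100f, 10f));
        System.out.println(splitAtBoundaries(glyphs)); // prints [x, 2, +y]
    }
}
```

Note that a subscript (lower and smaller) matches neither branch of this test, so this condition alone only isolates superscripts.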

Then add this to writeLine() to write a tag before the first superscript word and before the first word after it:

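// Runs inside the loop over the words of the line; i is the word index.
// prevY and prevHeight are float variables declared before the loop.
if (word.textPositions.size() > 0)
{
    TextPosition firstChar = word.textPositions.get(0);

    if (i == 0)
    {
        // First word of the line: nothing to compare against yet
        prevY = firstChar.getY();
        prevHeight = firstChar.getHeight();
    }

    if (prevY != 0)
    {
        if (firstChar.getY() < prevY && firstChar.getHeight() < prevHeight)
        {
            // Higher and smaller than the previous word: a superscript starts
            output.write("<sup>");
        }
        else if (firstChar.getY() > prevY && firstChar.getHeight() > prevHeight)
        {
            // Lower and larger than the previous word: the superscript has ended
            output.write("</sup>");
        }
    }

    // Write the word itself in every case
    writeString(word.getText(), word.getTextPositions());

    // Remember this word so that the next one is compared against it
    prevY = firstChar.getY();
    prevHeight = firstChar.getHeight();
}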
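The writeLine() fragment above only tags superscripts, while the question also asks about subscripts. The following stand-alone sketch shows one way the same previous-word comparison might be extended to both cases. It is not part of the answer and does not call PDFBox: `Word` is a hypothetical stand-in for a word's first `TextPosition`, and treating "lower and smaller" as a subscript is an assumption that will not hold for every PDF.

```java
import java.util.Arrays;
import java.util.List;

final class ScriptTaggerSketch {

    /** Hypothetical stand-in for a word and the Y/height of its first character. */
    static final class Word {
        final String text;
        final float y;      // assumed to grow downward on the page
        final float height;

        Word(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Wraps runs of raised or lowered smaller words in sup or sub tags. */
    static String tagLine(List<Word> words) {
        StringBuilder out = new StringBuilder();
        String openTag = null; // "sup", "sub", or null when on the baseline
        Word previous = null;
        for (Word word : words) {
            if (previous != null) {
                boolean smaller = word.height < previous.height;
                boolean larger = word.height > previous.height;
                if (openTag != null && larger) {
                    // Text grew back to normal size: close the open script
                    out.append("</").append(openTag).append('>');
                    openTag = null;
                } else if (openTag == null && smaller && word.y < previous.y) {
                    openTag = "sup"; // smaller and higher on the page
                    out.append("<sup>");
                } else if (openTag == null && smaller && word.y > previous.y) {
                    openTag = "sub"; // smaller and lower on the page (assumption)
                    out.append("<sub>");
                }
            }
            out.append(word.text);
            previous = word;
        }
        if (openTag != null) {
            // A script that runs to the end of the line still needs closing
            out.append("</").append(openTag).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> line = Arrays.asList(
                new Word("H", 100f, 10f),
                new Word("2", 103f, 6f),   // lowered and smaller
                new Word("O", 100f, 10f),
                new Word("+x", 100f, 10f),
                new Word("2", 96f, 6f));   // raised and smaller
        System.out.println(tagLine(line)); // prints H<sub>2</sub>O+x<sup>2</sup>
    }
}
```

Tracking the open tag explicitly also closes a script that ends the line, a case the previous-word comparison alone never sees.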


Source: https://stackoverflow.com/questions/27700500/superscript-and-subscript-differentiation-using-pdf-box
