Question
I am new to PDFBox. Is there any way to differentiate superscript and subscript text from normal text, either while extracting or after extracting text from a PDF with the PDFBox library? Thanks.
Answer 1:
See whether the PrintTextLocations example helps. It prints the position and size of each extracted character:
https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java
Answer 2:
I was able to identify most superscripts by looking for changes in Y position and glyph height. Try this:
Write your own implementation of PDFTextStripper.
In writePage(), add the following so that superscripts are split off into separate words:
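// Start a new word when Y and height both decrease (a superscript begins)
// or both increase (a superscript ends). A smaller Y is higher on the page.
if ((position.getY() < lastPosition.getTextPosition().getY()
        && position.getHeight() < lastPosition.getTextPosition().getHeight())
    || (position.getY() > lastPosition.getTextPosition().getY()
        && position.getHeight() > lastPosition.getTextPosition().getHeight()))
{
    line.add(WordSeparator.getSeparator());
}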
Then add this to writeLine() to write a tag before and after each superscript:
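// i is the index of the word in the line; prevY and prevHeight are
// assumed to be float fields added to your stripper class.
if (word.textPositions.size() > 0)
{
    TextPosition firstChar = word.textPositions.get(0);
    if (i == 0)
    {
        // First word of the line: use it as the initial reference.
        prevY = firstChar.getY();
        prevHeight = firstChar.getHeight();
    }
    if (prevY != 0)
    {
        if (firstChar.getY() < prevY && firstChar.getHeight() < prevHeight)
        {
            // Higher and smaller than the previous word: a superscript starts.
            output.write("<sup>");
        }
        else if (firstChar.getY() > prevY && firstChar.getHeight() > prevHeight)
        {
            // Lower and larger than the previous word: the superscript has ended.
            output.write("</sup>");
        }
    }
    // The word itself is written in every case.
    writeString(word.getText(), word.getTextPositions());
    // Remember this word so the next one is compared against it;
    // without this update the </sup> branch can never fire.
    prevY = firstChar.getY();
    prevHeight = firstChar.getHeight();
}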
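The same Y/height heuristic can be tried outside PDFBox, and extended to the subscripts the question also asks about. The following is only a minimal sketch under stated assumptions: `Glyph` is a hypothetical stand-in for TextPosition that holds just the values getY() and getHeight() would supply, `TOLERANCE` is an arbitrary slack value, and using the tallest glyph on the line as the body-text reference is a design guess rather than anything PDFBox does.

```java
import java.util.List;

public class ScriptTagger {

    /** Hypothetical stand-in for TextPosition: text, Y from the page top, glyph height. */
    public static final class Glyph {
        final String text;
        final float y;
        final float height;

        public Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    public enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Assumed slack for rounding noise in coordinates; tune per document. */
    static final float TOLERANCE = 0.5f;

    /** Classifies a glyph relative to a body-text reference glyph. */
    public static Kind classify(Glyph g, Glyph reference) {
        if (g.height >= reference.height - TOLERANCE) {
            return Kind.NORMAL;       // not noticeably smaller than body text
        }
        if (g.y < reference.y - TOLERANCE) {
            return Kind.SUPERSCRIPT;  // smaller and higher on the page
        }
        if (g.y > reference.y + TOLERANCE) {
            return Kind.SUBSCRIPT;    // smaller and lower on the page
        }
        return Kind.NORMAL;           // smaller, but at the reference Y
    }

    /** Wraps runs of superscript/subscript glyphs in sup/sub tags. */
    public static String tag(List<Glyph> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Assumption: the tallest glyph on the line represents body text.
        Glyph reference = line.get(0);
        for (Glyph g : line) {
            if (g.height > reference.height) {
                reference = g;
            }
        }
        StringBuilder out = new StringBuilder();
        Kind open = Kind.NORMAL;
        for (Glyph g : line) {
            Kind kind = classify(g, reference);
            if (kind != open) {
                // Close the previous run (if any) and open the new one.
                out.append(closeTag(open)).append(openTag(kind));
                open = kind;
            }
            out.append(g.text);
        }
        // Close a run that is still open at the end of the line.
        return out.append(closeTag(open)).toString();
    }

    private static String openTag(Kind k) {
        return k == Kind.SUPERSCRIPT ? "<sup>" : k == Kind.SUBSCRIPT ? "<sub>" : "";
    }

    private static String closeTag(Kind k) {
        return k == Kind.SUPERSCRIPT ? "</sup>" : k == Kind.SUBSCRIPT ? "</sub>" : "";
    }
}
```

For example, glyphs "H" (Y 100, height 10), "2" (Y 103, height 6) and "O" (Y 100, height 10) come out as `H<sub>2</sub>O`. Unlike the writeLine() snippet above, this sketch compares every glyph against one reference per line instead of against the previous word, and it closes a tag that is still open when the line ends.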
Source: https://stackoverflow.com/questions/27700500/superscript-and-subscript-differentiation-using-pdf-box