I am new to PDFBox. Is there any way to differentiate superscript and subscript text from normal text, either while extracting text from a PDF with the PDFBox library or afterwards? Thanks.
Ritz
I was able to identify most superscripts by looking for changes in Y and height between consecutive characters. Try this:
Write your own implementation of PDFTextStripper.
Add this to writePage() to split superscripts into separate words:
if((position.getY() < lastPosition.getTextPosition().getY()
        && position.getHeight() < lastPosition.getTextPosition().getHeight())
    || (position.getY() > lastPosition.getTextPosition().getY()
        && position.getHeight() > lastPosition.getTextPosition().getHeight()))
    line.add(WordSeparator.getSeparator());
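The boundary test above can be tried in isolation. This is only a minimal sketch: it assumes, as the conditions above imply, that a smaller Y means higher on the page. The class ShiftCheck and its method names are hypothetical, and plain floats stand in for the values that position.getY() and position.getHeight() would return.

```java
public class ShiftCheck {
    /** True when the glyph sits higher (smaller Y) and is smaller than the previous one. */
    public static boolean entersSuperscript(float prevY, float prevHeight, float y, float height) {
        return y < prevY && height < prevHeight;
    }

    /** True when the glyph sits lower (larger Y) and is larger than the previous one. */
    public static boolean leavesSuperscript(float prevY, float prevHeight, float y, float height) {
        return y > prevY && height > prevHeight;
    }

    /** Either transition means a word separator belongs between the two glyphs. */
    public static boolean needsSeparator(float prevY, float prevHeight, float y, float height) {
        return entersSuperscript(prevY, prevHeight, y, height)
                || leavesSuperscript(prevY, prevHeight, y, height);
    }

    public static void main(String[] args) {
        // Made-up coordinates: body text at Y=700, height 10; raised glyph at Y=695, height 6.
        System.out.println(needsSeparator(700f, 10f, 695f, 6f)); // body -> raised
        System.out.println(needsSeparator(695f, 6f, 700f, 10f)); // raised -> body
        System.out.println(needsSeparator(700f, 10f, 700f, 10f)); // body -> body
    }
}
```

Note that under the same assumption a subscript would start with a larger Y and a smaller height, which neither branch matches, so these conditions appear to cover superscripts only.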
Then add this to writeLine() to write a <sup> tag before each superscript and a </sup> tag after it:
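if(word.textPositions.size()>0)
{
    TextPosition firstChar = word.textPositions.get(0);
    if(i==0)
    {
        // first word on the line: use it as the reference
        prevY = firstChar.getY();
        prevHeight = firstChar.getHeight();
    }
    if(prevY!=0)
    {
        if(firstChar.getY() < prevY && firstChar.getHeight() < prevHeight)
        {
            // higher and smaller than the previous word: superscript starts
            output.write("<sup>");
        }
        else if(firstChar.getY() > prevY && firstChar.getHeight() > prevHeight)
        {
            // lower and larger than the previous word: superscript has ended
            output.write("</sup>");
        }
    }
    // always write the word itself, whether or not a tag was emitted
    writeString(word.getText(), word.getTextPositions());
    // assumed step, not shown in the original snippet: remember this word
    // so that the next one is compared against it
    prevY = firstChar.getY();
    prevHeight = firstChar.getHeight();
}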
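To see the tagging logic run without a PDF, here is a rough, self-contained sketch. It is not the PDFBox API: SupTagger and Word are hypothetical names, Word stands in for a word plus the Y and height of its first character, and the coordinates are made up. Unlike the snippet above, it tracks an explicit inSup flag and closes a superscript that is still open at the end of the line.

```java
import java.util.Arrays;
import java.util.List;

public class SupTagger {
    /** Hypothetical stand-in for a word and the Y/height of its first character. */
    public static final class Word {
        final String text;
        final float y;
        final float height;

        public Word(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Wraps runs of raised, smaller words in <sup>...</sup>. */
    public static String tag(List<Word> line) {
        StringBuilder out = new StringBuilder();
        boolean inSup = false;
        Word prev = null;
        for (Word w : line) {
            if (prev != null) {
                if (!inSup && w.y < prev.y && w.height < prev.height) {
                    out.append("<sup>"); // higher and smaller: superscript starts
                    inSup = true;
                } else if (inSup && w.y > prev.y && w.height > prev.height) {
                    out.append("</sup>"); // lower and larger: back to body text
                    inSup = false;
                }
            }
            out.append(w.text);
            prev = w; // compare the next word against this one
        }
        if (inSup) {
            out.append("</sup>"); // close a superscript that ends the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> line = Arrays.asList(
                new Word("E=mc", 700f, 10f),
                new Word("2", 695f, 6f),
                new Word(" holds", 700f, 10f));
        System.out.println(tag(line)); // prints E=mc<sup>2</sup> holds
    }
}
```

The flag keeps the tags balanced even when a line starts or ends inside a superscript, which the pairwise comparison alone does not guarantee.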
Source: https://stackoverflow.com/questions/27700500/superscript-and-subscript-differentiation-using-pdf-box