iTextSharp inserting spaces within words from a pdf file

Submitted by 。_饼干妹妹 on 2019-12-06 15:20:16
mkl

Why this happens

The cause is actually a feature of the text extraction strategy that, in your case, does not work as desired.

A bit of background: what you perceive as a space between words in a PDF file does not necessarily come from an instruction drawing a space character; it can also be the result of an instruction shifting the text insertion position a little to the right. Thus, text extraction strategies usually add a space character when they find a sufficiently large right shift of that kind. For more on this (in particular the "sufficiently large" part), see e.g. this answer.

In the case of your document, though, the font used for the body text declares character widths that are too small (if used as is, the characters would appear glued together with no space in between whatsoever); thus, there are small right shifts between each pair of consecutive characters, and some of these shifts are wide enough to be falsely identified as word separations by the mechanism explained above.
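To make the mechanism concrete, here is a minimal, self-contained sketch of such an implied-space heuristic. It is not iTextSharp code: the Chunk class, the coordinates, and the single-space width of 2.5 are made-up assumptions, and only the "gap larger than half a space width" test mirrors the check in SimpleTextExtractionStrategy.

```csharp
using System;
using System.Text;

// Hypothetical, simplified text chunk: its text plus the horizontal
// start and end of its baseline. Not an iTextSharp type.
public sealed class Chunk
{
    public readonly string Text;
    public readonly float StartX;
    public readonly float EndX;

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class ImpliedSpaceDemo
{
    // Concatenates chunks; if insertImpliedSpaces is true, a space is added
    // whenever the gap to the previous chunk exceeds half a space width.
    public static string Join(Chunk[] chunks, float singleSpaceWidth, bool insertImpliedSpaces)
    {
        var result = new StringBuilder();
        Chunk previous = null;
        foreach (Chunk chunk in chunks)
        {
            bool canInsert = insertImpliedSpaces
                && previous != null
                && result.Length > 0
                && result[result.Length - 1] != ' '
                && chunk.Text.Length > 0
                && chunk.Text[0] != ' ';
            if (canInsert)
            {
                float gap = Math.Abs(chunk.StartX - previous.EndX);
                if (gap > singleSpaceWidth / 2f)
                {
                    result.Append(' ');
                }
            }
            result.Append(chunk.Text);
            previous = chunk;
        }
        return result.ToString();
    }

    // Made-up sample: one word drawn letter by letter, with a compensating
    // shift of 1.5 units between "o" and "r" (assumed values).
    public static Chunk[] SampleChunks()
    {
        return new Chunk[]
        {
            new Chunk("w", 0f, 4f),
            new Chunk("o", 4.5f, 8f),
            new Chunk("r", 9.5f, 12f),
            new Chunk("d", 12.4f, 16f)
        };
    }

    public static void Main()
    {
        Console.WriteLine(Join(SampleChunks(), 2.5f, true));  // wo rd
        Console.WriteLine(Join(SampleChunks(), 2.5f, false)); // word
    }
}
```

With the heuristic switched on, the 1.5-unit shift exceeds the threshold of 1.25 and is mistaken for a word break; with it switched off, the letters are joined as drawn.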

How to resolve this

As word separations in your PDF are created by instructions drawing a space character, you do not need the feature explained above. Thus, the easiest way to resolve the issue is to use a text extraction strategy without that feature.

You can create such a strategy by copying the source code of the SimpleTextExtractionStrategy (e.g. from here) and commenting out some lines in the method RenderText, as below:

public virtual void RenderText(TextRenderInfo renderInfo)
{
    [...]

    if (hardReturn)
    {
        //System.out.Println("<< Hard Return >>");
        AppendTextChunk('\n');
    }
    else if (!firstRender)
    {
//        if (result[result.Length - 1] != ' ' && renderInfo.GetText().Length > 0 && renderInfo.GetText()[0] != ' ')
//        { // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
//            float spacing = lastEnd.Subtract(start).Length;
//            if (spacing > renderInfo.GetSingleSpaceWidth() / 2f)
//            {
//                AppendTextChunk(' ');
//                //System.out.Println("Inserting implied space before '" + renderInfo.GetText() + "'");
//            }
//        }
    }
    else
    {
        //System.out.Println("Displaying first string of content '" + text + "' :: x1 = " + x1);
    }

    [...]
}

Using this simplified extraction strategy, your text is properly extracted.
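For completeness, here is a minimal sketch of how such a strategy might be applied page by page. It assumes iTextSharp 5.x. The helper name PdfTextDump, the file name, and the strategy class name NoImpliedSpaceTextExtractionStrategy used in the usage line are placeholders for whatever you call your own code.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class PdfTextDump
{
    // createStrategy is called once per page, because strategies of the
    // SimpleTextExtractionStrategy kind accumulate text across calls.
    public static string ExtractText(string pdfPath, Func<ITextExtractionStrategy> createStrategy)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = createStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

A call could then look like PdfTextDump.ExtractText("input.pdf", () => new NoImpliedSpaceTextExtractionStrategy()), where the class name stands for your modified copy of SimpleTextExtractionStrategy.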
