itext java pdf to text creation

栀梦 2020-12-02 02:55

I use iText to convert a PDF to a text file. It works well overall, but for some words it does the following: for example, the PDF contains a phrase like "present the mai

3 answers
  •  庸人自扰
    2020-12-02 03:26

    To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text, and every letter came out separated by a space.

    text would read as "t e x t"
    

    I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I tried to apply to the "single space width" value, the text came out the same way as before. So I debugged my code to see the width value itself and it turned out to be 0.

    To work around that, you can use a fixed value in the code outlined by mkl:

    float spacing = lastEnd.subtract(start).length();
    if (spacing > someFixValue)
    {
        result.append(' ');
    }
    

    If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is IsChunkAtWordBoundary(...) (spelled isChunkAtWordBoundary(...) in the Java version of iText).
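
    The fixed-threshold idea can be tried out without iText at all. The following is only a minimal sketch under stated assumptions: Glyph and FixedGapJoiner are hypothetical names that are not part of iText, chunks are assumed to sit on one horizontal line, and the threshold of 2.0 is an arbitrary illustrative value that would need tuning per document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a positioned text chunk; NOT an iText class.
final class Glyph {
    final String text;
    final float startX; // x coordinate where the chunk begins
    final float endX;   // x coordinate where the chunk ends

    Glyph(String text, float startX, float endX) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
    }
}

// Hypothetical helper illustrating the fixed-gap rule from the answer.
final class FixedGapJoiner {
    // Joins the chunks of one line, appending a space only where the
    // distance between consecutive chunks exceeds the fixed threshold.
    static String join(List<Glyph> line, float fixedGap) {
        StringBuilder result = new StringBuilder();
        Glyph previous = null;
        for (Glyph current : line) {
            if (previous != null) {
                // Absolute distance between the previous end and the current start.
                float spacing = Math.abs(current.startX - previous.endX);
                if (spacing > fixedGap) {
                    result.append(' ');
                }
            }
            result.append(current.text);
            previous = current;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Letters 0.5 units apart belong to one word; a 4.5 unit gap starts a new word.
        List<Glyph> line = Arrays.asList(
                new Glyph("t", 0f, 5f),
                new Glyph("e", 5.5f, 10.5f),
                new Glyph("x", 11f, 16f),
                new Glyph("t", 16.5f, 21.5f),
                new Glyph("o", 26f, 31f),
                new Glyph("k", 31.5f, 36.5f));
        System.out.println(join(line, 2.0f));
    }
}
```

    Because the rule compares against a constant rather than the font's reported space width, it still behaves sensibly when that width comes back as 0. The trade-off is that a threshold which suits one font size may be too small or too large for another.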
