I use iText to convert PDFs to text files. It works well, but for some words it does the following: for example, the PDF contains a phrase like "present the mai
To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out separated by a space.
For example, "text" would read as "t e x t".
I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I applied to the "single space width" value, the text came out the same way as before. So I debugged my code to inspect the width value itself, and it turned out to be 0. Any factor times 0 is still 0, so every gap wider than zero was treated as a space.
To circumvent that, you can use a fixed value in the code outlined by mkl:
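    // distance between the end of the previous chunk and the start of the current one
    float spacing = lastEnd.subtract(start).length();
    // compare against a fixed value instead of a factor of the (zero) single space width
    if (spacing > someFixValue)
    {
        result.append(' ');
    }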
If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is IsChunkAtWordBoundary(...).
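To illustrate why a fixed value helps, here is a minimal, self-contained sketch that does not use iText at all. The class GapSpacingDemo, the Chunk type, the sample coordinates, and the threshold of 1 are all hypothetical and chosen only for illustration. It also simplifies the vector distance from the snippet above to a difference of x coordinates on a single line. A suitable fixed value for a real document has to be found by experiment.

```java
import java.util.Arrays;
import java.util.List;

// Self-contained sketch, no iText classes involved. Chunk is a hypothetical
// stand-in for one rendered text chunk with its baseline start and end x.
class GapSpacingDemo {

    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Joins the chunks of one line, appending a space whenever the gap between
    // the previous chunk's end and the current chunk's start exceeds threshold.
    static String join(List<Chunk> chunks, float threshold) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null) {
                float spacing = Math.abs(chunk.startX - last.endX);
                if (spacing > threshold) {
                    result.append(' ');
                }
            }
            result.append(chunk.text);
            last = chunk;
        }
        return result.toString();
    }

    // Made-up sample line: "text" drawn letter by letter with 0.5-unit gaps,
    // followed by the word "out" after a 3-unit gap.
    static List<Chunk> sampleLine() {
        return Arrays.asList(
                new Chunk("t", 0f, 5f),
                new Chunk("e", 5.5f, 10.5f),
                new Chunk("x", 11f, 16f),
                new Chunk("t", 16.5f, 21.5f),
                new Chunk("out", 24.5f, 39.5f));
    }

    public static void main(String[] args) {
        float singleSpaceWidth = 0f;  // the value observed while debugging
        float someFixValue = 1f;      // assumed value, tune per document

        // Threshold derived from a zero width: every positive gap becomes a space.
        System.out.println(join(sampleLine(), singleSpaceWidth / 2f)); // prints "t e x t out"
        // Fixed threshold: only the real word gap becomes a space.
        System.out.println(join(sampleLine(), someFixValue));          // prints "text out"
    }
}
```

With the threshold derived from a width of 0, no factor can change the outcome, which matches what I saw. The fixed value only has to sit between the typical gap between letters and the typical gap between words.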