itext java pdf to text creation

栀梦 2020-12-02 02:55

I use iText to convert a PDF to a text file. It works well, but for some words it does the following: for example, the PDF contains a phrase like "present the mai

3 answers

庸人自扰 (original poster)

2020-12-02 03:26
To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out separated by a space.
```
text would read as "t e x t"
```
I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I tried to apply to the "single space width" value, the text came out the same way as before. So I debugged my code to see the width value itself and it turned out to be 0.

To circumvent that, you can use a fixed value in the code outlined by mkl:
```
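// Distance between the end of the previous chunk and the start of the current one
float spacing = lastEnd.subtract(start).length();
// Compare against a fixed threshold instead of the (zero) single space width
if (spacing > someFixValue)
{
    result.append(' ');
}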
If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is IsChunkAtWordBoundary(...) (spelled isChunkAtWordBoundary(...) in the Java version of iText).
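To make the effect of the fixed threshold easier to see, here is a small stand-alone sketch that does not use iText at all. `Chunk` and `FixedGapJoiner` are names I made up for illustration, and the chunks are reduced to x coordinates on a single line, so treat it as a model of the idea rather than as the library's behaviour. With a threshold of 0 (which is presumably what any factor multiplied by a single space width of 0 amounts to), every tiny gap between glyphs produces a space; with a fixed positive threshold only the real word gaps do.

```java
import java.util.List;

// Hypothetical stand-in for a positioned text chunk; this is NOT an iText class.
final class Chunk {
    final String text;
    final float startX; // x coordinate where the chunk begins
    final float endX;   // x coordinate where the chunk ends

    Chunk(String text, float startX, float endX) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
    }
}

// Hypothetical helper illustrating the fixed-threshold check from the answer.
final class FixedGapJoiner {
    // Joins chunks that lie on one line, inserting a space only when the
    // horizontal gap to the previous chunk is larger than fixedGap.
    static String join(List<Chunk> chunks, float fixedGap) {
        StringBuilder result = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : chunks) {
            if (previous != null) {
                float spacing = Math.abs(chunk.startX - previous.endX);
                if (spacing > fixedGap) {
                    result.append(' ');
                }
            }
            result.append(chunk.text);
            previous = chunk;
        }
        return result.toString();
    }
}
```

The right value for the threshold depends on the document's font size and coordinate scale, so it probably has to be found by trying a few values on the affected PDF.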