pyPdf unable to extract text from some pages in my PDF

伪装坚强ぢ 2021-01-05 13:07

I'm trying to use pyPdf to extract and print pages from a multipage PDF. The problem is that text is not extracted from some pages. I've put an example file here:

http://w

6 answers

夕颜 (original poster)

2021-01-05 13:35
Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Since it is the text you want, you can use the pdftotext command-line tool instead (on Linux it ships with the Poppler utilities).

To invoke that using Python, you can do this:
```
>>> import subprocess
>>> # call() returns the exit status; 0 means pdftotext succeeded
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
```
The text is extracted from forms.pdf and saved to output.

This works in the case of your PDF file and extracts the text you want.
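
Since the question is about getting text page by page, here is a rough sketch of how the same pdftotext call could be wrapped so the text comes back as a Python string instead of going to a file. It assumes Python 3, a pdftotext binary on your PATH, and the default UTF-8 output encoding. The helper names build_pdftotext_cmd and extract_text are made up for this example; the -f and -l options (first and last page) and the "-" output name (write to stdout) are standard pdftotext options.
```
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert (1-based)
    cmd += [pdf_path, "-"]              # "-" sends the text to stdout
    return cmd


def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a str (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        check=True,           # raise CalledProcessError on a non-zero exit status
        capture_output=True,  # collect stdout and stderr instead of printing them
    )
    # Assumes pdftotext's default UTF-8 output; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```
With this, extract_text('forms.pdf', first_page=2, last_page=2) should give you just the text of page 2, so you can loop over page numbers and see which pages come back empty.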