Python module for converting PDF to text [closed]


陌清茗 2020-11-22 08:59

13 answers

独厮守ぢ (original poster)

2020-11-22 09:44
pyPdf works fine, assuming that you're working with well-formed PDFs. If all you want is the text (with spaces), you can just do:
```
import pyPdf

# filename is the path to your PDF; the file must stay open while pages are read.
with open(filename, "rb") as f:
    pdf = pyPdf.PdfFileReader(f)
    for page in pdf.pages:
        print(page.extractText())
```
You can also easily get access to the metadata, image data, and so forth.
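
As a rough sketch of the metadata access mentioned above: this assumes the reader exposes `getNumPages()` and `getDocumentInfo()`, as `pyPdf.PdfFileReader` does. The helper name `summarize_pdf` and the layout of the returned dict are made up for illustration and are not part of the library.

```python
def summarize_pdf(reader):
    """Return page count plus title/author from a pyPdf-style reader.

    `reader` is assumed to expose getNumPages() and getDocumentInfo(),
    as pyPdf.PdfFileReader does. The function name and the dict layout
    are illustrative only.
    """
    info = reader.getDocumentInfo()  # may be None if the PDF carries no info entry
    return {
        "pages": reader.getNumPages(),
        "title": info.title if info is not None else None,
        "author": info.author if info is not None else None,
    }
```

You would call it as `summarize_pdf(pyPdf.PdfFileReader(f))` while the file `f` is still open.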

A comment in the extractText code notes:

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, or if the generator adds text to the stream in the order it will be displayed, it's fine). I have pyPdf extraction code in daily use, without any problems.
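
To illustrate the "order doesn't matter" case, here is a minimal sketch of a word count built on top of `extractText()`. The helper name `word_counts` is hypothetical, and it only assumes each page object has an `extractText()` method returning a string, as pyPdf pages do. Because it just tallies words, the result is the same regardless of the order in which the text comes out.

```python
import re
from collections import Counter


def word_counts(pages):
    """Tally lowercase words across pages; independent of text order.

    `pages` is any iterable of objects with an extractText() method,
    e.g. pdf.pages from a pyPdf reader. The helper name is illustrative.
    """
    counts = Counter()
    for page in pages:
        text = page.extractText()
        # Split on word characters so stray spacing does not matter.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts
```

With a pyPdf reader this would be `word_counts(pdf.pages)`. Anything that depends on reading order (for example, reconstructing paragraphs) would need more care, per the comment quoted above.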