Question
How can I read a PDF in Python? I know one way is to convert it to text first, but I want to read the content directly from the PDF.
Can anyone explain which Python module is best for PDF extraction?
Answer 1:
You can use the PyPDF2 package.
# install PyPDF2
pip install PyPDF2
# importing all the required modules
import PyPDF2
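# open the PDF in binary mode
file = open('example.pdf', 'rb')
# create a PDF reader object (PdfFileReader was removed in PyPDF2 3.0; PdfReader replaces it)
fileReader = PyPDF2.PdfReader(file)
# print the number of pages in the PDF
print(len(fileReader.pages))
# close the file when done
file.close()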
See the documentation: http://pythonhosted.org/PyPDF2/
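Since the question asks about reading the content rather than just counting pages, here is a minimal sketch of pulling the text out of every page with PyPDF2. It assumes a recent PyPDF2 release that provides PdfReader and page.extract_text(); the function names join_page_texts and read_pdf_text are made up for this example and are not part of the library.

```python
from typing import Iterable, List


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    # Hypothetical helper: drops empty or whitespace-only pages.
    cleaned: List[str] = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned)


def read_pdf_text(path: str) -> str:
    """Return the text of every page in the PDF at `path`."""
    # Imported lazily so join_page_texts works even without PyPDF2 installed.
    import PyPDF2

    with open(path, "rb") as fh:
        reader = PyPDF2.PdfReader(fh)
        # extract_text() can come back empty for scanned, image-only pages.
        return join_page_texts(page.extract_text() for page in reader.pages)
```

Calling read_pdf_text('example.pdf') should give you one string with a blank line between pages. Scanned PDFs usually contain no text layer, so expect an empty result for those unless you run OCR first.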
Answer 2:
You can use the textract module in Python.
To install it:
pip install textract
To read a PDF:
import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
For details, see the Textract documentation.
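One thing that tends to surprise people: textract.process returns bytes, not str, so you normally decode the result before using it. The sketch below does that and also splits the output into pages. It assumes the pdfminer method separates pages with form feed characters, which is worth checking against your own files; split_pages and read_pdf_pages are hypothetical names for this example.

```python
from typing import List


def split_pages(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode extractor output and split it on form feed page breaks."""
    text = raw.decode(encoding, errors="replace")
    # Assumption: pages are separated by "\f"; empty chunks are dropped.
    return [page.strip() for page in text.split("\f") if page.strip()]


def read_pdf_pages(path: str) -> List[str]:
    """Extract a PDF with textract and return one string per page."""
    # Imported lazily so split_pages works even without textract installed.
    import textract

    raw = textract.process(path, method="pdfminer")
    return split_pages(raw)
```

If the output has no form feeds, you simply get a single-element list containing the whole document.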
Answer 3:
Try PyPDF2.
There is a good tutorial here: https://automatetheboringstuff.com/chapter13/
Source: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python