Question
I get an error when I try to read Cyrillic data from a PDF:
import codecs
pdfFileObj = codecs.open('1.pdf', 'rb','utf-8')
The error is
'utf8' codec can't decode byte 0x9c in position 1: invalid start byte
Answer 1:
A PDF is not a text file
A PDF is not Unicode text. It is a binary container full of streams that hold text, images and so on, so decoding the whole file as UTF-8 fails.
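A minimal sketch of why the decode fails. The sample bytes are an assumption: a made-up header imitating how many PDFs begin (a version line followed by a comment of high bytes), not the contents of the asker's 1.pdf. The helper names are hypothetical.

```python
# Hypothetical sample: a PDF-style header followed by non-ASCII bytes.
SAMPLE = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF magic marker."""
    return data.startswith(b"%PDF-")


def decode_utf8_or_none(data: bytes):
    """Try to decode bytes as UTF-8; return None if they are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(looks_like_pdf(SAMPLE))       # the marker check works on raw bytes
    print(decode_utf8_or_none(SAMPLE))  # decoding the raw bytes as text fails
```

Opening the file with plain open(path, 'rb') and handing the bytes to a PDF library avoids the decode step entirely.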
Use a PDF library
Take a look at PyPDF2. To get the text from the first page:
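from PyPDF2 import PdfFileReader
# PyPDF2 < 3.0 API; newer releases use PdfReader, reader.pages[0] and extract_text()
pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()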
If the extracted text comes out garbled, you might also need to reinterpret it as windows-1251:
text = text.encode('latin-1').decode('windows-1251')
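The re-encoding step can be sketched in isolation as follows. This assumes the garbling came from windows-1251 bytes being read as Latin-1, which may not match every PDF. The function name and the sample word are hypothetical.

```python
def fix_cp1251_mojibake(text: str) -> str:
    """Reinterpret Latin-1-decoded text as windows-1251.

    Returns the input unchanged if it cannot be round-tripped, for
    example when it already contains real Cyrillic characters.
    """
    try:
        return text.encode("latin-1").decode("windows-1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage: windows-1251 bytes wrongly decoded as Latin-1.
original = "Привет"
garbled = original.encode("windows-1251").decode("latin-1")

if __name__ == "__main__":
    print(garbled)
    print(fix_cp1251_mojibake(garbled))
```

The try/except means text that is already correct passes through untouched, so the function can be applied to every page without checking first.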
Source: https://stackoverflow.com/questions/46581122/how-to-get-data-from-pdf-in-cyrillic