How to get data from a PDF in Cyrillic?
Question

I get an error when I try to read Cyrillic data from a PDF:

    import codecs
    pdfFileObj = codecs.open('1.pdf', 'rb', 'utf-8')

The error is:

    'utf8' codec can't decode byte 0x9c in position 1: invalid start byte

Answer 1:

PDF is not a text file

A PDF is not Unicode text. It is full of binary streams that hold text, images and so on, so it cannot be decoded as UTF-8.

Use a PDF library

Take a look at PyPDF2. To get the text from the first page:

    from PyPDF2 import PdfFileReader

    # Open in binary mode; no text codec is involved.
    # PyPDF2 3.x renames these to PdfReader, pdf.pages[0] and extract_text().
    pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
    text = pdf.getPage(0).extractText()

Though you might also need to
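
To illustrate why the UTF-8 read in the question fails, here is a minimal sketch that uses only the standard library. The sample bytes, the temporary file and the helper names (`looks_like_pdf`, `try_text_read`, `demo`) are assumptions made for illustration; they are not part of PyPDF2 or of the original post. The sketch uses the built-in `open` with `encoding="utf-8"` instead of `codecs.open`, which fails in the same way on the same bytes.

```python
import os
import tempfile

# Hypothetical sample: a PDF-style header followed by bytes that are
# not valid UTF-8, similar to what a real PDF file starts with.
SAMPLE = b"%PDF-1.4\n%\x9c\xe2\xe3\xcf\n"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as fh:  # binary mode: no decoding happens
        return fh.read(5) == b"%PDF-"


def try_text_read(path):
    """Try to read the file as UTF-8 text; return the error text or None."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            fh.read()
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


def demo():
    """Write the sample to a temp file and run both checks on it."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(SAMPLE)
        return looks_like_pdf(path), try_text_read(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    is_pdf, error = demo()
    print("starts with %PDF-:", is_pdf)
    print("text read failed with:", error)
```

Reading in binary mode succeeds because no codec is applied, while the text-mode read stops at the first byte that is not valid UTF-8. That is why the file should be handed to a PDF library as a binary stream rather than decoded directly.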