'utf-8' codec can't decode byte 0xe2 : invalid continuation byte error

谁说胖子不能爱 submitted on 2020-05-24 03:41:06

Question


I am trying to read all PDF files in a folder and look for a number using a regular expression. On inspection, the charset for the PDFs is 'UTF-8'.

The code throws this error:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I tried reading in binary mode and with Latin-1 encoding, but the output is full of special characters, so the search finds nothing.

import os
import re

download_file_path = "C:\\Users\\...\\..\\"
# Compile once, outside the loop; a raw string keeps the backslashes literal.
re_api = re.compile(r"API No\.:\n(.*)")

for file_name in os.listdir(download_file_path):
    try:
        # Opening a PDF in text mode as UTF-8 is what raises the error above.
        with open(os.path.join(download_file_path, file_name), 'r', encoding="UTF-8") as f:
            s = f.read()
        api = re_api.search(s).group(1).split('"')[0].strip()
        print(api)
    except Exception as e:
        print(e)

I expect to find the API number in each PDF file.


Answer 1:


When you open a file with open(..., 'r', encoding='utf-8'), you are basically guaranteeing that it is a text file containing only valid UTF-8. That guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings in UTF-8, and even when it does, text mode is not the way to read it.
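The byte and position in the error message are probably not a coincidence. Many PDF writers follow the `%PDF-x.y` header line with a comment containing four bytes above 127 (commonly 0xE2 0xE3 0xCF 0xD3) to mark the file as binary. A minimal sketch, assuming a header of that shape; the sample bytes are made up for illustration and are not taken from the asker's files:

```python
# Hypothetical first bytes of a PDF: version line, then a binary-marker comment.
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

try:
    header.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xE2 announces a three-byte UTF-8 sequence, but the next byte (0xE3)
    # is not a continuation byte (0x80-0xBF), so decoding stops right here.
    failure = (hex(err.object[err.start]), err.start, err.reason)
    print(failure)
```

If your files start like this, the decoder fails before it gets anywhere near the text you are looking for.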

If you have access to a library which reads PDF and extracts text strings, you could do

# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
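Such libraries do exist. As a rough sketch, assuming the third-party pypdf package is installed (it is not mentioned in the question, and the function names below are made up for illustration), the same search could look like this:

```python
import re

# Same pattern as in the question, written as a raw string.
API_RE = re.compile(r"API No\.:\n(.*)")

def find_api_number(text):
    """Return what follows 'API No.:' on the next line, or None if absent."""
    match = API_RE.search(text)
    if match is None:
        return None
    return match.group(1).split('"')[0].strip()

def api_numbers_in_pdf(path):
    """Yield the API numbers found on each page of the PDF at `path`."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported lazily
    # so that find_api_number() can be used on its own without it.
    from pypdf import PdfReader

    for page in PdfReader(path).pages:
        api = find_api_number(page.extract_text() or "")
        if api is not None:
            yield api
```

Extracted text does not always preserve the line breaks you see on the page, so the pattern may need adjusting for your files.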

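More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file and look for the encoded text. This only works if the text is stored uncompressed in the file; many PDFs compress their content streams, in which case the string will not appear in the raw bytes.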

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()
if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()
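If you would rather keep the regular expression from the question, the same idea works with a bytes pattern. This is only a sketch under the same assumption (the text sits uncompressed in the file); the function name and the sample data are invented for the example:

```python
import re

# Bytes version of the question's pattern; "." stops at the end of the line.
API_RE = re.compile(rb"API No\.:\n(.*)")

def find_api_in_bytes(data):
    """Return the API number as str, or None if the marker is absent."""
    match = API_RE.search(data)
    if match is None:
        return None
    # Decode only the captured line; errors="replace" avoids a crash
    # if a few stray non-UTF-8 bytes end up in the capture.
    text = match.group(1).decode("utf-8", errors="replace")
    return text.split('"')[0].strip()

# Made-up sample: a binary header followed by the marker and a value.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\nAPI No.:\n42-123-45678\"\nmore"
print(find_api_in_bytes(sample))
```

Decoding only the matched line means the non-UTF-8 bytes elsewhere in the file never reach the decoder.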

A crude workaround is to lie to Python about the encoding, and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³" for example). You can extract actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
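The round trip can be checked on a short string. A small illustration (the variable names are arbitrary):

```python
original = "hëlló"
raw = original.encode("utf-8")       # the bytes as they would sit in the file
mojibake = raw.decode("latin-1")     # never fails: one character per byte
recovered = mojibake.encode("latin-1").decode("utf-8")
print(mojibake, recovered)
```

Because Latin-1 assigns a character to every byte value, the first decode cannot fail, and encoding back to Latin-1 restores the original bytes unchanged.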




Answer 2:


PDF files are stored as bytes. Therefore, to read or write a PDF file, you need to open it in 'rb' or 'wb' mode.

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())

The error 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte occurs because the PDF is (generally) not UTF-8 encoded, so q.decode() fails with its default encoding.

Therefore, use:

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any encoding which is suitable here

If your editor's console cannot display the resulting characters, you won't see any useful output either.

Note: you can't pass the encoding parameter when opening a file in rb mode, so you have to decode after reading the file.
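A short demonstration of that restriction, using a throwaway temporary file with made-up contents in place of a real PDF:

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on a real PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    path = tmp.name

try:
    open(path, "rb", encoding="utf-8")
    rejected = False
except ValueError as err:
    # Binary mode refuses an encoding argument.
    rejected = True
    print(err)

# Read the bytes first, then decode explicitly.
with open(path, "rb") as f:
    text = f.read().decode("latin-1")

os.remove(path)
```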




Answer 3:


The problem may be due to your computer name. I got this error in the Python Django framework.

The solution: your computer name must not contain special characters. Please check your computer name and change it if necessary.



Source: https://stackoverflow.com/questions/56453782/utf-8-codec-cant-decode-byte-0xe2-invalid-continuation-byte-error
