UnicodeDecodeError when reading CSV file in Pandas with Python

野趣味 2020-11-22 04:27

I'm running a program that processes 30,000 similar files. A random number of them stop and produce this error...

File "C:\Importer\src


        
21 Answers
  •  温柔的废话
    2020-11-22 05:08

    Pandas lets you specify the encoding, but it does not let you ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method, only different approaches depending on the actual use case.

    1. You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

      import pandas as pd
      file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
      df = pd.read_csv(input_file_and_path, encoding=file_encoding)  # plus any other read_csv arguments
      
    2. You do not want to be bothered with encoding questions and only want that damn file to load, even if some text fields end up containing garbage. OK, you only have to use the Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code):

      df = pd.read_csv(input_file_and_path, encoding='latin1')  # plus any other read_csv arguments
      
    3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and now contains some lines in a different encoding. Pandas has no provision for special error processing, but Python's open function does (assuming Python 3), and read_csv accepts a file-like object. Typical values for the errors parameter here are 'ignore', which just drops the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslash escape sequence:

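      file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
      # open the file ourselves so that undecodable bytes are escaped instead of raising
      with open(input_file_and_path, encoding=file_encoding, errors='backslashreplace') as input_fd:
          df = pd.read_csv(input_fd)  # plus any other read_csv arguments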
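
    To see what each of the three choices does to the data, here is a minimal sketch that uses only the standard library, so it runs without pandas. The sample bytes and the helper name read_rows are made up for illustration: a UTF-8 CSV whose last row was saved as cp1252. The helper wraps the bytes in a text stream the same way open(path, encoding=..., errors=...) would, and the csv module stands in for read_csv.

```python
import csv
import io

# Hypothetical sample data: a UTF-8 CSV whose last row was saved as cp1252,
# so its byte 0xE9 is not valid UTF-8.
# "\u00eb" is e-diaeresis and "\u00f3" is o-acute (written as escapes to stay ASCII).
RAW = "name,city\nZo\u00eb,Krak\u00f3w\n".encode("utf8") + b"Ren\xe9,Paris\n"


def read_rows(data, encoding, errors="strict"):
    """Parse CSV bytes through a text wrapper, like open(path, encoding=..., errors=...)."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, errors=errors, newline="")
    return list(csv.reader(text))


# Strict UTF-8 decoding (the default) fails on the bad byte.
try:
    read_rows(RAW, "utf8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Case 2: latin1 never fails, but multi-byte UTF-8 characters come out as garbage.
latin1_rows = read_rows(RAW, "latin1")

# Case 3: keep UTF-8 and choose what happens to the offending bytes.
ignored_rows = read_rows(RAW, "utf8", errors="ignore")
escaped_rows = read_rows(RAW, "utf8", errors="backslashreplace")

print(strict_failed)
print(latin1_rows[1], latin1_rows[2])
print(ignored_rows[2])
print(escaped_rows[2])
```

    With latin1 every row loads, but the valid UTF-8 row is mangled into two characters per accented letter. With 'ignore' the bad byte silently disappears, while 'backslashreplace' leaves a visible \xe9 marker that you can search for and repair later. If your pandas is 1.3 or newer, read_csv should also accept an encoding_errors argument taking the same values, which would make the explicit open call unnecessary; check the documentation for your installed version.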
      
