UnicodeDecodeError when reading CSV file in Pandas with Python

野趣味 2020-11-22 04:27

I'm running a program that processes 30,000 similar files. A random number of them stop and produce this error...

File "C:\\Importer\\src         


        
21 answers
  •  挽巷 (original poster)
    2020-11-22 05:09

    I am posting an answer to provide an updated solution and an explanation of why this problem can occur. Say you are getting this data from a database or an Excel workbook. If the data contains special characters, as in La Cañada Flintridge city, then unless you export it using UTF-8 encoding you are going to introduce errors: La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you then use pandas.read_csv without any adjustments to the default parameters, you'll hit the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte
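
    The failure can be reproduced without any file. The sketch below assumes the export was written as Latin-1 (ISO-8859-1), which the original post does not state; in that encoding ñ is the single byte 0xF1, and that byte followed by a plain ASCII letter is not valid UTF-8.

```python
# The text as it was meant to be stored
text = "La Cañada Flintridge city"

# Assumption: the exporter wrote Latin-1, where ñ is the single byte 0xF1
raw = text.encode("latin-1")

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte
    message = str(exc)
    bad_offset = exc.start

print(message)

# Decoding with the encoding that was actually used recovers the text
recovered = raw.decode("latin-1")
```

    The offending byte sits at offset 5 because "La Ca" occupies offsets 0 through 4.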
    

    Fortunately, there are a few solutions.

    Option 1: fix the export. Be sure to use UTF-8 encoding.

    Option 2: if fixing the export is not available to you and you need to use pandas.read_csv, be sure to include the parameter engine='python'. By default, pandas uses engine='c', which is great for reading large clean files but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError (UTF-8 is already the default). Also, you do not need to use error_bad_lines (replaced by on_bad_lines in pandas 1.3 and later); however, that is still an option if you REALLY need it.

    pd.read_csv(file_path, engine='python')  # file_path: placeholder for the path to your CSV file
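
    If the encoding of the export is unknown, one way to find a usable value for the encoding parameter of pandas.read_csv is to try a few candidates on the raw bytes. This sketch is not part of the original answer; the function name guess_encoding and the candidate list are assumptions chosen for illustration. Note that latin-1 accepts every possible byte, so it only works as a last resort and can silently mislabel characters.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return name
    return None  # none of the candidates could decode the bytes


# Sample bytes containing the 0xF1 byte from the error message
sample = b"city\r\nLa Ca\xf1ada Flintridge city\r\n"
encoding = guess_encoding(sample)
print(encoding)
```

    In practice you would read a chunk of the file in binary mode, pass it to guess_encoding, and hand the result to pandas.read_csv through its encoding parameter.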
    

    Option 3: my personal preference. Read the file using vanilla Python.

    import pandas as pd
    
    file_path = "data.csv"  # placeholder: replace with the path to your CSV file
    data = []
    
    with open(file_path, "rb") as myfile:
        # read the header separately:
        # decode it as 'utf-8', strip the line ending, and split it on the comma (or other delimiter)
        header = myfile.readline().decode('utf-8').rstrip('\r\n').split(',')
        # read the rest of the data, silently dropping bytes that are not valid UTF-8
        for line in myfile:
            row = line.decode('utf-8', errors='ignore').rstrip('\r\n').split(',')
            data.append(row)
    
    # save the data as a dataframe
    df = pd.DataFrame(data=data, columns=header)
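
    The loop above splits on every comma, so a quoted field that itself contains a comma gets broken into two columns. A variation, not from the original answer, is to let the standard-library csv module do the splitting. The function name parse_csv_bytes is hypothetical, and errors='replace' is an assumed choice that marks undecodable bytes with U+FFFD instead of dropping them.

```python
import csv
import io


def parse_csv_bytes(raw, encoding="utf-8", errors="replace"):
    """Decode raw CSV bytes leniently and return (header, rows)."""
    text = raw.decode(encoding, errors=errors)
    # newline="" leaves line endings untouched so csv.reader can handle them
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    rows = [row for row in reader]
    return header, rows


# Sample with a quoted field containing a comma and the 0xF1 byte
raw = b'city,state\r\n"La Ca\xf1ada Flintridge, city",CA\r\n'
header, rows = parse_csv_bytes(raw)
print(header, rows)
```

    The header and rows can then be passed to pd.DataFrame(data=rows, columns=header) as in the original snippet. If the real encoding is known, passing it as encoding (for example 'latin-1') keeps the characters intact.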
    

    Hope this helps people encountering this issue for the first time.
