I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...
File "C:\Importer\src
I am posting an answer to provide an updated solution and an explanation of why this problem can occur. Say you are getting this data from a database or an Excel workbook. If you have special characters, as in La Cañada Flintridge city, then unless you export the data using UTF-8 encoding you are going to introduce errors: La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you then use pandas.read_csv without any adjustments to the default parameters, you'll hit the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte
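The byte 0xf1 is exactly what ñ becomes under Latin-1 (or Windows-1252) encoding, which is why decoding it as UTF-8 fails. A minimal sketch reproducing the error in plain Python (the sample string is just for illustration):

raw = 'La Cañada Flintridge city'.encode('latin-1')
print(raw)  # b'La Ca\xf1ada Flintridge city' -- ñ became the single byte 0xf1
# decoding those bytes as UTF-8 raises the error above ("position 5" is the ñ)
raw.decode('utf-8')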
Fortunately, there are a few solutions.
Option 1: fix the exporting. Be sure to use UTF-8 encoding.
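For example, if the upstream export happens in pandas (an assumption; the file names here are placeholders), you can request UTF-8 explicitly:

import pandas as pd

df = pd.read_excel('source.xlsx')          # or the result of a database query
df.to_csv('output.csv', encoding='utf-8')  # write the CSV out as UTF-8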
Option 2: if fixing the export is not available to you and you need to use pandas.read_csv, be sure to include the parameter engine='python'. By default, pandas uses engine='c', which is great for reading large clean files but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.
pd.read_csv(file_name, engine='python')
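A side note on pandas versions: in pandas 1.3 and later, error_bad_lines is deprecated in favor of on_bad_lines, so the equivalent call there would look like this (file_name is still a placeholder):

df = pd.read_csv(file_name, engine='python', on_bad_lines='skip')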
Option 3 is my personal preference: read the file using vanilla Python.
import pandas as pd

data = []
with open(file_name, "rb") as myfile:
    # read the header separately:
    # decode it as 'utf-8', strip the line ending, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data, ignoring any bytes that are not valid 'utf-8'
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns=header)
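One caveat of the manual split(','): it breaks on fields that contain quoted commas. If your data has those, the standard-library csv module handles the quoting; a sketch of the same approach (reusing the file_name placeholder):

import csv
import io

import pandas as pd

with open(file_name, "rb") as myfile:
    # decode the whole file, dropping undecodable bytes, then let csv handle the quoting
    text = myfile.read().decode('utf-8', errors='ignore')

rows = list(csv.reader(io.StringIO(text)))
df = pd.DataFrame(data=rows[1:], columns=rows[0])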
Hope this helps people encountering this issue for the first time.