UnicodeDecodeError when reading CSV file in Pandas with Python

后端未结

关注

 21  2523

野趣味 2020-11-22 04:27

I\'m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...

File "C:\\Importer\\src


      
      
        
          21条回答        

        
                    
            
            
                         
                
              
              
                
                   挽巷
                                             
                
                
                (楼主)
            
              
              
                2020-11-22 05:09
              

            
            
                        
I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If  you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte


Fortunately, there are a few solutions. 

Option 1, fix the exporting. Be sure to use UTF-8 encoding. 

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following paramters, engine='python'. By default, pandas uses engine='C' which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use errors_bad_lines, however, that is still an option if you REALLY need it.

pd.read_csv(, engine='python')


Option 3: solution is my preferred solution personally. Read the file using vanilla Python.

import pandas as pd

data = []

with open(, "rb") as myfile:
    # read the header seperately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or deliminator)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)


Hope this helps people encountering this issue for the first time.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它21个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复