Impute categorical missing values in scikit-learn

后端未结
关注
 10  1416
清歌不尽 2020-11-30 16:55
I\'ve got pandas data with some columns of text type. There are some NaN values along with these text columns. What I\'m trying to do is to impute those NaN\'s by skle

      
      
        
          10条回答        

        
                    
            
            
                         
                
              
              
                
                   情歌与酒
                                             
                
                
                (楼主)
            
              
              
                2020-11-30 17:46
              

            
            
                        

strategy = 'most_frequent' can be used only with quantitative feature, not with qualitative.  This custom impuer can be used for both qualitative and quantitative. Also with scikit learn imputer either we can use it for whole data frame(if all features are quantitative) or we can use 'for loop' with list of similar type of features/columns(see the below example). But custom imputer can be used with any combinations.

    from sklearn.preprocessing import Imputer
    impute = Imputer(strategy='mean')
    for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
          xx[cols] = impute.fit_transform(xx[[cols]])

Custom Imputer : 

   from sklearn.preprocessing import Imputer
   from sklearn.base import TransformerMixin

   class CustomImputer(TransformerMixin):
         def __init__(self, cols=None, strategy='mean'):
               self.cols = cols
               self.strategy = strategy

         def transform(self, df):
               X = df.copy()
               impute = Imputer(strategy=self.strategy)
               if self.cols == None:
                      self.cols = list(X.columns)
               for col in self.cols:
                      if X[col].dtype == np.dtype('O') : 
                             X[col].fillna(X[col].value_counts().index[0], inplace=True)
                      else : X[col] = impute.fit_transform(X[[col]])

               return X

         def fit(self, *_):
               return self

Dataframe:

      X = pd.DataFrame({'city':['tokyo', np.NaN, 'london', 'seattle', 'san 
                                 francisco', 'tokyo'], 
          'boolean':['yes', 'no', np.NaN, 'no', 'no', 'yes'], 
          'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like', 
                            'somewhat like', 'dislike'], 
          'quantitative_column':[1, 11, -.5, 10, np.NaN, 20]})


            city              boolean   ordinal_column  quantitative_column
        0   tokyo             yes       somewhat like   1.0
        1   NaN               no        like            11.0
        2   london            NaN       somewhat like   -0.5
        3   seattle           no        like            10.0
        4   san francisco     no        somewhat like   NaN
        5   tokyo             yes       dislike         20.0

1) Can be used with list of similar type of features.

 cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
 cci.fit_transform(X)

can be used with strategy = median

 sd = CustomImputer(['quantitative_column'], strategy = 'median')
 sd.fit_transform(X)

3) Can be used with whole data frame, it will use default mean(or we can also change it with median. for qualitative features it uses strategy = 'most_frequent' and for quantitative mean/median.

 call = CustomImputer()
 call.fit_transform(X)   


    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它10个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复