How to make pipeline for multiple dataframe columns?

前端未结

关注

 3  1503

日久生厌 2020-12-21 08:29

I have Dataframe which can be simplified to this:

import pandas as pd

df = pd.DataFrame([{
\'title\': \'batman\',
\'text\': \'man bat man bat\', 
\'url\': \


      
      
        
          3条回答        

        
                    
            
            
                         
                
              
              
                
                   醉话见心
                                             
                
                
                (楼主)
            
              
              
                2020-12-21 09:08
              

            
            
                        
@elphz answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First off I would say you need to define your FunctionTransformer functions such that they can handle and return your input data properly. In this case I assume you just want to pass the DataFrame, but ensure that you get back a properly shaped array for use downstream. Therefore I would propose passing just the DataFrame and accessing by column name. Like so:

def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])


Now, to test the variations of transformers and classifiers. I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a gridsearch.

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

best_clf = None
best_score = 0
for tran1 in transformers:
    for tran2 in transformers:
        pipe1 = Pipeline(pipe_text.steps + [tran1])
        pipe2 = Pipeline(pipe_title.steps + [tran2])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X = union.fit_transform(df)
        X_train, X_test, y_train, y_test = train_test_split(X, df.label)
        for clf in clfs:
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            if score > best_score:
                best_score = score
                best_est = clf


This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它3个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复