Intersection of two or more DataFrame columns

后端未结

I am trying to find the intersection of the columns of three DataFrames, but np.intersect1d does not accept three of them.

import numpy as np
imp


                      
3 Answers

予麋鹿
2020-12-18 12:43

inclusive_list = np.intersect1d(np.intersect1d(df1.columns, df2.columns), df3.columns)


Note that the arguments passed to np.intersect1d (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.intersect1d.html) are expected to be two arrays (ar1 and ar2).

Passing three arrays means that the third one is bound to the assume_unique parameter, which is expected to be a bool, not an array.

If you don't want to use numpy, you can also use Python's built-in set methods:

inclusive_list = set(df1.columns).intersection(set(df2.columns)).intersection(set(df3.columns))
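
The chained set calls generalize to any number of DataFrames, because set.intersection accepts several iterables at once. A minimal sketch, assuming three small made-up frames (df1, df2 and df3 below are illustrative, not the asker's data). A set has no order, so the result is sorted to get a stable list:

```python
import pandas as pd

# Hypothetical frames for illustration; only the column names matter here.
df1 = pd.DataFrame(columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(columns=["C", "D", "E"])
df3 = pd.DataFrame(columns=["B", "C", "D", "F"])

frames = [df1, df2, df3]

# set.intersection takes any number of iterables, so no chaining is needed.
common = set(frames[0].columns).intersection(*(f.columns for f in frames[1:]))

# Sets are unordered; sort to get a reproducible list of column names.
inclusive_list = sorted(common)
print(inclusive_list)
```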

                                                                        
                                                        
            
            
              
                
梦谈多话
2020-12-18 12:46

Why your current approach doesn't work:

intersect1d does not take N arrays; it only compares two.


  numpy.intersect1d(ar1, ar2, assume_unique=False, return_indices=False)


You can see from the definition that you are passing the third array as the assume_unique parameter, and since you are treating an array like a single boolean, you receive a ValueError. 
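
The failure can be reproduced without pandas. This is a small sketch, assuming three made-up object arrays (a, b, c are illustrative) and a NumPy version where intersect1d tests assume_unique with a plain truth check, which is the behaviour described above:

```python
import numpy as np

# Hypothetical column-name arrays for illustration.
a = np.array(["A", "B", "C", "D"], dtype=object)
b = np.array(["C", "D", "E"], dtype=object)
c = np.array(["B", "C", "D", "F"], dtype=object)

try:
    # The third positional argument is bound to assume_unique,
    # not treated as a third array to intersect.
    np.intersect1d(a, b, c)
    error_message = None
except ValueError as exc:
    # A multi-element array has no single truth value, hence the error.
    error_message = str(exc)

print(error_message)
```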



You can extend the functionality of intersect1d to work on N arrays using functools.reduce:

from functools import reduce
reduce(np.intersect1d, (df1.columns, df2.columns, df3.columns))




array(['C', 'D'], dtype=object)
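
For a self-contained run, the reduce call can be wrapped in a small helper. This is only a sketch: common_columns is a hypothetical name, and the three frames are made up so that they share the columns C and D:

```python
from functools import reduce

import numpy as np
import pandas as pd


def common_columns(*frames):
    """Return the column names shared by all frames, as a sorted list."""
    # reduce folds the two-argument intersect1d over all column indexes:
    # intersect1d(intersect1d(cols1, cols2), cols3), and so on.
    return list(reduce(np.intersect1d, (f.columns for f in frames)))


# Hypothetical frames for illustration.
df1 = pd.DataFrame(columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(columns=["C", "D", "E"])
df3 = pd.DataFrame(columns=["B", "C", "D", "F"])

shared = common_columns(df1, df2, df3)
print(shared)
```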




A better approach

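However, the easiest approach is to call intersection on the Index objects directly. Older pandas versions also accepted df1.columns & df2.columns & df3.columns as a set intersection, but since pandas 2.0 the & operator on an Index is an element-wise logical operation, so the explicit method is the safe choice:

df1.columns.intersection(df2.columns).intersection(df3.columns)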




Index(['C', 'D'], dtype='object')
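
With more than a handful of frames, Index.intersection can be folded over all of them with functools.reduce instead of being chained by hand. A minimal sketch, again assuming made-up frames that share the columns C and D:

```python
from functools import reduce

import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame(columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(columns=["C", "D", "E"])
df3 = pd.DataFrame(columns=["B", "C", "D", "F"])

frames = [df1, df2, df3]

# pd.Index.intersection is a two-argument method (self, other),
# so reduce can apply it pairwise across every column index.
common = reduce(pd.Index.intersection, (f.columns for f in frames))
print(list(common))
```

The result stays a pandas Index, so it can be used directly for selection, for example df1[common].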

                                                                        
                                                        
            
            
              
                
独厮守ぢ
2020-12-18 13:09

You can use concat with join='inner', which keeps only the columns present in every frame:

pd.concat([df1.head(1),df2.head(1),df3.head(1)],join='inner').columns
Out[81]: Index(['C', 'D'], dtype='object')
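
The head(1) calls keep the concatenation cheap, since only one row per frame is combined and only the resulting column index is used. A runnable sketch with made-up data (the frames and values below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames for illustration; the values are arbitrary.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})
df2 = pd.DataFrame({"C": [1, 2], "D": [3, 4], "E": [5, 6]})
df3 = pd.DataFrame({"B": [1, 2], "C": [3, 4], "D": [5, 6], "F": [7, 8]})

# join="inner" drops every column that is missing from at least one frame.
shared = pd.concat(
    [df1.head(1), df2.head(1), df3.head(1)],
    join="inner",
).columns
print(list(shared))
```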

                                                                        
                                                        
            
            
              
                