How to subtract rows of one pandas data frame from another?

执笔经年 2020-12-10 11:26

The operation I want is similar to a merge. For example, an inner merge gives a data frame containing the rows that are present in the first AN

4 Answers
  •  感情败类
    2020-12-10 12:22

    Checking a non-key column for NaN after a left merge can give wrong results if that column already contains NaN.

    print(df1)
    
          Team  Year  foo
    0    Hawks  2001    5
    1    Hawks  2004    4
    2     Nets  1987    3
    3     Nets  1988    6
    4     Nets  2001    8
    5     Nets  2000   10
    6     Heat  2004    6
    7   Pacers  2003   12
    8  Problem  2112  NaN
    
    
    print(df2)
    
          Team  Year  foo
    0   Pacers  2003   12
    1     Heat  2004    6
    2     Nets  1988    6
    3  Problem  2112  NaN
    
    # Left join on the key columns, then keep rows where the right-hand foo is missing
    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.foo_y.isnull()])
    
          Team  Year  foo_x  foo_y
    0    Hawks  2001      5    NaN
    1    Hawks  2004      4    NaN
    2     Nets  1987      3    NaN
    4     Nets  2001      8    NaN
    5     Nets  2000     10    NaN
    8  Problem  2112    NaN    NaN
    

    The Problem team in 2112 has no value for foo in either table. So the null check on foo_y falsely reports that row as missing from the right DataFrame, even though its keys match in both DataFrames.
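    A minimal, self-contained sketch that reproduces this false positive. The frames are rebuilt from the printed tables above; the names `merged` and `flagged` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the two frames shown in the answer.
df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets",
             "Heat", "Pacers", "Problem"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003, 2112],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets", "Problem"],
    "Year": [2003, 2004, 1988, 2112],
    "foo": [12, 6, 6, np.nan],
})

# Left join on the keys, then treat "foo_y is NaN" as "not in df2".
merged = df1.merge(df2, on=["Team", "Year"], how="left")
flagged = merged[merged["foo_y"].isnull()]

# The Problem/2112 row is flagged although its keys exist in df2,
# because its foo value is NaN in df2 as well.
print(flagged)
```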

    Solution:

    What I do is add a marker column to the right DataFrame (df2) and set it to the same value for every row. After the left join, the rows where that marker is NaN are exactly the rows of the left DataFrame (df1) that have no match in df2.
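    The same idea could be wrapped in a small helper that leaves df2 unmodified. This is only a sketch: the function name `rows_only_in_left` and the marker name `_in_right` are hypothetical, and it assumes the marker name does not collide with an existing column.

```python
import numpy as np
import pandas as pd

def rows_only_in_left(left, right, keys, marker="_in_right"):
    """Return rows of `left` whose key columns have no match in `right`."""
    # Keep only the key columns of `right`, drop duplicate keys so the left
    # join cannot multiply rows, and tag every remaining row with a marker.
    tagged = right[keys].drop_duplicates().assign(**{marker: True})
    merged = left.merge(tagged, on=keys, how="left")
    # A missing marker means the key combination was absent from `right`.
    return merged[merged[marker].isnull()].drop(columns=marker)

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets",
             "Heat", "Pacers", "Problem"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003, 2112],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets", "Problem"],
    "Year": [2003, 2004, 1988, 2112],
    "foo": [12, 6, 6, np.nan],
})

only_df1 = rows_only_in_left(df1, df2, keys=["Team", "Year"])
print(only_df1)
```

    Because only the key columns of df2 take part in the join, the result keeps df1's original column names instead of foo_x and foo_y.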

    # Mark every row of df2 so matched rows can be recognised after the merge
    df2['in_df2'] = 'yes'
    
    print(df2)
    
          Team  Year  foo  in_df2
    0   Pacers  2003   12     yes
    1     Heat  2004    6     yes
    2     Nets  1988    6     yes
    3  Problem  2112  NaN     yes
    
    
    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.in_df2.isnull()])
    
          Team  Year  foo_x  foo_y  in_df2
    0    Hawks  2001      5    NaN     NaN
    1    Hawks  2004      4    NaN     NaN
    2     Nets  1987      3    NaN     NaN
    4     Nets  2001      8    NaN     NaN
    5     Nets  2000     10    NaN     NaN
    

    NB. The problem row is now correctly filtered out, because it has a value for in_df2.

       Problem  2112    NaN    NaN     yes
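    As an alternative to a hand-made marker column (not what the answer above uses), pandas' `merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. A minimal sketch, assuming the same two frames; the name `only_in_df1` is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets",
             "Heat", "Pacers", "Problem"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003, 2112],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets", "Problem"],
    "Year": [2003, 2004, 1988, 2112],
    "foo": [12, 6, 6, np.nan],
})

# Join on the key columns only; `_merge` records where each row was found.
merged = df1.merge(df2[["Team", "Year"]], on=["Team", "Year"],
                   how="left", indicator=True)

# Rows marked "left_only" exist in df1 but have no key match in df2.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

    The check depends only on whether the keys matched, so NaN values in foo do not affect it.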
    
