How to subtract rows of one pandas data frame from another?

执笔经年 2020-12-10 11:26

The operation that I want to do is similar to a merge. For example, with an inner merge we get a data frame that contains rows that are present in the first AN

4 answers
  • 2020-12-10 12:05

    How about something like the following?

    print(df1)
    
        Team  Year  foo
    0   Hawks  2001    5
    1   Hawks  2004    4
    2    Nets  1987    3
    3    Nets  1988    6
    4    Nets  2001    8
    5    Nets  2000   10
    6    Heat  2004    6
    7  Pacers  2003   12
    
    print(df2)
    
        Team  Year  foo
    0  Pacers  2003   12
    1    Heat  2004    6
    2    Nets  1988    6
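
    To make the snippets in this answer easy to try, here is a minimal sketch that rebuilds the two frames printed above. The values are copied from the printed tables; the construction via a dict of lists is just one convenient way to do it.

```python
import pandas as pd

# Rebuild the two example frames with the values shown in the printed tables.
df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

print(df1)
print(df2)
```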
    

    As long as there is a non-key column with the same name in both frames, you can let the automatically added suffixes do the work (if there is no such column, you can create one temporarily, e.g. df1['common'] = 1 and df2['common'] = 1):

    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.foo_y.isnull()])
    
         Team  Year  foo_x  foo_y
    0  Hawks  2001      5    NaN
    1  Hawks  2004      4    NaN
    2   Nets  1987      3    NaN
    4   Nets  2001      8    NaN
    5   Nets  2000     10    NaN
    

    Or you can use isin, but you would have to create a single key first:

    df1['key'] = df1['Team'] + df1['Year'].astype(str)
    df2['key'] = df2['Team'] + df2['Year'].astype(str)
    print(df1[~df1.key.isin(df2.key)])
    
        Team  Year  foo        key
    0  Hawks  2001    5  Hawks2001
    1  Hawks  2004    4  Hawks2004
    2   Nets  1987    3   Nets1987
    4   Nets  2001    8   Nets2001
    5   Nets  2000   10   Nets2000
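
    A concatenated string key can collide in principle (for example, 'A1' + '2' and 'A' + '12' both give 'A12'). One possible alternative, sketched here, compares the (Team, Year) pairs as tuples through a MultiIndex, so no extra column is needed. The variable names keys, left_keys and right_keys are introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

keys = ["Team", "Year"]  # illustrative name for the list of key columns

# Build one MultiIndex per frame so each row is represented by a (Team, Year) tuple.
left_keys = pd.MultiIndex.from_frame(df1[keys])
right_keys = pd.MultiIndex.from_frame(df2[keys])

# isin returns a boolean array aligned with df1's rows; negate it to keep
# only the rows whose key pair does not occur in df2.
only_in_df1 = df1[~left_keys.isin(right_keys)]
print(only_in_df1)
```

    With the example data this keeps the rows with index 0, 1, 2, 4 and 5, the same rows as the merge-based approach.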
    
  • 2020-12-10 12:10

    Consider the following:

    1. df_one is the first DataFrame
    2. df_two is the second DataFrame

    To get the rows present in the first DataFrame and not in the second, filter by index:

    df = df_one[~df_one.index.isin(df_two.index)]

    Here the index is used as the reference between the two DataFrames; it can be replaced by whichever column you want to base the exclusion on.

    You can also build a more complex condition by combining boolean pandas.Series.
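
    As a small sketch of the variants mentioned above, the following uses two made-up frames (the id and value columns are assumptions for illustration, not data from the question) and shows exclusion by a column, by the index, and with a combined boolean mask.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the filtering.
df_one = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
df_two = pd.DataFrame({"id": [2, 4], "value": ["b", "d"]})

# Exclusion by a column: keep rows of df_one whose id does not occur in df_two.
by_column = df_one[~df_one["id"].isin(df_two["id"])]

# Exclusion by index: the same idea after making id the index of both frames.
one_indexed = df_one.set_index("id")
two_indexed = df_two.set_index("id")
by_index = one_indexed[~one_indexed.index.isin(two_indexed.index)]

# Combined condition: boolean Series can be joined with & and |.
mask = ~df_one["id"].isin(df_two["id"]) & (df_one["value"] != "a")
by_mask = df_one[mask]

print(by_column)
print(by_index)
print(by_mask)
```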

  • 2020-12-10 12:13

    I suggest using the 'indicator' parameter of merge. Also, if 'on' is None, merge defaults to joining on the intersection of the columns in both DataFrames.

    new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
    new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
    new = new.drop(columns='_merge').copy()
    
        Team    Year    foo
    0   Hawks   2001    5
    1   Hawks   2004    4
    2   Nets    1987    3
    4   Nets    2001    8
    5   Nets    2000    10
    

    Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

    indicator : boolean or string, default False
    
    If True, adds a column to output DataFrame called “_merge” with information on the source of each row. 
    Information column is Categorical-type and takes on a value of 
    “left_only” for observations whose merge key only appears in ‘left’ DataFrame,
    “right_only” for observations whose merge key only appears in ‘right’ DataFrame, 
    and “both” if the observation’s merge key is found in both.
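
    If this pattern is needed in several places, it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name anti_join and its parameters are hypothetical, not part of pandas. It merges only on the key columns of the right frame, so no suffixed columns appear and NaN values in non-key columns play no role.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` key has no match in `right`."""
    # Only the key columns of `right` are needed; dropping duplicate keys
    # prevents the merge from repeating rows of `left`.
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how="left", indicator=True)
    # Keep rows found only in `left`, then drop the helper column.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

result = anti_join(df1, df2, on=["Team", "Year"])
print(result)
```

    Note that merge returns a fresh integer index, so the row labels of the result match those of df1 only when df1 has a default RangeIndex, as it does here.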
    
  • 2020-12-10 12:22

    Checking a non-key column for null after the left merge can give wrong results if that column itself contains NaN.

    print(df1)
    
        Team   Year  foo
    0   Hawks  2001    5
    1   Hawks  2004    4
    2    Nets  1987    3
    3    Nets  1988    6
    4    Nets  2001    8
    5    Nets  2000   10
    6    Heat  2004    6
    7  Pacers  2003   12
    8 Problem  2112  NaN
    
    
    print(df2)
    
         Team  Year  foo
    0  Pacers  2003   12
    1    Heat  2004    6
    2    Nets  1988    6
    3 Problem  2112  NaN
    
    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.foo_y.isnull()])
    
          Team  Year  foo_x  foo_y
    0    Hawks  2001      5    NaN
    1    Hawks  2004      4    NaN
    2     Nets  1987      3    NaN
    4     Nets  2001      8    NaN
    5     Nets  2000     10    NaN
    8  Problem  2112    NaN    NaN
    

    The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.

    Solution:

    What I do is add a marker column to the right DataFrame (df2) and set a value for all of its rows. After the left join, that column is NaN exactly for the rows of the left DataFrame (df1) that have no match in df2.

    df2['in_df2'] = 'yes'
    
    print(df2)
    
         Team  Year  foo  in_df2
    0  Pacers  2003   12     yes
    1    Heat  2004    6     yes
    2    Nets  1988    6     yes
    3 Problem  2112  NaN     yes
    
    
    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.in_df2.isnull()])
    
        Team  Year  foo_x  foo_y in_df2
    0  Hawks  2001      5    NaN    NaN
    1  Hawks  2004      4    NaN    NaN
    2   Nets  1987      3    NaN    NaN
    4   Nets  2001      8    NaN    NaN
    5   Nets  2000     10    NaN    NaN
    

    NB. The problem row is now correctly filtered out, because it has a value for in_df2:

      Problem  2112    NaN    NaN     yes
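
    The difference between the two checks can be reproduced with a shortened version of the data (three rows instead of nine, an assumption made to keep the sketch small). It also uses df2.assign(...) rather than modifying df2 in place, which is a design choice of this sketch, not of the answer above.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative subset of the frames shown above.
df1 = pd.DataFrame({
    "Team": ["Hawks", "Nets", "Problem"],
    "Year": [2001, 1988, 2112],
    "foo": [5, 6, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Nets", "Problem"],
    "Year": [1988, 2112],
    "foo": [6, np.nan],
})
keys = ["Team", "Year"]

# Naive check: a NaN in foo_y is read as "no match", so the Problem row,
# which does match but has NaN for foo, is wrongly kept.
naive = df1.merge(df2, on=keys, how="left")
naive = naive[naive["foo_y"].isnull()]

# Marker column: every df2 row carries a non-null flag, so a NaN in in_df2
# can only mean that the key was absent from df2.
flagged = df1.merge(df2.assign(in_df2="yes"), on=keys, how="left")
fixed = flagged[flagged["in_df2"].isnull()]

print(naive["Team"].tolist())  # ['Hawks', 'Problem']
print(fixed["Team"].tolist())  # ['Hawks']
```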
    