pandas left join and update existing column

忘掉有多难 2020-12-01 16:16

I am new to pandas and can't seem to get this to work with the merge function:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4            


        
5 Answers
  •  我在风中等你
    2020-12-01 16:19

    DataFrame.update() is nice, but it doesn't let you specify columns to join on (it aligns on the index). More importantly, if the other DataFrame has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.
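
    To make the NaN behavior concrete, here is a minimal sketch. The frames `original` and `other` are made up for illustration and are aligned on the default integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration; both use the default integer index.
original = pd.DataFrame({"a": [1, 2, 3], "c": [10.0, 20.0, 30.0]})
other = pd.DataFrame({"c": [np.nan, 99.0, np.nan]})

# update() works in place and aligns rows on the index, not on a key column.
original.update(other)

# Only the row where `other` holds a non-NaN value has changed;
# the NaN entries in `other` left 10.0 and 30.0 untouched.
print(original["c"].tolist())  # [10.0, 99.0, 30.0]
```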

    Here's a custom method I rolled to fix these issues. It's freshly written, so users beware.

    join_insertion()

    import numpy as np
    import pandas as pd

    def join_insertion(into_df, from_df, on, cols, mult='error'):
        """
        Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
        This function allows you to do an operation like:
        "where A and B match via the column foo, insert the values of baz and buz from B into A"
        Note that this'll update A's values for baz and it'll insert buz as a new column.
        This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
    
        :param into_df: dataframe you want to modify
        :param from_df: dataframe with the values you want to insert
        :param on: column name or list of column names to join on, or a dict of {into:from} column name pairs
        :param cols: column name or list of column names (values to insert)
        :param mult: how to handle a key of into_df that matches multiple rows of from_df:
        'error' raises an exception, 'first' inserts the first matching value, and 'last'
        inserts the last matching value
        :return: a modified copy of into_df, with updated values using from_df
        """
    
        # Infer left_on, right_on
        if isinstance(on, dict):
            left_on = list(on.keys())
            right_on = list(on.values())
        elif isinstance(on, list):
            left_on = on
            right_on = on
        elif isinstance(on, str):
            left_on = [on]
            right_on = [on]
        else:
            raise TypeError("on should be a string, list, or dictionary")
    
        # Make cols a list if it isn't already
        if isinstance(cols, str):
            cols = [cols]
    
        # Setup
        A = into_df.copy()
        B = from_df[right_on + cols].copy()
    
        # Insert row ids
        A['_A_RowId_'] = np.arange(A.shape[0])
        B['_B_RowId_'] = np.arange(B.shape[0])
    
        A = pd.merge(
            left=A,
            right=B,
            how='left',
            left_on=left_on,
            right_on=right_on,
            suffixes=(None, '_y'),
            indicator=True
        ).sort_values(['_A_RowId_', '_B_RowId_'])
    
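        # Check for rows of A which got duplicated by the merge, and then handle appropriately.
        # drop_duplicates keeps whole rows; groupby().first()/.last() would instead pick the
        # first/last non-null value per column and could mix values from different matches.
        if mult == 'error':
            if A.groupby('_A_RowId_').size().max() > 1:
                raise Exception("At least one key of into_df matched multiple rows of from_df.")
        elif mult in ('first', 'last'):
            A = A.drop_duplicates(subset='_A_RowId_', keep=mult).reset_index(drop=True)
        else:
            raise ValueError("mult should be 'error', 'first', or 'last'")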
    
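        # On matched rows, overwrite shared columns with from_df's values (NaN included).
        # where() upcasts the column if needed, e.g. int -> float when NaN is inserted.
        mask = A['_merge'] == 'both'
        cols_in_both = [col for col in cols if col in into_df.columns]
        for col in cols_in_both:
            A[col] = A[col].where(~mask, A[col + '_y'])
    
        # Drop helper columns (row ids, _merge, '_y' copies, join keys that exist only in from_df)
        keep = set(into_df.columns) | set(cols)
        A = A.drop(columns=[col for col in A.columns if col not in keep])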
    
        return A
    

    Example Use

    into_df = pd.DataFrame({
        'foo': [1, 2, 3],
        'bar': [4, 5, 6],
        'baz': [7, 8, 9]
    })
       foo  bar  baz
    0    1    4    7
    1    2    5    8
    2    3    6    9
    
    from_df = pd.DataFrame({
        'foo': [1, 3, 5, 7, 3],
        'baz': [70, 80, 90, 30, 40],
        'buz': [0, 1, 2, 3, 4]
    })
       foo  baz  buz
    0    1   70    0
    1    3   80    1
    2    5   90    2
    3    7   30    3
    4    3   40    4
    
    # Use it!
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
      Exception: At least one key of into_df matched multiple rows of from_df.
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
       foo  bar   baz  buz
    0    1    4  70.0  0.0
    1    2    5   8.0  NaN
    2    3    6  80.0  1.0
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
       foo  bar   baz  buz
    0    1    4  70.0  0.0
    1    2    5   8.0  NaN
    2    3    6  40.0  4.0
    

    As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
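
    For the simple case where each key appears at most once in the right-hand frame, a rough pandas counterpart of that one-liner might look like the sketch below. `update_from` is a hypothetical helper name, and it assumes `col` already exists in both frames and that `on` is a single column name.

```python
import pandas as pd

def update_from(left, right, on, col):
    """Hypothetical helper: copy `col` from `right` into `left` where `on` matches.

    Assumes `col` exists in both frames and `on` is a single column name.
    """
    # validate="m:1" makes merge raise if `right` has duplicate keys,
    # so the merged frame keeps exactly one row per row of `left`.
    merged = left.merge(
        right[[on, col]],
        on=on,
        how="left",
        suffixes=("", "_new"),
        indicator=True,
        validate="m:1",
    )
    matched = merged["_merge"] == "both"

    out = left.copy()
    # Keep the old value on unmatched rows; take the right-hand value
    # (NaN included) on matched rows. Assigned positionally via to_numpy().
    out[col] = merged[col].where(~matched, merged[col + "_new"]).to_numpy()
    return out

left = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9]})
right = pd.DataFrame({"a": [1, 3], "c": [70, 80]})
result = update_from(left, right, on="a", col="c")
print(result)
```

    Unlike data.table's `:=`, this returns a modified copy instead of updating the frame by reference.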
