How to extract rows in a dataframe that do not exist in another

后端 未结 3 849
忘掉有多难
忘掉有多难 2020-12-11 22:14

I have two dataframes:

all_data:

              AID       VID  Freq
0        00016A3E  0127C661     1
1        00016A3E  0C05DA5D     2
2        0001         


        
相关标签:
3条回答
  • 2020-12-11 22:35

    This worked for me:

    merged = all_data.merge(subset, how='left', indicator=True)
    print(merged[merged['_merge']=='left_only'])
    

    After the merge, you can filter by the _merge with left_only value.

    0 讨论(0)
  • 2020-12-11 22:57

    If you don't have duplicates in all_data, here is a way without merge but with drop_duplicates. keep=False argument drops all duplicates rows:

    pd.concat([all_data, subset]).drop_duplicates(keep=False)
    Out[23]: 
             AID       VID  Freq
    0   00016A3E  0127C661     1
    2   00016A3E  0C032814     1
    3   00016A3E  0BF6C78D     1
    4   00016A3E  0A79DFF1     1
    5   00016A3E  07BD2FB2     1
    6   00016A3E  0790E61B     1
    8   00016A3E  073630B5     3
    9   00016A3E  06613535     1
    10  00016A3E  05F809AF     1
    11  00016A3E  05C625FF     1
    12  00016A3E  04220EA8     4
    13  00016A3E  013A29E5     1
    14  00016A3E  0761C98A     1
    15  00016AE9  0A769475    16
    16  00016AE9  0A7DED0A     2
    17  00016AE9  0ABF60DF     9
    18  00016AE9  0AE3F25A     2
    20  00016AE9  0BD8975A     2
    21  00016AE9  44DF880B     1
    22  00016AE9  43F9E08E     2
    23  00016AE9  44EA5E08     2
    24  00016AE9  4539ED1E    16
    25  00016AE9  8516B55A     4
    26  00016AE9  0972AFF2     1
    27  00016AE9  0C559B34     1
    28  00016AE9  06B5C040     7
    
    %timeit pd.concat([all_data, subset]).drop_duplicates(keep=False)
    1000 loops, best of 3: 1.53 ms per loop
    
    %%timeit
    pd.merge(all_data, subset, how='left', indicator=True) \
               .query("_merge == 'left_only'") \
               .drop('_merge',1)
    100 loops, best of 3: 8.58 ms per loop
    
    0 讨论(0)
  • 2020-12-11 23:01

    you can use merge(..., how='left', indicator=True) together with query() method:

    In [38]: pd.merge(all_data, subset, how='left', indicator=True) \
               .query("_merge == 'left_only'") \
               .drop('_merge',1)
    Out[38]:
             AID       VID  Freq
    0   00016A3E  0127C661     1
    2   00016A3E  0C032814     1
    3   00016A3E  0BF6C78D     1
    4   00016A3E  0A79DFF1     1
    5   00016A3E  07BD2FB2     1
    6   00016A3E  0790E61B     1
    8   00016A3E  073630B5     3
    9   00016A3E  06613535     1
    10  00016A3E  05F809AF     1
    11  00016A3E  05C625FF     1
    12  00016A3E  04220EA8     4
    13  00016A3E  013A29E5     1
    14  00016A3E  0761C98A     1
    15  00016AE9  0A769475    16
    16  00016AE9  0A7DED0A     2
    17  00016AE9  0ABF60DF     9
    18  00016AE9  0AE3F25A     2
    20  00016AE9  0BD8975A     2
    21  00016AE9  44DF880B     1
    22  00016AE9  43F9E08E     2
    23  00016AE9  44EA5E08     2
    24  00016AE9  4539ED1E    16
    25  00016AE9  8516B55A     4
    26  00016AE9  0972AFF2     1
    27  00016AE9  0C559B34     1
    28  00016AE9  06B5C040     7
    

    Step by step:

    In [39]: pd.merge(all_data, subset, how='left', indicator=True)
    Out[39]:
             AID       VID  Freq     _merge
    0   00016A3E  0127C661     1  left_only
    1   00016A3E  0C05DA5D     2       both
    2   00016A3E  0C032814     1  left_only
    3   00016A3E  0BF6C78D     1  left_only
    4   00016A3E  0A79DFF1     1  left_only
    5   00016A3E  07BD2FB2     1  left_only
    6   00016A3E  0790E61B     1  left_only
    7   00016A3E  0C24ED25     3       both
    8   00016A3E  073630B5     3  left_only
    9   00016A3E  06613535     1  left_only
    10  00016A3E  05F809AF     1  left_only
    11  00016A3E  05C625FF     1  left_only
    12  00016A3E  04220EA8     4  left_only
    13  00016A3E  013A29E5     1  left_only
    14  00016A3E  0761C98A     1  left_only
    15  00016AE9  0A769475    16  left_only
    16  00016AE9  0A7DED0A     2  left_only
    17  00016AE9  0ABF60DF     9  left_only
    18  00016AE9  0AE3F25A     2  left_only
    19  00016AE9  0AEFE12F     5       both
    20  00016AE9  0BD8975A     2  left_only
    21  00016AE9  44DF880B     1  left_only
    22  00016AE9  43F9E08E     2  left_only
    23  00016AE9  44EA5E08     2  left_only
    24  00016AE9  4539ED1E    16  left_only
    25  00016AE9  8516B55A     4  left_only
    26  00016AE9  0972AFF2     1  left_only
    27  00016AE9  0C559B34     1  left_only
    28  00016AE9  06B5C040     7  left_only
    29  00016AE9  0B0426FA     1       both
    
    In [40]: pd.merge(all_data, subset, how='left', indicator=True).query("_merge == 'left_only'")
    Out[40]:
             AID       VID  Freq     _merge
    0   00016A3E  0127C661     1  left_only
    2   00016A3E  0C032814     1  left_only
    3   00016A3E  0BF6C78D     1  left_only
    4   00016A3E  0A79DFF1     1  left_only
    5   00016A3E  07BD2FB2     1  left_only
    6   00016A3E  0790E61B     1  left_only
    8   00016A3E  073630B5     3  left_only
    9   00016A3E  06613535     1  left_only
    10  00016A3E  05F809AF     1  left_only
    11  00016A3E  05C625FF     1  left_only
    12  00016A3E  04220EA8     4  left_only
    13  00016A3E  013A29E5     1  left_only
    14  00016A3E  0761C98A     1  left_only
    15  00016AE9  0A769475    16  left_only
    16  00016AE9  0A7DED0A     2  left_only
    17  00016AE9  0ABF60DF     9  left_only
    18  00016AE9  0AE3F25A     2  left_only
    20  00016AE9  0BD8975A     2  left_only
    21  00016AE9  44DF880B     1  left_only
    22  00016AE9  43F9E08E     2  left_only
    23  00016AE9  44EA5E08     2  left_only
    24  00016AE9  4539ED1E    16  left_only
    25  00016AE9  8516B55A     4  left_only
    26  00016AE9  0972AFF2     1  left_only
    27  00016AE9  0C559B34     1  left_only
    28  00016AE9  06B5C040     7  left_only
    
    0 讨论(0)
提交回复
热议问题