Pandas Left Outer Join results in table larger than left table

长情又很酷 2020-11-29 04:32

From what I understand about a left outer join, the resulting table should never have more rows than the left table... Please let me know if this is wrong...

My left
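(The question's example is truncated, but the behavior itself is easy to reproduce: a left join emits one output row per matching right row, so duplicate join keys on the right side inflate the result beyond the left table's row count. A minimal sketch:)

```python
import pandas as pd

# Left table: 2 rows, unique join keys
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Right table: key 1 appears twice
right = pd.DataFrame({'A': [1, 1], 'C': [5, 6]})

merged = pd.merge(left, right, on='A', how='left')
# The left row with A == 1 matches two right rows, so it is duplicated;
# the result has 3 rows even though the left table has only 2.
print(len(merged))  # 3
```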

3 Answers
  •  独厮守ぢ
    2020-11-29 04:46

    There are also strategies you can use to avoid this behavior that don't involve losing the duplicated data if, for example, not all columns are duplicated. If you have

    In [1]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
    
    In [2]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])
    

    One way would be to take the mean of the duplicate (can also take the sum, etc...)

    In [3]: df3 = df2.groupby('A').mean().reset_index()
    
    In [4]: df3
    Out[4]: 
       A    C
    0  1  5.5
    
    In [5]: merged = pd.merge(df,df3,on=['A'], how='outer')
    
    In [6]: merged
    Out[6]: 
       A  B    C
    0  1  3  5.5
    1  2  4  NaN
    

    Alternatively, if you have non-numeric data that cannot be converted using pd.to_numeric(), or if you simply do not want to take the mean, you can alter the merging variable by enumerating the duplicates. Note that this strategy applies when the duplicates exist in both datasets (duplicates on both sides cause the same row-multiplying behavior and are also a common problem):

    In [7]: df = pd.DataFrame([['a', 3], ['b', 4],['b',0]], columns=['A', 'B'])
    
    In [8]: df2 = pd.DataFrame([['a', 3], ['b', 8],['b',5]], columns=['A', 'C'])
    
    In [9]: df['count'] = df.groupby('A')['B'].cumcount()
    
    In [10]: df['A'] = np.where(df['count']>0,df['A']+df['count'].astype(str),df['A'].astype(str))
    
    In [11]: df
    Out[11]: 
        A  B  count
    0   a  3      0
    1   b  4      0
    2  b1  0      1
    

    Do the same for df2, drop the count variables in df and df2 and merge on 'A':

    In [16]: merged
    Out[16]: 
        A  B  C
    0   a  3  3        
    1   b  4  8        
    2  b1  0  5        
    
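    The elided steps (In [12] through In [15]) could look like the following sketch, which enumerates the duplicates in both frames, drops the helper count columns, and merges:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame([['a', 3], ['b', 4], ['b', 0]], columns=['A', 'B'])
    df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])

    # Enumerate duplicate keys in each frame, as shown above
    for frame, col in ((df, 'B'), (df2, 'C')):
        frame['count'] = frame.groupby('A')[col].cumcount()
        frame['A'] = np.where(frame['count'] > 0,
                              frame['A'] + frame['count'].astype(str),
                              frame['A'].astype(str))

    # Drop the helper columns and merge on the now-unique keys
    merged = pd.merge(df.drop(columns='count'),
                      df2.drop(columns='count'), on='A')
    ```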

    A couple of notes. In this last case I use .cumcount() instead of .duplicated() because you may have more than one duplicate for a given observation. Also, I use .astype(str) to convert the count values to strings because I use the np.where() command, but using pd.concat() or something else might allow for different applications.

    Finally, if it is the case that only one dataset has the duplicates but you still want to keep them then you can use the first half of the latter strategy to differentiate the duplicates in the resulting merge.
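    A sketch of that one-sided case, assuming the duplicates live only in df2 and you want to keep them in the merged result (an outer merge keeps the enumerated keys that have no match on the left):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame([['a', 3], ['b', 4]], columns=['A', 'B'])             # no duplicates
    df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])  # 'b' duplicated

    # Enumerate the duplicates only in df2
    df2['count'] = df2.groupby('A')['C'].cumcount()
    df2['A'] = np.where(df2['count'] > 0,
                        df2['A'] + df2['count'].astype(str),
                        df2['A'].astype(str))

    # 'b1' has no match in df, so an outer merge keeps it with NaN in B
    merged = pd.merge(df, df2.drop(columns='count'), on='A', how='outer')
    ```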
