Question
My dataframes look like:
df1
user_id  username  firstname  lastname
123      abc       abc        abc
456      def       def        def
789      ghi       ghi        ghi
df2
user_id  username  firstname  lastname
111      xyz       xyz        xyz
456      def       def        def
234      mnp       mnp        mnp
Now I want an output dataframe like:
user_id  username  firstname  lastname
123      abc       abc        abc
456      def       def        def
789      ghi       ghi        ghi
111      xyz       xyz        xyz
234      mnp       mnp       mnp
user_id 456 is common to both dataframes, so its row should appear only once in the output. I have tried groupby(['user_id']), but it looks like groupby needs to be followed by some aggregation, which I don't want here.
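For reproducibility, the two frames can be built as:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'user_id': [123, 456, 789],
                    'username': ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'],
                    'lastname': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame({'user_id': [111, 456, 234],
                    'username': ['xyz', 'def', 'mnp'],
                    'firstname': ['xyz', 'def', 'mnp'],
                    'lastname': ['xyz', 'def', 'mnp']})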
Answer 1:
Use concat + drop_duplicates:
df = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
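Note that drop_duplicates keeps the first occurrence by default (keep='first'), so a duplicated user_id keeps df1's row. If df2's version should win instead, pass keep='last' (a small variation, not part of the original answer):

df = pd.concat([df1, df2]).drop_duplicates('user_id', keep='last').reset_index(drop=True)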
A solution with groupby and aggregating with first is slower:
df = pd.concat([df1, df2]).groupby('user_id', as_index=False, sort=False).first()
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
EDIT:
Another solution with boolean indexing and numpy.in1d:
df = pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
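To see what the boolean indexing does, here is the intermediate mask for the sample data (shown for illustration):

mask = np.in1d(df2['user_id'], df1['user_id'])
print (mask)
[False  True False]

df2[~mask] then keeps only the rows unique to df2 (user_ids 111 and 234), which concat appends to df1.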
Answer 2:
One approach with masking -
def app1(df1, df2):
    # keep only df2 rows whose user_id is not already in df1
    df20 = df2[~df2.user_id.isin(df1.user_id)]
    return pd.concat([df1, df20], axis=0)
Here are more approaches that work on the underlying array data, using np.in1d (app2, app3) or np.searchsorted (app4) to get the mask of matches, then stacking the two arrays and constructing an output dataframe from the stacked data -
def app2(df1, df2):
    # mask off df2 rows whose user_id already appears in df1
    df20_arr = df2.values[~np.in1d(df2.user_id.values, df1.user_id.values)]
    arr = np.vstack((df1.values, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
def app3(df1, df2):
    a = df1.values
    b = df2.values
    # user_id is the first column; keep df2 rows whose id is not in df1
    df20_arr = b[~np.in1d(b[:, 0], a[:, 0])]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
def app4(df1, df2):
    a = df1.values
    b = df2.values
    b0 = b[:, 0].astype(int)
    as0 = np.sort(a[:, 0].astype(int))
    idx = np.searchsorted(as0, b0)
    idx[idx == len(as0)] = 0          # guard against ids beyond the largest df1 id
    df20_arr = b[as0[idx] != b0]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
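For intuition on the searchsorted step in app4: it finds where each df2 id would land in the sorted df1 ids, and comparing the landed value against the id itself tells whether it matched. Illustrated on the sample ids (a sketch, not part of the original answer):

as0 = np.sort(np.array([123, 456, 789]))   # sorted df1 ids
b0 = np.array([111, 456, 234])             # df2 ids
idx = np.searchsorted(as0, b0)             # array([0, 1, 1])
print (as0[idx] != b0)                     # [ True False  True] -> rows unique to df2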
Timings for the given sample -
In [49]: %timeit app1(df1,df2)
...: %timeit app2(df1,df2)
...: %timeit app3(df1,df2)
...: %timeit app4(df1,df2)
...:
1000 loops, best of 3: 753 µs per loop
10000 loops, best of 3: 192 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 171 µs per loop
# @jezrael's edited solution
In [85]: %timeit pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
1000 loops, best of 3: 614 µs per loop
Would be interesting to see how these fare on larger datasets.
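A minimal sketch of such a test (assumed setup, not benchmarked in the original answer) - build two frames of 100k rows with overlapping ids and reuse the functions above:

N = 100000
ids1 = np.random.choice(N * 2, N, replace=False)
ids2 = np.random.choice(N * 2, N, replace=False)   # overlaps ids1 by chance
big1 = pd.DataFrame({'user_id': ids1, 'username': 'a', 'firstname': 'a', 'lastname': 'a'})
big2 = pd.DataFrame({'user_id': ids2, 'username': 'b', 'firstname': 'b', 'lastname': 'b'})

%timeit app1(big1, big2)
%timeit app4(big1, big2)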
Answer 3:
Another approach is to use np.setdiff1d to find the user_ids that appear in df2 but not in df1, and keep only those rows of df2.
pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
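Here np.setdiff1d returns the sorted unique ids present in df2 but missing from df1 (shown for illustration):

print (np.setdiff1d(df2.user_id, df1.user_id))
[111 234]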
Or use a set to keep only the unique rows from the stacked records of df1 and df2. This one seems to be a few times faster, but note that it deduplicates on the entire row (so it assumes duplicated user_ids carry identical data) and does not preserve row order.
pd.DataFrame(data=np.vstack(list({tuple(row) for row in np.r_[df1.values, df2.values]})), columns=df1.columns)
Timings:
%timeit pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
1000 loops, best of 3: 2.48 ms per loop
%timeit pd.DataFrame(data=np.vstack(list({tuple(row) for row in np.r_[df1.values,df2.values]})),columns=df1.columns)
1000 loops, best of 3: 632 µs per loop
Source: https://stackoverflow.com/questions/44236940/concat-python-dataframes-based-on-unique-rows