Question
What are the essential differences between pd.DataFrame.merge() and pd.concat()?
So far, this is what I found, please comment on how complete and accurate my understanding is:
.merge() can only use columns (plus row indices) and is semantically suited to database-style operations.
.concat() can be used along either axis, using only indices, and gives the option of adding a hierarchical index.
Incidentally, this allows for the following redundancy: both can combine two dataframes using the row indices.
pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()
(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting to explore the documentation and figure out the best way to perform a particular task.)
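To make the "hierarchical index" point concrete, here is a minimal sketch of pd.concat with the keys argument. The frames q1 and q2 and their contents are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical frames, invented for this illustration.
q1 = pd.DataFrame({"sales": [10, 20]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [30, 40]}, index=["north", "south"])

# keys= labels each input frame; the labels become the outer index level.
stacked = pd.concat([q1, q2], keys=["q1", "q2"])
print(stacked)

# Selecting on the outer level recovers the rows that came from one frame.
print(stacked.loc["q2"])
```

The outer level records which input each row came from, which is something .merge() does not offer directly.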
Answer 1:
At a very high level, merge() combines two dataframes on the basis of the values in common columns (indices can also be used, via left_index=True and/or right_index=True), while concat() appends one or more dataframes one below the other (or side by side, depending on whether the axis option is set to 0 or 1).
join() merges two dataframes on the basis of the index; instead of using merge() with the options left_index=True and right_index=True, we can use join().
For example:
import pandas as pd

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
  Key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
  Key  data2
0   a      0
1   b      1
2   d      2
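# Merge
# The two dataframes are merged on the values in column "Key", since it is
# the only column they have in common. The default is an inner join, so the
# unmatched keys 'c' and 'd' are dropped.
# (Row order may differ between pandas versions.)
pd.merge(df1, df2)
  Key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0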
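# Concat
# df2 is appended at the bottom of df1. Columns missing from one frame are
# filled with NaN, which turns the integer columns into floats.
pd.concat([df1, df2])
  Key  data1  data2
0   b    0.0    NaN
1   b    1.0    NaN
2   a    2.0    NaN
3   c    3.0    NaN
4   a    4.0    NaN
5   a    5.0    NaN
6   b    6.0    NaN
0   a    NaN    0.0
1   b    NaN    1.0
2   d    NaN    2.0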
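The answer's point that join() is a shorthand for an index-based merge() can be sketched with the same two frames. This is only an illustration: moving df2's "Key" column into its index with set_index is an assumption made here so that join() has an index to match against.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

# join() matches df1's "Key" column against the index of the other frame,
# so df2 is re-indexed by "Key" first. join() is a left join by default.
df2_by_key = df2.set_index('Key')
via_join = df1.join(df2_by_key, on='Key')

# The same operation spelled out with merge().
via_merge = pd.merge(df1, df2_by_key, left_on='Key', right_index=True, how='left')

print(via_join.equals(via_merge))
print(via_join)
```

Because join() is a left join by default, the row with Key 'c' is kept here with a missing data2 value, whereas the default inner pd.merge(df1, df2) drops it.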
Answer 2:
pd.concat takes an iterable of DataFrames as its first argument, so the frames have to be wrapped in a list such as [df1, df2] rather than passed as separate arguments. The frames do not need identical labels along the other axis: with the default join='outer', labels missing from one frame are filled with NaN.
pd.merge takes the two DataFrames directly as arguments and combines them on shared columns or on the index. pd.concat with axis=1 cannot do this, since it aligns on the index only and a shared column then appears twice in the result.
join, in turn, can be used to combine two DataFrames with different indices.
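The repeated-column behaviour can be seen in a small sketch. The frames left and right are hypothetical and chosen only to share a "Key" column.

```python
import pandas as pd

# Hypothetical frames that share the column "Key".
left = pd.DataFrame({"Key": ["a", "b"], "data1": [1, 2]})
right = pd.DataFrame({"Key": ["a", "b"], "data2": [3, 4]})

# concat aligns on the row index only, so "Key" is simply carried over twice.
side_by_side = pd.concat([left, right], axis=1)
print(list(side_by_side.columns))

# merge matches rows on the values of "Key" and keeps a single copy of it.
merged = pd.merge(left, right, on="Key")
print(list(merged.columns))
```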
Answer 3:
I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().
Nice question. The main difference: pd.concat works on both axes.
The second difference is that pd.concat has outer (default) and inner joins only, while pd.DataFrame.merge() has left, right, outer, and inner (default) joins.
The third notable difference is that pd.DataFrame.merge() has the option to set column suffixes when merging columns with the same name, while pd.concat has no such option.
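A rough sketch of the join types and the suffixes option, using made-up frames whose non-key column deliberately has the same name in both:

```python
import pandas as pd

# Hypothetical frames; "value" exists in both so that suffixes are needed.
left = pd.DataFrame({"Key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"Key": ["b", "c", "d"], "value": [20, 30, 40]})

# merge supports several join types through how=.
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(left, right, on="Key", how=how, suffixes=("_left", "_right"))
    keys_by_how[how] = sorted(result["Key"])
    print(how, keys_by_how[how])

# The clashing "value" columns are told apart by the suffixes.
merged_columns = list(result.columns)
print(merged_columns)

# concat only distinguishes outer (default) from inner, applied to the labels
# of the axis that is not being concatenated (here: the columns).
other = right.rename(columns={"value": "other"})
outer_columns = list(pd.concat([left, other]).columns)
inner_columns = list(pd.concat([left, other], join="inner").columns)
print(outer_columns, inner_columns)
```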
With pd.concat you stack the rows of multiple dataframes by default (axis=0); when you set axis=1, the frames are aligned on their index and placed side by side, which mimics an index-on-index pd.DataFrame.merge().
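Some useful examples of pd.concat (df, df1 and df2 are existing dataframes):
doubled = pd.concat([df] * 2, ignore_index=True)  # double the rows of a dataframe
with_first = pd.concat([df, df.iloc[[0]]])  # append a copy of the first row to the end
combined = pd.concat([df1, df2], join='inner', ignore_index=True)  # stack two dataframes, keeping only their shared columns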
Answer 4:
by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
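These defaults can be checked with a short sketch. The frames are hypothetical, and set_index("Key") is used here only so that join(), which matches on the index, has something comparable to work with.

```python
import pandas as pd

# Hypothetical frames with partly overlapping keys.
left = pd.DataFrame({"Key": ["a", "b", "c"], "data1": [1, 2, 3]})
right = pd.DataFrame({"Key": ["b", "c", "d"], "data2": [20, 30, 40]})

# merge: inner join on the shared column "Key" by default.
merged = pd.merge(left, right)
print(sorted(merged["Key"]))

# join: left join on the index by default, so every left key is kept.
joined = left.set_index("Key").join(right.set_index("Key"))
print(list(joined.index))

# concat: stacks rows and takes the union of the columns by default.
stacked = pd.concat([left, right])
print(stacked.shape, list(stacked.columns))
```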
pd.concat:
takes an iterable of DataFrames, so the frames cannot be passed as separate arguments (use [df, df2])
labels along the other axis do not have to match; with the default outer join, missing entries are filled with NaN
join and pd.merge:
take DataFrames directly as arguments
The three calls below give the same result when both dataframes share the same unique index and have no column names in common:
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
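A minimal sketch of when these three calls agree and when they do not. The frames are redefined here with made-up values so that the snippet runs on its own, and df3 is a hypothetical third frame with a partly different index.

```python
import pandas as pd

# Two frames with an identical index and no shared column names.
df1 = pd.DataFrame({"data1": [1, 2, 3]}, index=["x", "y", "z"])
df2 = pd.DataFrame({"data2": [4, 5, 6]}, index=["x", "y", "z"])

via_join = df1.join(df2)
via_merge = pd.merge(df1, df2, left_index=True, right_index=True)
via_concat = pd.concat([df1, df2], axis=1)
print(via_join.equals(via_merge) and via_merge.equals(via_concat))

# With a partly different index the default join types show through:
# join keeps the left index, merge the intersection, concat the union.
df3 = pd.DataFrame({"data2": [5, 6, 7]}, index=["y", "z", "w"])
row_counts = (
    len(df1.join(df3)),
    len(pd.merge(df1, df3, left_index=True, right_index=True)),
    len(pd.concat([df1, df3], axis=1)),
)
print(row_counts)
```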
Source: https://stackoverflow.com/questions/38256104/differences-between-merge-and-concat-in-pandas