Difference(s) between merge() and concat() in pandas

二次信任 提交于 2019-11-26 15:12:33

问题


What's the essential difference(s) between pd.DataFrame.merge() and pd.concat()?

So far, this is what I found, please comment on how complete and accurate my understanding is:

  • .merge() can only use columns (plus row-indices) and it is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option for adding a hierarchical index.

  • Incidentally, this allows for the following redundancy: both can combine two dataframes using the rows indices.

  • pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()

(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )


回答1:


A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

For example:

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df1:
   Key  data1
0   b   0
1   b   1
2   a   2
3   c   3
4   a   4
5   a   5
6   b   6

df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

df2:
    Key data2
0   a   0
1   b   1
2   d   2

#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is 
# a common column in 2 dataframes

pd.merge(df1, df2)

   Key data1 data2
0   b    0    1
1   b    1    1
2   b    6    1
3   a    2    0
4   a    4    0
5   a    5    0

#Concat
# df2 dataframe is appended at the bottom of df1 

pd.concat([df1, df2])

   Key data1 data2
0   b   0     NaN
1   b   1     NaN
2   a   2     NaN
3   c   3     NaN
4   a   4     NaN
5   a   5     NaN
6   b   6     NaN
0   a   Nan   0
1   b   Nan   1
2   d   Nan   2



回答2:


pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument. Also Dimensions of the DataFrame should match along axis while concatenating.

pd.merge can take DataFrames as its argument, and is used to combine two DataFrames with same columns or index, which can't be done with pd.concat since it will show the repeated column in the DataFrame.

Whereas join can be used to join two DataFrames with different indices.




回答3:


I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().

Nice question. The main difference:

pd.concat works on both axes.

The other difference, is pd.concat has innerdefault and outer joins only, while pd.DataFrame.merge() has left, right, outer, innerdefault joins.

Third notable other difference is: pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.


With pd.concat by default you are able to stack rows of multiple dataframes (axis=0) and when you set the axis=1 then you mimic the pd.DataFrame.merge() function.

Some useful examples of pd.concat:

df2=pd.concat([df]*2, ignore_index=True) #double the rows of a dataframe

df2=pd.concat([df, df.iloc[[0]]]) # add first row to the end

df3=pd.concat([df1,df2], join='inner', ignore_index=True) # concat two df's



回答4:


by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join

pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis

Join and pd.merge:
can take DataFrame arguments

Click to see picture for understanding why code below does the same thing

df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)


来源:https://stackoverflow.com/questions/38256104/differences-between-merge-and-concat-in-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!