Spark: subtract two DataFrames


Question


In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the rows from the first one that are not in the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?


Answer 1:


According to the API docs, calling:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing the rows in dataFrame1 that are not in dataFrame2.
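A minimal Scala sketch of this, using a modern local SparkSession for brevity (in Spark 1.3.0 you would build the DataFrames from a SQLContext instead, but except is called the same way); the object name, data, and column names are made up for illustration:

import org.apache.spark.sql.SparkSession

object ExceptExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("except-example")
      .getOrCreate()
    import spark.implicits._

    val yesterday = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val today     = Seq((2, "b"), (3, "c"), (4, "d")).toDF("id", "value")

    // Rows present in `today` but not in `yesterday` -> only (4, "d")
    val onlyNewData = today.except(yesterday)
    onlyNewData.show()

    spark.stop()
  }
}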




Answer 2:


In the PySpark docs, the equivalent is subtract:

df1.subtract(df2)



Answer 3:


I tried subtract, but the result was not consistent: when I ran df1.subtract(df2), not all rows of df1 appeared in the result DataFrame, probably because of the distinct behavior mentioned in the docs.

This solved my problem: df1.exceptAll(df2)
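This matches the documented semantics: except/subtract behaves like SQL EXCEPT DISTINCT, while exceptAll (available from Spark 2.4) keeps duplicates. The answer's df1.subtract(df2) and df1.exceptAll(df2) are the PySpark calls; below is a sketch of the same distinction on the Scala side, where the analogous Dataset methods are except and exceptAll (names and data invented for illustration):

import org.apache.spark.sql.SparkSession

object ExceptAllExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("exceptAll-example")
      .getOrCreate()
    import spark.implicits._

    // df1 contains duplicate rows on purpose.
    val df1 = Seq(1, 1, 2, 3).toDF("id")
    val df2 = Seq(3).toDF("id")

    // except behaves like SQL EXCEPT DISTINCT: duplicates in df1 are collapsed.
    df1.except(df2).show()     // rows: 1, 2

    // exceptAll (Spark 2.4+) behaves like SQL EXCEPT ALL: duplicates survive.
    df1.exceptAll(df2).show()  // rows: 1, 1, 2

    spark.stop()
  }
}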



Source: https://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes
