Spark filter DataFrames based on common values

Submitted by 断了今生、忘了曾经 on 2019-12-12 06:55:56

Question


I have two DataFrames, DF1 and DF2. The first has a column "new_id"; the second has a column "db_id".

I need to filter out every row of the first DataFrame whose new_id value does not appear in db_id.

val new_id = Seq(1, 2, 3, 4)
val db_id = Seq(1, 4, 5, 6, 10)

With this data, the rows with new_id == 1 and new_id == 4 should stay in DF1, and the rows with new_id == 2 and new_id == 3 should be removed, since 2 and 3 are not in db_id.

There are a ton of questions on DataFrames here, and I might have missed this one. Sorry if this is a duplicate.

P.S. I am using Scala, if that matters.


Answer 1:


What you need is a left semi join:

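// Assumes `spark` is an existing SparkSession (as in spark-shell).
import spark.implicits._

val DF1 = Seq(1, 3).toDF("new_id")
val DF2 = Seq(1, 2).toDF("db_id")

// A left semi join keeps only the rows of DF1 that have a match in DF2,
// and returns only DF1's columns.
DF1.as("df1")
  .join(DF2.as("df2"), $"df1.new_id" === $"df2.db_id", "leftsemi")
  .show()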

+------+
|new_id|
+------+
|     1|
+------+
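
Applied to the data from the question, the same join could look like the sketch below. The object name SemiJoinSketch, the local SparkSession setup, and the value names are assumptions made for illustration; they are not part of the original answer. The "leftanti" join is shown only as the complement, to make visible which rows get dropped.

```scala
import org.apache.spark.sql.SparkSession

object SemiJoinSketch {
  // Assumption: a local SparkSession created just for this example.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("semi-join-sketch")
    .getOrCreate()

  import spark.implicits._

  // The sample values from the question.
  val df1 = Seq(1, 2, 3, 4).toDF("new_id")
  val df2 = Seq(1, 4, 5, 6, 10).toDF("db_id")

  // Left semi join: rows of df1 whose new_id has a match in db_id.
  // Only df1's columns appear in the result.
  val kept = df1.join(df2, df1("new_id") === df2("db_id"), "leftsemi")

  // Left anti join: the complement, i.e. rows of df1 with no match in df2.
  val dropped = df1.join(df2, df1("new_id") === df2("db_id"), "leftanti")

  // Collect to the driver and sort, since join output order is not guaranteed.
  val keptIds: Seq[Int] = kept.as[Int].collect().sorted.toSeq
  val droppedIds: Seq[Int] = dropped.as[Int].collect().sorted.toSeq
}
```

Here keptIds should contain 1 and 4, and droppedIds should contain 2 and 3, which matches what the question asks for.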


Source: https://stackoverflow.com/questions/49658745/spark-filter-dataframes-based-on-common-values
