If you are trying to filter a DataFrame
using another, you should use join
(or any of its variants). If what you need is to filter it using a List
or any data structure that fits in your master and workers you could broadcast it, then reference it inside the filter
or where
method.
For instance I would do something like:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df2.join(df1, joinExprs=df1("city") === df2("cities"), joinType="full_outer")
.select("city", "cities")
.where(isnull($"cities"))
.drop("cities").show()