How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

Vassilis Moustakas

I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.

Assuming you have a DataFrame dataDF:

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|
+---------+--------+---------+

and an array b containing the values you want to match in browse:

val b: Array[String] = Array("M", "Z")

Implement the udf:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

// returns true when the row's array shares at least one element with s
def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)

and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:

dataDF.where(array_contains_any(b)($"browse"))
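
For reference, a minimal self-contained sketch of the whole approach (the SparkSession setup and toy data below are my assumptions for illustration, not part of the original answer); on Spark 2.4+ the built-in arrays_overlap function can replace the UDF entirely, as noted in the trailing comments:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// toy data mirroring the example above
val dataDF = Seq(
  ("foo1", Array("X", "Y", "Z"), "bar1"),
  ("foo2", Array("K", "L"), "bar2"),
  ("foo3", Array("M"), "bar3")
).toDF("column 1", "browse", "column n")

def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)

val b = Array("M", "Z")
dataDF.where(array_contains_any(b)($"browse")).show()   // keeps the foo1 and foo3 rows

// Spark 2.4+ alternative without a UDF:
// import org.apache.spark.sql.functions.{array, arrays_overlap, lit}
// dataDF.where(arrays_overlap($"browse", array(b.map(lit): _*))).show()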

Assume input data: DataFrame A

browse
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887

and you have to match it against DataFrame B

browsenodeid (I flattened the column browsenodeid): 123,200,300

val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\Dataframe_A.txt")
rawrdd.map(_.split("\\|"))   // escape the pipe: String.split takes a regex
      .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(","))
      .foreach(println)

Your output:

300,200
300
123
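
As an aside, the same matching can stay in the DataFrame API instead of dropping to RDDs. A minimal sketch, assuming Spark 2.4+ (for array_intersect), a SparkSession named spark, and a DataFrame dfA with a string column browse holding the comma-separated ids:

import org.apache.spark.sql.functions.{array, array_intersect, lit, size, split}
import spark.implicits._

// literal array column holding the ids to match against
val matchCol = array(Seq("123", "200", "300").map(lit): _*)

val result = dfA
  .withColumn("matched", array_intersect(split($"browse", ","), matchCol))
  .filter(size($"matched") > 0)   // keep rows sharing at least one id

result.select("matched").show(false)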

Updated

val matchSet = "A,Z,M".split(",").toSet

val rawrdd = sc.textFile("/FileStore/tables/mvv45x9f1494518792828/input_A.txt")

rawrdd.map(_.split("\\|"))   // escape the pipe: String.split takes a regex
      .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)   // filter first: an if without else would emit Unit for non-matching rows
      .map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
      .collect.foreach(println)

Output is

[foo1,X,Y,Z,bar1]
[foo3,M,bar3]
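
If you need the matched rows back as a DataFrame rather than printed Rows, a minimal follow-up sketch (the schema, the column names, and the spark session name are assumptions, not part of the answer):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// assumed schema for the three pipe-delimited columns
val schema = StructType(Seq("column1", "browse", "columnN").map(StructField(_, StringType)))

val rowRDD = rawrdd.map(_.split("\\|"))
  .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)
  .map(r => Row(r(0), r(1), r(2)))

val matchedDF = spark.createDataFrame(rowRDD, schema)
matchedDF.show()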