Filter DataFrame based on words in array in Apache Spark

Submitted by 爱⌒轻易说出口 on 2019-12-10 12:21:09

Question


I am trying to filter a Dataset, keeping only the rows whose text contains one of the words in an array. I am using the contains method; it works for a string but not for an array. Below is the code:

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop.cache()

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val newData = dataSet.select("*").filter(col("_source.raw_text").contains(threats_path)).show()

It is not working because threats_path is an array of strings and contains works on a single string. Any help would be appreciated.


Answer 1:


You can use the isin method on the column (it is a built-in Column method, not a UDF).

It will go something like,

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val dataSet = ???

dataSet.where(col("_source.raw_text").isin(threats_path: _*))

Note that if threats_path is large, this will hurt performance twice over: collect pulls the whole file onto the driver, and isin builds one large predicate that is evaluated against every row.
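Per row, isin is just set membership against the collected values. The same semantics in plain Scala (with hypothetical sample strings standing in for the _source.raw_text column and the threats file) look like this:

```scala
// Plain-Scala sketch of what col(...).isin(threats_path: _*) does per row.
// The threat terms and raw_text values below are made-up sample data.
val threats = Set("malware", "phishing", "ransomware")

val rawTexts = Seq("phishing", "benign log line", "malware")

// Keep only rows whose raw_text exactly matches one of the threat terms.
val matched = rawTexts.filter(threats.contains)
```

Note that, like isin, this is an exact match on the whole value, not a substring search.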

Instead, I suggest filtering dataSet against threats_path using a join, which avoids the collect entirely. It will go something like:

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop

val threats_path = spark.read.textFile("src/main/resources/cyber_threats")

val newData = dataSet.join(threats_path, col("_source.raw_text") === col("value"), "left_semi")

newData.show()

Here value is the single column of the Dataset[String] returned by spark.read.textFile, and a left_semi join keeps only the dataSet rows that have a match in threats_path, without adding any columns from the threats side.
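Filtering via a join keeps exactly those dataSet rows whose raw_text matches some threat term; a semi-join expresses this directly. In plain Scala terms (with a hypothetical Row case class and made-up values), the semantics are:

```scala
// Plain-Scala sketch of semi-join-as-filter semantics.
// Row and all sample values are hypothetical stand-ins for the real data.
case class Row(rawText: String, severity: Int)

val dataSet = Seq(Row("phishing", 3), Row("normal traffic", 1), Row("malware", 5))
val threats = Seq("malware", "phishing")

// A semi join keeps each left row that has at least one match on the right,
// emitting only the left side's columns.
val threatKeys = threats.toSet
val newData = dataSet.filter(r => threatKeys.contains(r.rawText))
```

In Spark the same idea distributes: the join is executed across partitions (broadcast when the threats side is small) instead of collecting anything to the driver.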

Hope this helps



Source: https://stackoverflow.com/questions/52613487/filter-dataframe-based-on-words-in-array-in-apache-spark
