Filter DataFrame based on words in array in Apache Spark

Submitted by 爱⌒轻易说出口 on 2019-12-10 12:21:09

Question


I am trying to filter a Dataset, keeping only the rows whose text contains one of the words in an array. I am using the contains method; it works for a string but not for an array. Below is the code:

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop.cache()

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val newData = dataSet.select("*").filter(col("_source.raw_text").contains(threats_path)).show()

It is not working because threats_path is an array of strings and contains works on a single string. Any help would be appreciated.


Answer 1:


You can use the isin method on the column (it is a built-in Column method, not a UDF).

It will go something like,

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val dataSet = ???

dataSet.where(col("_source.raw_text").isin(threats_path: _*))

Note that if threats_path is large, this will hurt performance twice over: collect pulls the whole file onto the driver, and isin builds one large predicate that is evaluated against every row.
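Per row, isin is just set membership against the collected values. The same semantics in plain Scala (with hypothetical sample strings standing in for the _source.raw_text column and the threats file) look like this:

```scala
// Plain-Scala sketch of what col(...).isin(threats_path: _*) does per row.
// The threat terms and raw_text values below are made-up sample data.
val threats = Set("malware", "phishing", "ransomware")

val rawTexts = Seq("phishing", "benign log line", "malware")

// Keep only rows whose raw_text exactly matches one of the threat terms.
val matched = rawTexts.filter(threats.contains)
```

Note that, like isin, this is an exact match on the whole value, not a substring search.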

Instead, I suggest filtering dataSet against threats_path using a join, which avoids the collect entirely. It will go something like:

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop

val threats_path = spark.read.textFile("src/main/resources/cyber_threats")

val newData = dataSet.join(threats_path, col("_source.raw_text") === col("value"), "left_semi")

newData.show()

Here value is the single column of the Dataset[String] returned by spark.read.textFile, and a left_semi join keeps only the dataSet rows that have a match in threats_path, without adding any columns from the threats side.
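Filtering via a join keeps exactly those dataSet rows whose raw_text matches some threat term; a semi-join expresses this directly. In plain Scala terms (with a hypothetical Row case class and made-up values), the semantics are:

```scala
// Plain-Scala sketch of semi-join-as-filter semantics.
// Row and all sample values are hypothetical stand-ins for the real data.
case class Row(rawText: String, severity: Int)

val dataSet = Seq(Row("phishing", 3), Row("normal traffic", 1), Row("malware", 5))
val threats = Seq("malware", "phishing")

// A semi join keeps each left row that has at least one match on the right,
// emitting only the left side's columns.
val threatKeys = threats.toSet
val newData = dataSet.filter(r => threatKeys.contains(r.rawText))
```

In Spark the same idea distributes: the join is executed across partitions (broadcast when the threats side is small) instead of collecting anything to the driver.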

Hope this helps



Source: https://stackoverflow.com/questions/52613487/filter-dataframe-based-on-words-in-array-in-apache-spark
