I'm trying to filter a Spark DataFrame based on whether the values in a column equal a list. I would like to do something like this:
filtered_df = df.where(df["a"] == ["list", "of", "stuff"])
You can use a combination of the array, lit, and array_except functions to achieve this. The expression
array(lit("list"), lit("of"), lit("stuff"))
builds an array column equivalent to the Python list ["list", "of", "stuff"].
Note: the array_except function is available from Spark 2.4.0.
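For context, array_except(a, b) returns the elements of a that are not present in b, with duplicates removed. Here is a minimal sketch of its behavior on one of the sample rows used below (assuming an active SparkSession named spark):
from pyspark.sql.functions import array, array_except, lit

demo = spark.createDataFrame([(['list', 'of', 'stuff', 'and', 'foo'],)], ['a'])
# Elements of "a" that do not appear in the target list
demo.select(array_except('a', array(lit("list"), lit("of"), lit("stuff"))).alias('diff')).show()
+----------+
|      diff|
+----------+
|[and, foo]|
+----------+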
Here is the code:
# Import the required functions
from pyspark.sql.functions import array, array_except, lit, size
# Create a sample DataFrame
df = spark.createDataFrame([
    (1, ['list', 'of', 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list', 'of', 'stuff', 'and', 'foo']),
    (5, ['a', 'list', 'of', 'stuff']),
], ['id', 'a'])
# Keep rows whose array contains no elements outside the target list
df1 = df.filter(size(array_except(df["a"], array(lit("list"), lit("of"), lit("stuff")))) == 0)
# Display result
df1.show()
+---+-----------------+
| id| a|
+---+-----------------+
| 1|[list, of, stuff]|
+---+-----------------+
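Note that size(array_except(...)) == 0 only guarantees that "a" contains no elements outside the target list; a row such as ['list', 'of'] or a reordered ['stuff', 'of', 'list'] would also pass. If you need an exact, order-sensitive match, one sketch is to compare the column directly against an array column:
# Exact equality, including element order
df2 = df.filter(df["a"] == array(lit("list"), lit("of"), lit("stuff")))
df2.show()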
I hope this helps.