Filter by whether column value equals a list in Spark

一个人的身影 2020-12-06 06:38

I'm trying to filter a Spark dataframe based on whether the values in a column equal a list. I would like to do something like this:

filtered_df = df.where(
3 answers
  •  遥遥无期
    2020-12-06 07:16

    You can use a combination of the "array", "lit" and "array_except" functions to achieve this.

    1. We create an array column using array(lit("list"), lit("of"), lit("stuff")).
    2. Then we use the array_except function to get the values present in the first array but not in the second.
    3. Then we filter for an empty result array, which means all the elements of the first array also appear in ["list", "of", "stuff"].

    Note: the array_except function is available from Spark 2.4.0.
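As a rough mental model (a plain-Python sketch, not the Spark API), array_except keeps the distinct elements of the first array that do not appear in the second, in order of first occurrence:

```python
def array_except_sketch(arr1, arr2):
    # Sketch of Spark's array_except semantics: distinct elements of
    # arr1 not present in arr2, preserving first-occurrence order.
    seen = set()
    out = []
    for x in arr1:
        if x not in arr2 and x not in seen:
            seen.add(x)
            out.append(x)
    return out

print(array_except_sketch(["list", "of", "stuff", "and", "foo"],
                          ["list", "of", "stuff"]))  # ['and', 'foo']
print(array_except_sketch(["list", "of", "stuff"],
                          ["list", "of", "stuff"]))  # []
```

When the result is empty, every element of the first array is covered by the second, which is exactly what the size(...) == 0 filter below checks.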

    Here is the code:

    # Import libraries
    from pyspark.sql.functions import array, array_except, lit, size
    
    # Create DataFrame
    df = sc.parallelize([
        (1, ['list','of' , 'stuff']),
        (2, ['foo', 'bar']),
        (3, ['foobar']),
        (4, ['list','of' , 'stuff', 'and', 'foo']),
        (5, ['a', 'list','of' , 'stuff']),
    ]).toDF(['id', 'a'])
    
    # Solution: keep rows whose array has no elements outside ["list", "of", "stuff"]
    df1 = df.filter(size(array_except(df["a"], array(lit("list"), lit("of"), lit("stuff")))) == 0)
    
    # Display result
    df1.show() 
    

    Output

    +---+-----------------+
    | id|                a|
    +---+-----------------+
    |  1|[list, of, stuff]|
    +---+-----------------+
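One caveat: the size(array_except(...)) == 0 condition only checks that every element of the column appears in the target list, so a row like ["stuff"] or ["stuff", "of", "list"] would also pass. A plain-Python sketch of the difference between that subset test and strict equality (the target name and helper functions here are illustrative, not Spark APIs):

```python
target = ["list", "of", "stuff"]

def passes_answer_filter(a):
    # The answer's condition: array_except(a, target) is empty,
    # i.e. every element of a appears somewhere in target.
    return all(x in target for x in a)

def exact_match(a):
    # Strict, order-sensitive equality with the target list.
    return a == target

print(passes_answer_filter(["list", "of", "stuff"]))  # True
print(passes_answer_filter(["stuff"]))                # True
print(exact_match(["stuff"]))                         # False
```

If a strict, order-sensitive match is what you actually want, comparing the column directly should work, e.g. df.filter(df["a"] == array(lit("list"), lit("of"), lit("stuff"))), since Spark supports equality comparison on array columns, though I have not verified that variant against this exact data.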
    

    I hope this helps.
