Comparing two arrays and getting the difference in PySpark

陌清茗 2020-12-19 17:31

I have two array fields in a data frame.

I need to compare these two arrays and get the difference as an array (a new column) in the same data frame.

2 answers
  • 2020-12-19 17:57

    You can use a user-defined function. My example dataframe differs a bit from yours, but the code should work fine:

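    import pandas as pd
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    
    # example df (assumes an existing sqlContext, e.g. in the pyspark shell)
    df = sqlContext.createDataFrame(pd.DataFrame(
        data=[[["hello", "world"], ["world"]],
              [["sample", "overflow", "text"], ["sample", "text"]]],
        columns=["A", "B"]))
    
    # define udf: elements of A that are not in B (set-based, so duplicates are dropped)
    differencer = udf(lambda x, y: list(set(x) - set(y)), ArrayType(StringType()))
    df = df.withColumn('difference', differencer('A', 'B'))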
    

    EDIT:

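    The set-based version drops duplicates, because a set keeps only unique elements. If duplicates in A must be preserved, amend the udf as follows:

    # keep every element of A that does not appear in B, in the original order
    differencer = udf(lambda x, y: [elt for elt in x if elt not in y], ArrayType(StringType()))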
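To see the effect of duplicates outside Spark, here is a minimal plain-Python sketch of the two lambda bodies used above. The function names `diff_unique` and `diff_keep_duplicates` and the sample lists are made up for illustration; the second function also converts `y` to a set for faster lookups, which assumes the elements are hashable (true for strings).

```python
def diff_unique(x, y):
    """Set-based difference: duplicates are dropped and order is not kept."""
    return list(set(x) - set(y))


def diff_keep_duplicates(x, y):
    """Keep every element of x that is absent from y, in original order."""
    exclude = set(y)  # set lookup avoids rescanning y for each element of x
    return [elt for elt in x if elt not in exclude]


a = ["hello", "world", "hello"]
b = ["world"]

unique_result = diff_unique(a, b)
dup_result = diff_keep_duplicates(a, b)

print(unique_result)  # ['hello']
print(dup_result)     # ['hello', 'hello']
```

Either function body could be wrapped in `udf(..., ArrayType(StringType()))` in the same way as the lambdas above.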
    
  • 2020-12-19 17:57

    Since Spark 2.4.0, this can be solved easily using array_except. Taking the same example:

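    import pandas as pd
    from pyspark.sql import functions as F
    
    # example df (assumes an existing sqlContext, e.g. in the pyspark shell)
    df = sqlContext.createDataFrame(pd.DataFrame(
        data=[[["hello", "world"], ["world"]],
              [["sample", "overflow", "text"], ["sample", "text"]]],
        columns=["A", "B"]))
    
    # built-in array difference: elements of A that are not in B
    df = df.withColumn('difference', F.array_except('A', 'B'))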
    

    For more operations of this kind on arrays, I suggest this blog post: https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read
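One thing to keep in mind: array_except is documented as returning the elements of the first array that are not in the second, without duplicates. The plain-Python sketch below models that behavior so it can be checked without a Spark session. The helper `array_except_like` is hypothetical and is not part of PySpark; it keeps first-occurrence order, which is a choice of this sketch, not a claim about Spark's ordering or its null handling.

```python
def array_except_like(a, b):
    """Hypothetical model: elements of a that are not in b, without duplicates."""
    exclude = set(b)
    seen = set()
    result = []
    for elt in a:
        if elt not in exclude and elt not in seen:
            seen.add(elt)
            result.append(elt)
    return result


# The two example rows from the answers, plus one row with a duplicate in A.
rows = [
    (["hello", "world"], ["world"]),
    (["sample", "overflow", "text"], ["sample", "text"]),
    (["a", "a", "b"], ["b"]),
]

differences = [array_except_like(a, b) for a, b in rows]
print(differences)  # [['hello'], ['overflow'], ['a']]
```

If duplicates in the first array need to survive, the list-comprehension udf from the other answer is the closer fit.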
