Question
I have a dataset:
number | matricule<array> | name<array> | model<array>
-------|------------------|-------------|-------------
AA     | []               | [7]         | [7]
AA     | [9]              | [4]         | [9]
AA     | [8]              | [2]         | [8, 2]
AA     | [2]              | [3, 4]      | [3, 4]
I would like to add a new column "Flag" that contains true or false according to the comparison result.
Comparison rule:
if the model column contains all values of the name column and does not contain the matricule array ==> Flag = True,
else Flag = False.
If model contains both the matricule and the name values (as in row 3 of the example above) ==> Flag = False. For instance, in row 2, model = [9] does not contain name = [4], so Flag = False as well.
I tried with the code below:
def is_subset(a, b):
    if F.size(F.array_except(a, b)) == 0:
        return "True"
    else:
        return "False"

validate = F.udf(is_subset)
df_join = df_join.withColumn("Flag", validate(F.col("name"), F.col("model")))
return df_join
I got an error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage
in is_subset
AttributeError: 'NoneType' object has no attribute '_jvm'
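The root cause: functions from pyspark.sql.functions such as F.size and F.array_except build JVM Column expressions and cannot be applied to the plain Python lists a UDF receives; on the executors there is no active SparkContext, hence 'NoneType' object has no attribute '_jvm'. A UDF has to do the work in plain Python. A minimal sketch of a corrected UDF (returning a real boolean instead of the strings "True"/"False"; it reads "does not contain matricule" as "no matricule value appears in model", which matches the expected output below):

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

def is_flag(matricule, name, model):
    # Plain Python only: the UDF receives ordinary lists, not Column objects.
    name_ok = all(x in model for x in name)             # model contains every name value
    matricule_hit = any(x in model for x in matricule)  # model contains some matricule value
    return name_ok and not matricule_hit

validate = F.udf(is_flag, BooleanType())
df_join = df_join.withColumn("Flag", validate("matricule", "name", "model"))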
I also tried this code:
df_join = df_join.withColumn("Flag", F.when(
    ((F.size(F.expr(F.array_except(F.col("name"), F.col("model"))) == F.lit(0)))
     & (F.size(F.expr(F.array_except(F.col("matricule"), F.col("model"))) != F.lit(0)))), True
).otherwise(False))
I got this error:
TypeError: 'Column' object is not callable in this line ((F.size(F.expr(F.array_except(F.col("name"), F.col("model"))) == F.lit(0)))
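Two things go wrong here: F.expr expects a SQL string, not a Column, so wrapping F.array_except(...) in it fails; and the parentheses attach == F.lit(0) to the wrong subexpression. Note also that the size(array_except(matricule, model)) != 0 test would misclassify the first row, where matricule is empty. A sketch of a working native-column version that drops F.expr and checks for an empty intersection instead (array_intersect requires Spark 2.4+):

df_join = df_join.withColumn(
    "Flag",
    F.when(
        # model contains every name value ...
        (F.size(F.array_except(F.col("name"), F.col("model"))) == 0)
        # ... and shares no value with matricule
        & (F.size(F.array_intersect(F.col("matricule"), F.col("model"))) == 0),
        True,
    ).otherwise(False),
)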
Expected result:
number | matricule<array> | name<array> | model<array> | Flag
-------|------------------|-------------|--------------|------
AA     | []               | [7]         | [7]          | True
AA     | [9]              | [4]         | [9]          | False
AA     | [8]              | [2]         | [8, 2]       | False
AA     | [2]              | [3, 4]      | [3, 4]       | True
Can someone please suggest a solution? Thank you.
Answer 1:
If you don't want to preserve duplicates:
from pyspark.sql import functions as F
df.withColumn("nm", F.array_intersect("name","model"))\
.withColumn("nm1", F.array_intersect("matriculate","model"))\
.withColumn("flag2", F.when((F.col("nm1")==F.col("matriculate"))&(F.size("matriculate")!=0),F.lit(False)).otherwise(F.lit(True)))\
.withColumn("Flag", F.when((F.col("flag2")==F.lit(True))&(F.col("nm")!=F.col("name")), F.lit(False)).otherwise(F.col("flag2")))\
.drop("nm","nm1","flag2").show()
+------+-----------+------+------+-----+
|number|matriculate| name| model| Flag|
+------+-----------+------+------+-----+
| AA| []| [7]| [7]| true|
| AA| [9]| [4]| [9]|false|
| AA| [8]| [2]|[8, 2]|false|
| AA| [2]|[3, 4]|[3, 4]| true|
+------+-----------+------+------+-----+
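For reference, a minimal sketch of how the sample DataFrame above can be built (the matriculate spelling and the long-typed array elements follow this answer's code and are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AA", [], [7], [7]),
     ("AA", [9], [4], [9]),
     ("AA", [8], [2], [8, 2]),
     ("AA", [2], [3, 4], [3, 4])],
    "number string, matriculate array<long>, name array<long>, model array<long>",
)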
If you want to preserve duplicates:
You have to use a UDF here, as array_intersect does not preserve duplicates:
from pyspark.sql.types import ArrayType, LongType

def intersect(ar1, ar2):
    # Keeps duplicates and ordering from ar1, unlike array_intersect.
    return [i for i in ar1 if i in ar2]

udf1 = F.udf(intersect, ArrayType(LongType()))

(df.withColumn("nm", udf1("name", "model"))
   .withColumn("nm1", udf1("matriculate", "model"))
   .withColumn("flag2", F.when((F.col("nm1") == F.col("matriculate")) & (F.size("matriculate") != 0), F.lit(False)).otherwise(F.lit(True)))
   .withColumn("Flag", F.when((F.col("flag2") == F.lit(True)) & (F.col("nm") != F.col("name")), F.lit(False)).otherwise(F.col("flag2")))
   .drop("nm", "nm1", "flag2")
   .show())
+------+-----------+------+------+-----+
|number|matriculate| name| model| Flag|
+------+-----------+------+------+-----+
| AA| []| [7]| [7]| true|
| AA| [9]| [4]| [9]|false|
| AA| [8]| [2]|[8, 2]|false|
| AA| [2]|[3, 4]|[3, 4]| true|
+------+-----------+------+------+-----+
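As an aside, on Spark 2.4+ the duplicate-preserving intersection can also be expressed without a Python UDF via the filter higher-order function in a SQL expression (a sketch, not part of the original answer; the flag logic is unchanged):

(df.withColumn("nm", F.expr("filter(name, x -> array_contains(model, x))"))
   .withColumn("nm1", F.expr("filter(matriculate, x -> array_contains(model, x))"))
   .withColumn("flag2", F.when((F.col("nm1") == F.col("matriculate")) & (F.size("matriculate") != 0), F.lit(False)).otherwise(F.lit(True)))
   .withColumn("Flag", F.when((F.col("flag2") == F.lit(True)) & (F.col("nm") != F.col("name")), F.lit(False)).otherwise(F.col("flag2")))
   .drop("nm", "nm1", "flag2")
   .show())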
Source: https://stackoverflow.com/questions/60212751/check-if-array-contain-an-array