Comparison operator in PySpark (not equal/ !=)

轮回少年 asked on 2020-12-31 04:43

I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all rows where only one of the two flags is set to '1' and the other is NOT.

2 Answers
  • 2020-12-31 05:01

    To select the rows where bar is null, try:

    foo_df = df.filter((df.foo == 1) & (df.bar.isNull()))

    https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
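
    For the question's first case (rows where both flags are set to '1'), a minimal sketch along the same lines, assuming the flag columns are named foo and bar as in the example in the other answer:

    # Both flags set to 1 (column names foo and bar are assumed)
    both_df = df.filter((df.foo == 1) & (df.bar == 1))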

  • 2020-12-31 05:02

    Why is it not filtering?

    Because this is SQL, and NULL indicates a missing value. As a consequence, any comparison to NULL other than IS NULL and IS NOT NULL is undefined. You need either:

    col("bar").isNull() | (col("bar") != 1)
    

    or

    coalesce(col("bar") != 1, lit(True))
    

    or, if you want null-safe comparisons in PySpark (>= 2.3), the negated null-safe equality test (eqNullSafe(1) alone would match rows where bar equals 1, the opposite of what you want):

    ~col("bar").eqNullSafe(1)

    Also, 'null' is not a valid way to introduce a NULL literal in Python code. You should use None to indicate missing objects:

    from pyspark.sql.functions import col, coalesce, lit

    # Example data: NULLs in 'foo' and 'bar' are introduced with Python's None
    df = spark.createDataFrame([
        ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
        ('c', 1, None), ('d', None, 1), ('e', 1, 1)
    ]).toDF('id', 'foo', 'bar')

    # Null-safe "bar != 1" via an explicit IS NULL check
    df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()
    
    ## +---+---+----+
    ## | id|foo| bar|
    ## +---+---+----+
    ## |  a|  1|null|
    ## |  c|  1|null|
    ## +---+---+----+
    
    df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()
    
    ## +---+---+----+
    ## | id|foo| bar|
    ## +---+---+----+
    ## |  a|  1|null|
    ## |  c|  1|null|
    ## +---+---+----+
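
    For completeness, the same filter written with the null-safe operator mentioned above (PySpark >= 2.3); given the example data it selects the same two rows:

    df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()

    ## expected: the same rows 'a' and 'c' as above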
    