I am trying to obtain all rows in a dataframe where two flags are set to \'1\' and subsequently all those that where only one of two is set to \'1\' and the other NO
Why is it not filtering
Because it is SQL and NULL indicates missing values. Because of that any comparison to NULL, other than IS NULL and IS NOT NULL is undefined. You need either:
col("bar").isNull() | (col("bar") != 1)
or
coalesce(col("bar") != 1, lit(True))
or (PySpark >= 2.3):
col("bar").eqNullSafe(1)
if you want null safe comparisons in PySpark.
Also 'null' is not a valid way to introduce NULL literal. You should use None to indicate missing objects.
from pyspark.sql.functions import col, coalesce, lit
df = spark.createDataFrame([
('a', 1, 1), ('a',1, None), ('b', 1, 1),
('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')
df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+