I am trying to obtain all rows in a dataframe where two flags are set to '1', and, subsequently, all rows where only one of the two flags is set to '1' and the other is not.
To filter on null values, try:
foo_df = df.filter((df.foo == 1) & (df.bar.isNull()))
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
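For completeness, both queries from the question can be expressed this way (a minimal sketch, assuming the flags are numeric columns named foo and bar, as in the example further down):
from pyspark.sql.functions import col

# rows where both flags are set to 1
both_df = df.filter((col("foo") == 1) & (col("bar") == 1))

# rows where foo is 1 and bar is missing; isNull() is required here,
# since a plain comparison against NULL never evaluates to true
foo_only_df = df.filter((col("foo") == 1) & col("bar").isNull())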
Why is it not filtering?
Because it is SQL, and NULL indicates a missing value. Because of that, any comparison to NULL, other than IS NULL and IS NOT NULL, is undefined. You need either:
col("bar").isNull() | (col("bar") != 1)
or
coalesce(col("bar") != 1, lit(True))
or (PySpark >= 2.3):
col("bar").eqNullSafe(1)
if you want null safe comparisons in PySpark.
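The undefined comparison is easy to verify directly: NULL = 1 and NULL != 1 both evaluate to NULL, which WHERE treats as false, so the null rows silently disappear (a quick check, run against any active SparkSession):
spark.sql("SELECT NULL = 1 AS eq, NULL != 1 AS neq, NULL IS NULL AS is_null").show()
## +----+----+-------+
## |  eq| neq|is_null|
## +----+----+-------+
## |null|null|   true|
## +----+----+-------+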
Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.
from pyspark.sql.functions import col, coalesce, lit

# use None (not the string 'null') for missing values
df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')
df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
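And on PySpark >= 2.3 the null-safe variant gives the same result (a sketch; ~ negates the column expression, equivalent to NOT (bar <=> 1) in SQL):
df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+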