I'm trying to filter a PySpark DataFrame that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'), ..., Row(dt_mvmt=None), ...]
You can use Column.isNull / Column.isNotNull:
from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
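To make that concrete, here is a minimal self-contained sketch; the data and the `spark` session are assumptions, made up to mirror the question's dt_mvmt column:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data shaped like the question's column
    df = spark.createDataFrame(
        [("2016-03-27",), ("2016-03-28",), (None,)],
        ["dt_mvmt"],
    )

    df.where(col("dt_mvmt").isNull()).show()     # keeps only the None row
    df.where(col("dt_mvmt").isNotNull()).show()  # drops the None row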
If you simply want to drop NULL values, you can use na.drop with the subset argument:
df.na.drop(subset=["dt_mvmt"])
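Continuing the sketch above, this produces the same rows as the isNotNull filter; dropna is an equivalent spelling:

    # Both keep only rows where dt_mvmt is not NULL
    df.na.drop(subset=["dt_mvmt"]).show()
    df.dropna(subset=["dt_mvmt"]).show()  # DataFrame.dropna is an alias of na.drop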
Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it to another value returns NULL:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## | null|
## +-------------+
sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## | null|
## +-------------------+
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
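To round this out, the same check expressed with IS does return a proper boolean (a sketch; the printed column header varies across Spark versions):

    sqlContext.sql("SELECT NULL IS NULL").show()
    ## Unlike NULL = NULL above, this returns true rather than null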