Comparison operator in PySpark (not equal/ !=)

[亡魂溺海] 提交于 2019-12-18 15:01:13

问题


I am trying to obtain all rows in a dataframe where two flags are set to '1' and subsequently all those that where only one of two is set to '1' and the other NOT EQUAL to '1'

With the following schema (three columns),

df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                            schema=('id', 'foo', 'bar')
                            )

I obtain the following dataframe:

+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but not the other (foo=1 AND NOT bar=1)

foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:

+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the non-behaving filter:

foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the columns where only foo is equal to '1'?


回答1:


To filter null values try:

foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull




回答2:


Why is it not filtering

Because it is SQL and NULL indicates missing values. Because of that any comparison to NULL, other than IS NULL and IS NOT NULL is undefined. You need either:

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

or (PySpark >= 2.3):

col("bar").eqNullSafe(1)

if you want null safe comparisons in PySpark.

Also 'null' is not a valid way to introduce NULL literal. You should use None to indicate missing objects.

from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+


来源:https://stackoverflow.com/questions/39120934/comparison-operator-in-pyspark-not-equal

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!