Question
Suppose you have a dataframe with columns of various types (string, double, ...) and a special value "miss" that represents a missing value in string-typed columns.
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([
    [1, 'miss'],
    [2, 'x'],
    [None, 'y']
], columns=['intcol', 'strcol'])
df = spark.createDataFrame(data=pdf)
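For reference, this is roughly what the resulting schema looks like (my own addition, not part of the original data setup): because of the None, pandas stores intcol as float64, so the Spark column comes out as double and the missing value becomes NaN rather than a true null.
# Schema of the converted dataframe (the None forces a float64/double column)
df.printSchema()
# root
#  |-- intcol: double (nullable = true)
#  |-- strcol: string (nullable = true)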
I am trying to count the number of non-missing values for each column, using filtering like this:
col = df['strcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
Which works for the string column:
+------+------+
|intcol|strcol|
+------+------+
| 2.0| x|
| NaN| y|
+------+------+
However, for the numeric column, it filters out all the rows:
col = df['intcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
+------+------+
|intcol|strcol|
+------+------+
+------+------+
It seems like this is because the cross-type comparison of the numeric column with a string value results in all-null values:
df.select(df['intcol'] != 'miss').show()
+---------------------+
|(NOT (intcol = miss))|
+---------------------+
| null|
| null|
| null|
+---------------------+
Which I find a bit unexpected (e.g. 1 != '' is True, not null, in "normal" Python).
My question is really several questions:
- why does the cross-type comparison result in nulls?
- what is the best way to test for equality/non-equality across different types in the "expected way"? Or (in my case) do I need to include separate logic that switches based on the type of the column?
- It seems like df.filter(~df['intcol'].isin(['miss'])) does the job, but I wonder if that is less efficient?
Answer 1:
Let's start with the why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to objects of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. In general, numeric types have higher precedence, therefore (following the execution plan from df.select(df['intcol'] != 'miss').explain(True)):
== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
is rewritten as
== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
where 'miss' is cast to double, and the whole comparison is later folded to NULL:
== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false
because the cast of that operand is undefined (the string 'miss' cannot be parsed as a double).
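You can see the casting behaviour directly with a small check (my own illustration, assuming Spark's default, non-ANSI cast rules):
# Casting a non-numeric string to double yields NULL under the default
# (non-ANSI) cast rules, which is why the comparison folds to NULL above.
spark.sql("SELECT CAST('miss' AS DOUBLE) AS casted").show()
# -> a single row containing null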
Since equality with NULL is undefined as well (see Difference between === null and isNull in Spark DataFrame), the filter yields an empty result.
Now how to address that. Both explicit casting:
df.filter(df['intcol'].cast("string") != 'miss')
and null safe equality:
df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))
should do the trick.
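Since the original goal was counting non-missing values per column, here is a rough sketch of how that could look with the cast-based approach (my own example, reusing the df from the question, not a definitive recipe):
from pyspark.sql import functions as F

# Count values that are neither NULL nor equal to the string 'miss'.
# Casting each column to string keeps the comparison type-safe.
df.select([
    F.count(
        F.when(df[c].isNotNull() & (df[c].cast("string") != 'miss'), 1)
    ).alias(c)
    for c in df.columns
]).show()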
Also please note that NaN values are not NULL, and the conversion via Pandas is lossy - see Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?
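If true nulls matter, one option (a sketch of mine, not from the linked question) is to build the DataFrame without going through pandas, so the missing number stays a real NULL instead of becoming NaN:
# Construct the same data with an explicit schema; the missing value is
# now a genuine NULL, so isNotNull() behaves as expected.
df2 = spark.createDataFrame(
    [(1.0, 'miss'), (2.0, 'x'), (None, 'y')],
    schema='intcol double, strcol string'
)
df2.filter(df2['intcol'].isNotNull() & (df2['intcol'].cast("string") != 'miss')).show()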
Source: https://stackoverflow.com/questions/54458070/pyspark-dataframes-filtering-using-comparisons-between-columns-of-different-ty