Question
Suppose you have a dataframe with columns of various types (string, double, ...) and a special value "miss" that represents a missing value in string-typed columns.
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([
    [1, 'miss'],
    [2, 'x'],
    [None, 'y']
], columns=['intcol', 'strcol'])
df = spark.createDataFrame(data=pdf)
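For reference, this is roughly what the resulting schema looks like (my own addition, not part of the original data setup): because of the None, pandas stores intcol as float64, so the Spark column comes out as double and the missing value becomes NaN rather than a true null.
# Schema of the converted dataframe (the None forces a float64/double column)
df.printSchema()
# root
#  |-- intcol: double (nullable = true)
#  |-- strcol: string (nullable = true)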
I am trying to count the number of non-missing values for each column, using filtering like this:
col = df['strcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
Which works for the string column:
+------+------+
|intcol|strcol|
+------+------+
| 2.0| x|
| NaN| y|
+------+------+
However, for the numeric column, it filters out all the rows:
col = df['intcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
+------+------+
|intcol|strcol|
+------+------+
+------+------+
It seems like this is because the cross-type comparison of the numeric column with a string value results in all-null values:
df.select(df['intcol'] != 'miss').show()
+---------------------+
|(NOT (intcol = miss))|
+---------------------+
| null|
| null|
| null|
+---------------------+
Which I find a bit unexpected (e.g. 1 != '' is True, not null, in "normal" Python).
My question is really several questions:
- why does the cross-type comparison result in nulls?
- what is the best way to test for equality/non-equality across different types in the "expected way"? Or (in my case) do I need to include separate logic that switches based on the type of the column?
- It seems like df.filter(~df['intcol'].isin(['miss'])) does the job, but I wonder if that is less efficient?
Answer 1:
Let's start with the why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to objects of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. In general, numeric types have higher precedence, therefore (following the execution plan from df.select(df['intcol'] != 'miss').explain(True)):
== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
is rewritten as
== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
where 'miss' is cast to double, and the whole comparison is later folded to NULL:
== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false
because the cast of that operand is undefined (the string 'miss' cannot be parsed as a double).
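You can see the casting behaviour directly with a small check (my own illustration, assuming Spark's default, non-ANSI cast rules):
# Casting a non-numeric string to double yields NULL under the default
# (non-ANSI) cast rules, which is why the comparison folds to NULL above.
spark.sql("SELECT CAST('miss' AS DOUBLE) AS casted").show()
# -> a single row containing null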
Since equality with NULL is undefined as well (see Difference between === null and isNull in Spark DataFrame), the filter yields an empty result.
Now how to address that. Both explicit casting:
df.filter(df['intcol'].cast("string") != 'miss')
and null safe equality:
df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))
should do the trick.
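Since the original goal was counting non-missing values per column, here is a rough sketch of how that could look with the cast-based approach (my own example, reusing the df from the question, not a definitive recipe):
from pyspark.sql import functions as F

# Count values that are neither NULL nor equal to the string 'miss'.
# Casting each column to string keeps the comparison type-safe.
df.select([
    F.count(
        F.when(df[c].isNotNull() & (df[c].cast("string") != 'miss'), 1)
    ).alias(c)
    for c in df.columns
]).show()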
Also please note that NaN values are not NULL, and the conversion via Pandas is lossy - see Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?
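If true nulls matter, one option (a sketch of mine, not from the linked question) is to build the DataFrame without going through pandas, so the missing number stays a real NULL instead of becoming NaN:
# Construct the same data with an explicit schema; the missing value is
# now a genuine NULL, so isNotNull() behaves as expected.
df2 = spark.createDataFrame(
    [(1.0, 'miss'), (2.0, 'x'), (None, 'y')],
    schema='intcol double, strcol string'
)
df2.filter(df2['intcol'].isNotNull() & (df2['intcol'].cast("string") != 'miss')).show()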
Source: https://stackoverflow.com/questions/54458070/pyspark-dataframes-filtering-using-comparisons-between-columns-of-different-ty