Differences between null and NaN in Spark? How to deal with them?

失恋的感觉 2020-12-14 07:32

In my DataFrame, there are columns that contain null and NaN values respectively, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
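
The output of df.show() looks roughly like this, with null and NaN displayed side by side:

df.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+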


        
3 Answers
  •  粉色の甜心
    2020-12-14 07:54

    A null value represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to indicate that nothing useful exists.

    NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
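
    A quick way to see the difference (a minimal sketch; the frame and the column names x and y are made up for illustration): dividing two double zeros yields NaN, while an expression involving null simply stays null.

    from pyspark.sql.functions import col

    demo = spark.createDataFrame([(0.0, 0.0), (None, 1.0)], ("x", "y"))
    demo.select((col("x") / col("y")).alias("x_div_y")).show()
    # +-------+
    # |x_div_y|
    # +-------+
    # |    NaN|
    # |   null|
    # +-------+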

    One possible way to handle null values is to remove them with:

    df.na.drop()
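
    Note that na.drop also accepts how and subset arguments, and for numeric columns it treats NaN as missing as well; a minimal sketch using the columns of the example frame:

    # drop a row only if all of the listed columns are missing
    df.na.drop(how="all", subset=["a", "b"])

    # drop a row if column "a" is null
    df.na.drop(subset=["a"])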
    

    Or you can replace them with an actual value (here I used 0) with:

    df.na.fill(0)
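
    na.fill can also take a dict to choose a different replacement per column, and for double columns it should replace NaN as well as null (a minimal sketch based on the example frame):

    # replace missing values column by column
    df.na.fill({"a": 0, "b": 0.0})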
    

    Another way would be to select the rows where a specific column is null for further processing:

    from pyspark.sql.functions import col

    df.where(col("a").isNull())
    df.where(col("a").isNotNull())
    

    Rows with NaN can be selected in a similar way, using the isnan function:

    from pyspark.sql.functions import isnan
    df.where(isnan(col("a")))
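
    Since isNull and isnan each catch only one kind of missing value, they can be combined to filter out both at once (a minimal sketch using column "b" from the example frame):

    from pyspark.sql.functions import col, isnan

    # keep only rows where "b" is neither null nor NaN
    df.where(col("b").isNotNull() & ~isnan(col("b")))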
    
