How to replace null, NaN, or Infinite values with a default value in Spark Scala

Submitted by 那年仲夏 on 2019-12-22 11:08:05

Question


I'm reading CSVs into Spark and setting the schema so that every column is DecimalType(10,0). When I query the data, I get the following error:

NumberFormatException: Infinite or NaN

If I have NaN/null/infinite values in my dataframe, I would like to set them to 0. How do I do this? This is how I'm attempting to load the data:

var cases = spark.read.option("header", false).
  option("nanValue", "0").
  option("nullValue", "0").
  option("positiveInf", "0").
  option("negativeInf", "0").
  schema(schema).
  csv(...

Any help would be greatly appreciated.


Answer 1:


If you have NaN values in multiple columns, you can use na.fill() to replace them with a default value.

Example:

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()

import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq((0f, 0f, "2016-01-1"),
      (1f, 1f, "2016-02-2"),
      (2f, 2f, null),
      (Float.NaN, Float.NaN, "2016-04-25"),
      (4f, 4f, "2016-05-21"),
      (Float.NaN, Float.NaN, "2016-06-1"),
      (6f, 6f, "2016-03-21"))
).toDF("id1", "id", "date")

data.na.fill(0).show
+---+---+----------+
|id1| id|      date|
+---+---+----------+
|0.0|0.0| 2016-01-1|
|1.0|1.0| 2016-02-2|
|2.0|2.0|      null|
|0.0|0.0|2016-04-25|
|4.0|4.0|2016-05-21|
|0.0|0.0| 2016-06-1|
|6.0|6.0|2016-03-21|
+---+---+----------+
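na.fill() can also be limited to specific columns by passing a column list. A minimal sketch, assuming the data DataFrame from the example above:

```scala
// Fill NaN/null with 0 only in "id1"; the "id" column keeps its NaN values.
data.na.fill(0, Seq("id1")).show
```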



Answer 2:


You can replace NaN values in a single DataFrame column with 0 using the expression below. In this example, any NaN or null values in column col1 are replaced with 0.

val df = (1 to 10).toDF("col1")
  .withColumn("col1", when($"col1".isNull || $"col1".isNaN, 0).otherwise($"col1"))
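Note that (1 to 10) contains no NaN values, so nothing actually changes in that DataFrame. To see the replacement in action, the same expression can be applied to a column that does hold a NaN (a sketch; the dfNaN name and the sample data are made up for illustration):

```scala
// Hypothetical sample data containing a NaN in "col1".
val dfNaN = Seq(1.0, Double.NaN, 3.0).toDF("col1")
  .withColumn("col1", when($"col1".isNull || $"col1".isNaN, 0).otherwise($"col1"))
// "col1" should now hold 1.0, 0.0, 3.0.
```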



Answer 3:


My environment (Spark 2.3.1 with Scala 2.11) doesn't reproduce @ShankarKoirala's answer - the .na.fill()… call doesn't capture the infinity and NaN values, because those are not null values. However, such values can be tested for using the .isin() function:

val x1 = Seq((1.0, 1, "a"), (1.0, 1, "a"), (2.0, 2, "b"),
             (Float.NaN, 1, "a"), (Float.PositiveInfinity, 2, "a"),
             (Float.NegativeInfinity, 2, "a"))
  .toDF("Value", "Id", "Name")

x1
  .withColumn("IsItNull", $"Value".isNull)
  .withColumn("IsItBad", $"Value".isin(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity))
  .show()

This produces the following results:

+---------+---+----+--------+-------+
|    Value| Id|Name|IsItNull|IsItBad|
+---------+---+----+--------+-------+
|      1.0|  1|   a|   false|  false|
|      1.0|  1|   a|   false|  false|
|      2.0|  2|   b|   false|  false|
|      NaN|  1|   a|   false|   true|
| Infinity|  2|   a|   false|   true|
|-Infinity|  2|   a|   false|   true|
+---------+---+----+--------+-------+

If a replacement is needed, just use the original column name in the withColumn() function and apply the .isin() test as the condition of the when function.
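That replacement could be sketched like this, against the x1 DataFrame above (the added isNull check and the x1Fixed name are additions for illustration, not part of the original answer):

```scala
import org.apache.spark.sql.functions.when

// Replace null, NaN, and +/-Infinity in "Value" with 0.0.
val x1Fixed = x1.withColumn("Value",
  when($"Value".isNull ||
       $"Value".isin(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity), 0d)
    .otherwise($"Value"))
```

After this, the NaN and +/-Infinity rows should carry 0.0 in Value, while the ordinary rows are untouched.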



Source: https://stackoverflow.com/questions/44296484/how-to-replace-null-nan-or-infinite-values-to-default-value-in-spark-scala
