How to replace null, NaN, or infinite values with a default value in Spark Scala

Submitted by 白昼怎懂夜的黑 on 2019-12-05 21:28:48

If you have NaN values in multiple columns, you can use na.fill() to replace them with a default value.

Example:

import org.apache.spark.sql.SparkSession

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()

import spark.implicits._

// Two float columns containing NaN, plus a string date column
val data = spark.sparkContext.parallelize(
  Seq((0f, 0f, "2016-01-1"),
      (1f, 1f, "2016-02-2"),
      (2f, 2f, "2016-03-21"),
      (Float.NaN, Float.NaN, "2016-04-25"),
      (4f, 4f, "2016-05-21"),
      (Float.NaN, Float.NaN, "2016-06-1"),
      (6f, 6f, "2016-03-21"))
).toDF("id1", "id", "date")

// Fill the numeric columns with 0; the string column is left untouched
data.na.fill(0).show
+---+---+----------+
|id1| id|      date|
+---+---+----------+
|0.0|0.0| 2016-01-1|
|1.0|1.0| 2016-02-2|
|2.0|2.0|2016-03-21|
|0.0|0.0|2016-04-25|
|4.0|4.0|2016-05-21|
|0.0|0.0| 2016-06-1|
|6.0|6.0|2016-03-21|
+---+---+----------+
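
When different columns need different defaults, na.fill() also accepts a Map from column name to replacement value. The following is a minimal sketch, not part of the original answer: the helper name fillDefaults, the sample rows, and the "unknown" placeholder are all assumptions made for illustration. The Spark documentation names both null and NaN for the numeric fill(value) variant; I believe the Map variant treats float and double columns the same way, but check this on your Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillPerColumn {
  // Hypothetical helper: 0.0 for missing "id" values, "unknown" for missing "date" values.
  def fillDefaults(df: DataFrame): DataFrame =
    df.na.fill(Map("id" -> 0.0, "date" -> "unknown"))

  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    // Illustrative rows: one NaN in the numeric column, one null in the string column.
    val data = Seq[(Double, String)](
      (0.0, "2016-01-1"),
      (Double.NaN, "2016-04-25"),
      (2.0, null)
    ).toDF("id", "date")

    fillDefaults(data).show()
    spark.stop()
  }
}
```

Columns that are not listed in the Map are left as they are, so this form is useful when a single default such as 0 does not make sense for every column.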

You can set a single DataFrame column to 0 in place of NaN using the expression below. In this example, any NaN values in column col1 are replaced with 0.

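import org.apache.spark.sql.functions.when

// Assumes spark.implicits._ is in scope for toDF and the $"..." syntax
val df = (1 to 10).toDF("col1")
  .withColumn("col1",
    when(when($"col1".isNull, 0).otherwise($"col1").isNaN, 0).otherwise($"col1"))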

In my environment (Spark 2.3.1 with Scala 2.11) I could not reproduce @ShankarKoirala's answer: the .na.fill()… doesn't capture the infinity and NaN values, because those are not empty values. However, such values can be tested for with the .isin() function:

// Double constants are used throughout so the Value column is unambiguously a double
val x1 = Seq(
  (1.0, 1, "a"),
  (1.0, 1, "a"),
  (2.0, 2, "b"),
  (Double.NaN, 1, "a"),
  (Double.PositiveInfinity, 2, "a"),
  (Double.NegativeInfinity, 2, "a")
).toDF("Value", "Id", "Name")

x1
  .withColumn("IsItNull", $"Value".isNull)
  .withColumn("IsItBad", $"Value".isin(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity))
  .show()

This produces the following results:

+---------+---+----+--------+-------+
|    Value| Id|Name|IsItNull|IsItBad|
+---------+---+----+--------+-------+
|      1.0|  1|   a|   false|  false|
|      1.0|  1|   a|   false|  false|
|      2.0|  2|   b|   false|  false|
|      NaN|  1|   a|   false|   true|
| Infinity|  2|   a|   false|   true|
|-Infinity|  2|   a|   false|   true|
+---------+---+----+--------+-------+

If a replacement is needed, reuse the original column name in withColumn() and pass the .isin() test as the condition of the when function.
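
The replacement described above could look like the following. This is a minimal sketch rather than the answer author's code: the helper name replaceBad and the default of 0.0 are assumptions, and the extra isNull check is my addition so that nulls get the same default as NaN and the infinities.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object ReplaceBadValues {
  // Hypothetical helper: overwrite null, NaN and +/-Infinity in one double column.
  def replaceBad(df: DataFrame, name: String, default: Double): DataFrame = {
    val isBad = col(name).isNull ||
      col(name).isin(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)
    // Reusing the original column name replaces the column in place.
    df.withColumn(name, when(isBad, lit(default)).otherwise(col(name)))
  }

  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val x1 = Seq[(Double, Int, String)](
      (1.0, 1, "a"),
      (Double.NaN, 1, "a"),
      (Double.PositiveInfinity, 2, "a"),
      (Double.NegativeInfinity, 2, "a")
    ).toDF("Value", "Id", "Name")

    replaceBad(x1, "Value", 0.0).show()
    spark.stop()
  }
}
```

Keeping the test in a single Column expression means the same condition can also be used with filter() if you would rather drop the bad rows than overwrite them.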
