Convert null values to empty array in Spark DataFrame


I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
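
A minimal PySpark sketch of how such a frame can arise (the table and column names here are hypothetical); the left outer join leaves null in the array column for unmatched keys:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Hypothetical setup: every id from "left" is kept, so ids with no match
    # in "right" end up with a null "numbers" array after the join.
    left = spark.createDataFrame([("a",), ("b",)], ["id"])
    right = spark.createDataFrame([("a", [1, 2, 3])], ["id", "numbers"])
    
    df = left.join(right, on="id", how="left_outer")  # "numbers" is null for id "b"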

3 Answers
  • 2020-12-01 11:20

    A UDF-free alternative, useful when the data type you want for your array elements cannot be cast from StringType, is the following:

    import pyspark.sql.types as T
    import pyspark.sql.functions as F
    
    df.withColumn(
        "myCol",
        F.coalesce(
            F.col("myCol"),
            F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
        )
    )
    

    You can replace IntegerType() with whatever data type you need, including complex ones.
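
    For instance, here is a hedged sketch of the same pattern with a complex element type (the struct fields are made up for illustration):

    import pyspark.sql.types as T
    import pyspark.sql.functions as F
    
    # Hypothetical complex element type: an array of structs instead of integers.
    element_type = T.StructType([
        T.StructField("id", T.LongType()),
        T.StructField("name", T.StringType()),
    ])
    
    df.withColumn(
        "myCol",
        F.coalesce(
            F.col("myCol"),
            # from_json parses the literal "[]" into an empty array of the target type
            F.from_json(F.lit("[]"), T.ArrayType(element_type)),
        )
    )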

  • 2020-12-01 11:33

    You can use a UDF:

    import org.apache.spark.sql.functions.{coalesce, udf, when}
    
    val array_ = udf(() => Array.empty[Int])
    

    combined with WHEN or COALESCE:

    df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
    df.withColumn("myCol", coalesce(myCol, array_())).show
    

    In recent versions you can use the array function:

    import org.apache.spark.sql.functions.{array, lit}
    
    df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
    df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
    

    Please note that it will work only if conversion from string to the desired type is allowed.

    The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:

    from pyspark.sql.functions import coalesce, col, udf
    from pyspark.sql.types import ArrayType, IntegerType
    
    def empty_array(t):
        return udf(lambda: [], ArrayType(t()))()
    
    coalesce(col("myCol"), empty_array(IntegerType()))
    

    and in recent versions just use array:

    from pyspark.sql.functions import array, coalesce, col
    
    coalesce(col("myCol"), array().cast("array<integer>"))
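
    For example, a short usage sketch applying that expression to a hypothetical nullable array column named myCol:

    from pyspark.sql.functions import array, coalesce, col
    
    # Replace nulls in "myCol" with an empty array<integer>.
    df = df.withColumn("myCol", coalesce(col("myCol"), array().cast("array<integer>")))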
    
  • 2020-12-01 11:36

    With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.

    val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
    df.show
    +---+---------+
    | id|  numbers|
    +---+---------+
    |  a|[1, 2, 3]|
    |  b|     null|
    |  c|[7, 8, 9]|
    +---+---------+
    
    val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
    df2.show
    +---+---------+
    | id|  numbers|
    +---+---------+
    |  a|[1, 2, 3]|
    |  b|       []|
    |  c|[7, 8, 9]|
    +---+---------+
    