Converting multiple different columns to Map column with Spark Dataframe scala

渐次进展 2020-12-16 18:34

I have a data frame with columns: user, address1, address2, address3, phone1, phone2 and so on. I want to convert this data frame to user, address, phone, where address and phone are map columns built from the corresponding address* and phone* columns.
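
For reference, a minimal sketch of the input shape and the target schema, assuming the column names above and a spark-shell style session (toDF implicits in scope):

    // toy input with the columns described in the question
    val input = Seq((1L, "a1", "a2", "a3", "p1", "p2"))
      .toDF("user", "address1", "address2", "address3", "phone1", "phone2")

    // desired result: user plus map-typed columns, roughly
    // user: bigint, address: map<string,string>, phone: map<string,string>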

1 Answer
  • 2020-12-16 19:14

    Spark >= 2.0

    You can skip the udf and use the map (create_map in Python) SQL function:

    import org.apache.spark.sql.functions.{col, lit, map}

    // mapData is the Seq of source column names (see the sketch below)
    df.select(
      map(mapData.map(c => lit(c) :: col(c) :: Nil).flatten: _*).alias("a_map")
    )
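
    A self-contained sketch of this approach, assuming a spark-shell session and the address columns from the question (mapData and the toy DataFrame below are illustrative):

    import org.apache.spark.sql.functions.{col, lit, map}

    // the columns to fold into the map (assumed from the question)
    val mapData = Seq("address1", "address2", "address3")

    val df = Seq((1L, "addr1", "addr2", "addr3"))
      .toDF("user", "address1", "address2", "address3")

    // interleave literal key names with the matching column values:
    // map(lit("address1"), col("address1"), lit("address2"), col("address2"), ...)
    val withMap = df.select(
      col("user"),
      map(mapData.flatMap(c => Seq(lit(c), col(c))): _*).alias("address")
    )

    withMap.printSchema()
    // roughly:
    // root
    //  |-- user: long (nullable = false)
    //  |-- address: map (nullable = false)
    //  |    |-- key: string
    //  |    |-- value: string (valueContainsNull = true)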
    

    Spark < 2.0

    As far as I know there is no direct way to do it. You can use a UDF like this:

    import org.apache.spark.sql.functions.{udf, array, lit, col}

    val df = sc.parallelize(Seq(
      (1L, "addr1", "addr2", "addr3")
    )).toDF("user", "address1", "address2", "address3")

    // names of the columns to fold into the map
    val mapData = Seq("address1", "address2", "address3")

    // zip keys with values and drop entries whose value is null
    val asMap = udf((keys: Seq[String], values: Seq[String]) =>
      keys.zip(values).filter {
        case (_, null) => false
        case _ => true
      }.toMap)

    val keys = array(mapData.map(lit): _*)
    val values = array(mapData.map(col): _*)

    val dfWithMap = df.withColumn("address", asMap(keys, values))
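
    A quick usage sketch (continuing from the toy df above): null values are dropped by the filter inside the UDF, and single entries can be read back from the map by key:

    // the whole map per user
    dfWithMap.select(col("user"), col("address")).show(false)

    // pull one entry back out of the map by key
    dfWithMap.select(col("address")("address1").alias("address1")).show()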
    

    Another option, which doesn't require UDFs, is to use a struct field instead of a map:

    import org.apache.spark.sql.functions.struct

    val dfWithStruct = df.withColumn("address", struct(mapData.map(col): _*))
    

    The biggest advantage is that it can easily handle values of different types.
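
    For completeness, a brief sketch of how the struct variant is queried; each field keeps its own type and is addressed with dot syntax (names assume the toy df above):

    dfWithStruct.printSchema()
    // address: struct<address1:string, address2:string, address3:string>

    // struct fields are selected by name
    dfWithStruct.select(col("user"), col("address.address1")).show()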
