Explode (transpose?) multiple columns in Spark SQL table

深忆病人 2020-11-27 04:56

I am using Spark SQL (I mention Spark in case it affects the SQL syntax; I'm not familiar enough to be sure yet), and I have a table that I am trying to re-

2 Answers
  •  自闭症患者
    2020-11-27 05:27

    Spark >= 2.4

    You can skip the zip UDF and use the built-in arrays_zip function:

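    import org.apache.spark.sql.functions.{arrays_zip, explode}
    // Zip both arrays into one array of structs, then emit one row per struct.
    df.withColumn("vars", explode(arrays_zip($"varA", $"varB"))).select(
      $"userId", $"someString",
      $"vars.varA", $"vars.varB").show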
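
    Since the question is about Spark SQL, the same idea can also be written as a query. The following is a minimal sketch, assuming Spark >= 2.4 and a local session. The object name ArraysZipSql, the view name df, the aliases zipped, a and b, and the single sample row are all illustrative, not part of the original answer. It uses inline to turn the array of structs straight into columns, so the query does not depend on the struct field names that arrays_zip generates.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ArraysZipSql {
  // Builds a one-row sample table and explodes varA/varB in lockstep using SQL only.
  def explodeZipped(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample row with the same shape as the data in this answer.
    val df = Seq((1L, "example1", Seq(0L, 2L, 5L), Seq(1L, 2L, 9L)))
      .toDF("userId", "someString", "varA", "varB")
    df.createOrReplaceTempView("df")

    // inline(...) emits one row per struct and one column per struct field;
    // the columns are named a and b here, then aliased back to varA and varB.
    spark.sql(
      """SELECT userId, someString, a AS varA, b AS varB
        |FROM df
        |LATERAL VIEW inline(arrays_zip(varA, varB)) zipped AS a, b""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("arrays-zip-sql")
      .getOrCreate()
    explodeZipped(spark).show()
    spark.stop()
  }
}
```

    If you prefer to keep the struct, explode(arrays_zip(varA, varB)) AS vars in a subquery followed by vars.varA and vars.varB in the outer query should behave like the DataFrame version above.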
    

    Spark < 2.4

    Before Spark 2.4 there is no built-in function for zipping arrays, so what you want requires a custom UDF. In Scala you could do something like this:

    val data = sc.parallelize(Seq(
        """{"userId": 1, "someString": "example1",
            "varA": [0, 2, 5], "varB": [1, 2, 9]}""",
        """{"userId": 2, "someString": "example2",
            "varA": [1, 20, 5], "varB": [9, null, 6]}"""
    ))
    
    val df = spark.read.json(data)
    
    df.printSchema
    // root
    //  |-- someString: string (nullable = true)
    //  |-- userId: long (nullable = true)
    //  |-- varA: array (nullable = true)
    //  |    |-- element: long (containsNull = true)
    //  |-- varB: array (nullable = true)
    //  |    |-- element: long (containsNull = true)
    

    Now we can define a zip UDF:

    import org.apache.spark.sql.functions.{udf, explode}
    
    val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
    
    df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
       $"userId", $"someString",
       $"vars._1".alias("varA"), $"vars._2".alias("varB")).show
    
    // +------+----------+----+----+
    // |userId|someString|varA|varB|
    // +------+----------+----+----+
    // |     1|  example1|   0|   1|
    // |     1|  example1|   2|   2|
    // |     1|  example1|   5|   9|
    // |     2|  example2|   1|   9|
    // |     2|  example2|  20|null|
    // |     2|  example2|   5|   6|
    // +------+----------+----+----+
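
    One behavioural difference between the two approaches is worth checking against your data: Scala's zip stops at the end of the shorter sequence, whereas arrays_zip pads the shorter array with nulls. The sketch below uses plain Scala collections (no Spark needed) to show the truncation, plus a padded variant built on zipAll. The object and method names are illustrative only.

```scala
object ZipSemantics {
  // The same function body the zip UDF wraps: pairs elements by position
  // and silently drops the tail of the longer sequence.
  def zipTruncating(xs: Seq[Long], ys: Seq[Long]): Seq[(Long, Long)] =
    xs.zip(ys)

  // Padded variant: missing positions become None instead of being dropped,
  // which is closer to how arrays_zip treats arrays of different lengths.
  def zipPadding(xs: Seq[Long], ys: Seq[Long]): Seq[(Option[Long], Option[Long])] =
    xs.map(Option(_)).zipAll(ys.map(Option(_)), None, None)

  def main(args: Array[String]): Unit = {
    val varA = Seq(0L, 2L, 5L)
    val varB = Seq(1L, 2L) // one element shorter than varA

    println(zipTruncating(varA, varB)) // two pairs: the trailing 5 is dropped
    println(zipPadding(varA, varB))    // three pairs: the last one has None on the right
  }
}
```

    If your arrays are always the same length, as in the example data, both variants produce the same rows.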
    

    With raw SQL:

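    // Register the same function under the name "zip" so it can be called from SQL.
    sqlContext.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
    // registerTempTable is deprecated since Spark 2.0; there, use df.createOrReplaceTempView("df").
    df.registerTempTable("df")
    
    sqlContext.sql(
      """SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df""")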
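
    The query above leaves vars as a struct column. To get separate varA and varB columns in SQL as well, one option is to wrap it in a subquery and select the tuple fields _1 and _2. This is a sketch assuming Spark 2.x, so it uses spark.udf.register and createOrReplaceTempView. The object name ZipUdfSql, the subquery alias t, and the single sample row are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ZipUdfSql {
  def explodeWithZipUdf(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample row with the same shape as the data in this answer.
    val df = Seq((1L, "example1", Seq(0L, 2L, 5L), Seq(1L, 2L, 9L)))
      .toDF("userId", "someString", "varA", "varB")
    df.createOrReplaceTempView("df")

    // Same UDF as in the answer: pairs elements of the two arrays by position.
    spark.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

    // Inner query explodes the zipped array; outer query unpacks the tuple fields.
    spark.sql(
      """SELECT userId, someString, vars._1 AS varA, vars._2 AS varB
        |FROM (
        |  SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df
        |) t""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("zip-udf-sql")
      .getOrCreate()
    explodeWithZipUdf(spark).show()
    spark.stop()
  }
}
```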
    
