How to melt a Spark DataFrame?

日久生厌 · 2020-11-22 02:57

Is there an equivalent of the Pandas melt function in Apache Spark, in PySpark or at least in Scala?

I was running a sample dataset till now in Python and now I want to u
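For context, melt turns a wide table into a long one: each non-id column becomes its own (variable, value) row. A minimal pure-Python illustration of that semantics (toy data and helper are my own; no Spark or pandas required):

```python
# Toy wide-format rows: one dict per record.
wide = [
    {"A": "a", "B": 1, "C": 2},
    {"A": "b", "B": 3, "C": 4},
]

def melt(rows, id_vars, value_vars, var_name="variable", value_name="value"):
    """Wide-to-long: emit one output row per (input row, value column)."""
    out = []
    for row in rows:
        for v in value_vars:
            rec = {k: row[k] for k in id_vars}  # carry the id columns along
            rec[var_name] = v                   # which column this value came from
            rec[value_name] = row[v]            # the value itself
            out.append(rec)
    return out

long_rows = melt(wide, id_vars=["A"], value_vars=["B", "C"])
# Each input row yields len(value_vars) output rows.
```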

4 Answers

    小鲜肉 · 2020-11-22 03:14

    Came across this question in my search for an implementation of melt in Spark for Scala.

    Posting my Scala port in case someone also stumbles upon this.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.DataFrame
    /** Extends the [[org.apache.spark.sql.DataFrame]] class
     *
     *  @param df the data frame to melt
     */
    implicit class DataFrameFunctions(df: DataFrame) {
    
        /** Convert [[org.apache.spark.sql.DataFrame]] from wide to long format.
         * 
         *  melt is (kind of) the inverse of pivot
         *  melt is currently (02/2017) not implemented in spark
         *
         *  @see the reshape package in R (https://cran.r-project.org/web/packages/reshape/index.html)
         *  @see this is a scala adaptation of http://stackoverflow.com/questions/41670103/pandas-melt-function-in-apache-spark
         *  
         *  @todo method overloading for simple calling
         *
         *  @param id_vars the columns to preserve
         *  @param value_vars the columns to melt
         *  @param var_name the name for the column holding the melted columns names
         *  @param value_name the name for the column holding the values of the melted columns
         *
         */
    
        def melt(
                id_vars: Seq[String], value_vars: Seq[String], 
                var_name: String = "variable", value_name: String = "value") : DataFrame = {
    
            // Create array<struct<variable, value>>
            val _vars_and_vals = array((for (c <- value_vars) yield { struct(lit(c).alias(var_name), col(c).alias(value_name)) }): _*)
    
            // Add to the DataFrame and explode
            val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    
            val cols = id_vars.map(col _) ++ { for (x <- List(var_name, value_name)) yield { col("_vars_and_vals")(x).alias(x) }}
    
            _tmp.select(cols: _*)
    
        }
    }
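
    The code above reshapes in two steps: build an array of (variable, value) structs per row, then explode that array into one output row per element. A plain-Python sketch of those same two steps (toy data and variable names are my own; no Spark needed):

    ```python
    row = {"A": "a", "B": 1, "C": 2}
    id_vars, value_vars = ["A"], ["B", "C"]

    # Step 1: the analogue of array(struct(lit(c), col(c)) for c in value_vars) —
    # one struct per column being melted.
    vars_and_vals = [{"variable": c, "value": row[c]} for c in value_vars]

    # Step 2: the analogue of explode() — one output row per struct,
    # carrying the id columns along.
    exploded = [{**{k: row[k] for k in id_vars}, **s} for s in vars_and_vals]
    ```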
    

    Since I'm not that advanced in Scala, I'm sure there is room for improvement.

    Any comments are welcome.
