How to melt a Spark DataFrame?

日久生厌 2020-11-22 02:57

Is there an equivalent of the pandas melt function in Apache Spark, either in PySpark or at least in Scala?

I was running a sample dataset till now in python and now I want to u

4 Answers
  •  春和景丽
    2020-11-22 03:32

    I voted for user6910411's answer. It works as expected; however, it does not handle None values well, so I refactored the melt function as follows:

    from itertools import chain
    from typing import Iterable
    
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col, create_map, explode, lit
    
    def melt(
            df: DataFrame, 
            id_vars: Iterable[str], value_vars: Iterable[str], 
            var_name: str="variable", value_name: str="value") -> DataFrame:
        """Convert :class:`DataFrame` from wide to long format."""
    
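        # Build a map column {column name -> column value} over value_vars.
        # create_map takes alternating key/value columns, so the
        # [name, value] pairs are flattened into one list.
        _vars_and_vals = create_map(
            list(chain.from_iterable([
                [lit(c), col(c)] for c in value_vars]
            ))
        )
    
        # explode() on a map yields one row per entry with two columns.
        # Aliasing them here avoids renaming an id column that happens
        # to be called 'key' or 'value'.
        _tmp = df.select(
            *id_vars, explode(_vars_and_vals).alias(var_name, value_name))
    
        return _tmp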
    

    Test it with the following DataFrame:

    import pandas as pd
    
    pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                       'B': {0: 1, 1: 3, 2: 5},
                       'C': {0: 2, 1: 4, 2: 6},
                       'D': {1: 7, 2: 9}})
    
    pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C', 'D'])
    
       A variable  value
    0  a        B    1.0
    1  b        B    3.0
    2  c        B    5.0
    3  a        C    2.0
    4  b        C    4.0
    5  c        C    6.0
    6  a        D    NaN
    7  b        D    7.0
    8  c        D    9.0
    
    
    sdf = spark.createDataFrame(pdf)
    melt(sdf, id_vars=['A'], value_vars=['B', 'C', 'D']).show()
    +---+--------+-----+
    |  A|variable|value|
    +---+--------+-----+
    |  a|       B|  1.0|
    |  a|       C|  2.0|
    |  a|       D|  NaN|
    |  b|       B|  3.0|
    |  b|       C|  4.0|
    |  b|       D|  7.0|
    |  c|       B|  5.0|
    |  c|       C|  6.0|
    |  c|       D|  9.0|
    +---+--------+-----+
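
    To make the row order in the Spark output above easier to follow, here is a rough plain-Python model of the map-and-explode idea. It is only a sketch and not Spark code: melt_rows and the sample rows are hypothetical names made up for illustration. Each input row emits one output row per value column before moving on to the next row, and a missing value is kept as None rather than dropped.

```python
from typing import Any, Dict, Iterable, List


def melt_rows(
    rows: Iterable[Dict[str, Any]],
    id_vars: List[str],
    value_vars: List[str],
    var_name: str = "variable",
    value_name: str = "value",
) -> List[Dict[str, Any]]:
    """Plain-Python model of a wide-to-long melt (hypothetical helper)."""
    out = []
    for row in rows:
        # Identifier columns are copied unchanged onto every output row.
        ids = {c: row[c] for c in id_vars}
        # One output row per value column, in value_vars order, similar to
        # exploding a map built from (column name, column value) pairs.
        for c in value_vars:
            out.append({**ids, var_name: c, value_name: row.get(c)})
    return out


wide = [
    {"A": "a", "B": 1, "C": 2, "D": None},
    {"A": "b", "B": 3, "C": 4, "D": 7},
]
long_rows = melt_rows(wide, id_vars=["A"], value_vars=["B", "C", "D"])
for r in long_rows:
    print(r)
```

    Unlike pd.melt, which lists all rows for B, then C, then D, this model (like the Spark version above) groups the output by input row.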
    
