How to use groupBy to collect rows into a map?

轮回少年 asked on 2020-12-08 17:46

Context

sqlContext.sql(s"""
  SELECT
    school_name,
    name,
    age
  FROM my_table
""")

Ask

Given the resulting DataFrame of school_name, name, and age, how can I group the rows by school_name and collect name and age into a single Map[String, Int] (name -> age) per school?
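
The answers below print results for a small school A / school B dataset. Here is a minimal sketch of data matching those outputs (values inferred from the answers, not part of the original question):

import spark.implicits._

// Hypothetical sample data reconstructed from the answers' outputs
val df = Seq(
  ("school A", "michael", 7),
  ("school A", "emily", 5),
  ("school B", "cathy", 10),
  ("school B", "shaun", 5)
).toDF("school_name", "name", "age")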

3 Answers
  • 2020-12-08 18:10

    As of Spark 2.4 you can use the map_from_arrays function to achieve this.

    import org.apache.spark.sql.functions.{collect_list, map_from_arrays}
    import spark.implicits._  // for the $"..." column syntax

    val df = spark.sql("""
        SELECT *
        FROM VALUES ('s1','a',1),('s1','b',2),('s2','a',1)
        AS t(school, name, age)
    """)
    df.show()

    // Zip the collected names into map keys and the collected ages into map values
    val df2 = df.groupBy("school")
      .agg(map_from_arrays(collect_list($"name"), collect_list($"age")).as("map"))
    df2.show()

    +------+----+---+
    |school|name|age|
    +------+----+---+
    |    s1|   a|  1|
    |    s1|   b|  2|
    |    s2|   a|  1|
    +------+----+---+
    
    +------+----------------+
    |school|             map|
    +------+----------------+
    |    s2|        [a -> 1]|
    |    s1|[a -> 1, b -> 2]|
    +------+----------------+
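
    Note that this zips two independently collected lists, and Spark does not formally guarantee that the two collect_list calls see rows in the same order. A minimal order-safe sketch (assuming Spark 2.4+, where map_from_entries is available):

    import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

    // Collect (name, age) pairs as single structs so keys and values
    // can never be misaligned, then turn the struct array into a map.
    val dfSafe = df.groupBy("school")
      .agg(map_from_entries(collect_list(struct($"name", $"age"))).as("map"))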
    
  • 2020-12-08 18:11

    The following works with Spark 2.0. You can use the map function, available since the 2.0 release, to build a single-entry Map per row.

    import org.apache.spark.sql.functions.{col, collect_list, map}
    // One single-entry Map per row, collected into a list per school
    val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name", $"age")) as "map")
    df1.show(false)
    

    This gives the following output.

    +-----------+------------------------------------+
    |school_name|map                                 |
    +-----------+------------------------------------+
    |school B   |[Map(cathy -> 10), Map(shaun -> 5)] |
    |school A   |[Map(michael -> 7), Map(emily -> 5)]|
    +-----------+------------------------------------+
    

    Now you can use a UDF to merge the individual Maps into a single Map, as below.

    import org.apache.spark.sql.functions.udf

    // Flatten the single-entry Maps into one Map per school.
    // Note: toMap keeps the last entry if a name repeats within a school.
    val joinMap = udf { values: Seq[Map[String, Int]] => values.flatten.toMap }

    val df2 = df1.withColumn("map", joinMap(col("map")))
    df2.show(false)
    

    This gives the required output, with the map column typed as Map[String,Int].

    +-----------+-----------------------------+
    |school_name|map                          |
    +-----------+-----------------------------+
    |school B   |Map(cathy -> 10, shaun -> 5) |
    |school A   |Map(michael -> 7, emily -> 5)|
    +-----------+-----------------------------+
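
    As a quick sanity check, a row of the map column can be read back as a plain Scala Map (a sketch, assuming the implicit Map encoder from spark.implicits._, available since Spark 2.3):

    // Pull one school's map back to the driver as a Scala Map
    val m = df2.filter($"school_name" === "school A")
      .select($"map").as[Map[String, Int]]
      .collect().head
    // m: Map[String,Int] = Map(michael -> 7, emily -> 5)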
    

    If you want to convert the column value into a JSON string, Spark 2.1.0 introduced the to_json function.

    import org.apache.spark.sql.functions.{struct, to_json}

    val df3 = df2.withColumn("map", to_json(struct($"map")))
    df3.show(false)
    

    The to_json function returns the following output.

    +-----------+-------------------------------+
    |school_name|map                            |
    +-----------+-------------------------------+
    |school B   |{"map":{"cathy":10,"shaun":5}} |
    |school A   |{"map":{"michael":7,"emily":5}}|
    +-----------+-------------------------------+
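
    If you want just the map itself as a JSON object, without the wrapping "map" field, to_json also accepts a MapType column directly on Spark 2.4+ (a sketch under that assumption):

    // Serializes the map column straight to a JSON object string,
    // e.g. {"cathy":10,"shaun":5}
    val df4 = df2.withColumn("map", to_json($"map"))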
    
  • 2020-12-08 18:12
    import org.apache.spark.sql.functions.{collect_set, concat_ws}
    // Collect distinct "age:name" strings per school
    df.select($"school_name", concat_ws(":", $"age", $"name").as("new_col"))
      .groupBy($"school_name").agg(collect_set($"new_col")).show()

    +-----------+--------------------+
    |school_name|collect_set(new_col)|
    +-----------+--------------------+
    |   school B| [5:shaun, 10:cathy]|
    |   school A|[7:michael, 5:emily]|
    +-----------+--------------------+
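
    The result here is an array of strings, not a map. A related sketch that produces a real MapType column of name -> age using the built-in SQL function str_to_map (it returns map<string,string>, so ages come back as strings):

    import org.apache.spark.sql.functions.expr

    // Join the per-school "name:age" strings with commas, then parse the
    // whole string into a map of name -> age (both as strings).
    val mapped = df.groupBy($"school_name")
      .agg(expr("str_to_map(concat_ws(',', collect_set(concat_ws(':', name, age))), ',', ':')").as("map"))
    mapped.show(false)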
    