pyspark - create a DataFrame grouping columns into a map type structure


A slight improvement to Prem's answer (sorry, I can't comment yet):

Use func.create_map instead of func.struct; see the documentation for details.

import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# collect a list of single-entry maps, one per (Type, Amount) pair
df_converted = df.groupBy("Brand").agg(
    func.collect_list(
        func.create_map(func.col("Type"), func.col("Amount"))
    ).alias("MAP_type_AMOUNT"))

print(df_converted.collect())

Output:

[Row(Brand=u'B', MAP_type_AMOUNT=[{u'a': 10}, {u'b': 20}]),
 Row(Brand=u'C', MAP_type_AMOUNT=[{u'c': 30}])]
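
If you would rather end up with a single map per Brand instead of a list of single-entry maps, one option is map_from_entries (a minimal sketch, assuming Spark 2.4+, where that function is available):

import pyspark.sql.functions as func

# build one MapType value per Brand from the collected (Type, Amount) structs
# (func.map_from_entries requires Spark 2.4+)
df_single_map = df.groupBy("Brand").agg(
    func.map_from_entries(
        func.collect_list(func.struct(func.col("Type"), func.col("Amount")))
    ).alias("MAP_type_AMOUNT"))

df_single_map.show(truncate=False)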

You can get something like the below, but it is not exactly a map (it is an array of structs):

import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# collect a list of (Type, Amount) structs rather than maps
df_converted = df.groupBy("Brand").agg(
    func.collect_list(
        func.struct(func.col("Type"), func.col("Amount"))
    ).alias("MAP_type_AMOUNT"))

df_converted.show()

Output is:

+-----+----------------+
|Brand| MAP_type_AMOUNT|
+-----+----------------+
|    B|[[a,10], [b,20]]|
|    C|        [[c,30]]|
+-----+----------------+
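
To consume that array-of-structs column later, you can explode it back into one row per entry and select the struct fields by name (a minimal sketch reusing df_converted from above):

from pyspark.sql.functions import col, explode

# one output row per (Type, Amount) entry in the collected array
df_flat = df_converted.select(
    "Brand", explode("MAP_type_AMOUNT").alias("entry"))
df_flat = df_flat.select(
    "Brand",
    col("entry.Type").alias("Type"),
    col("entry.Amount").alias("Amount"))

df_flat.show()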

Hope this helps!
