My DataFrame has the following structure:
-------------------------
| Brand | type | amount |
-------------------------
| B     | a    | 10     |
| B     | b    | 20     |
| C     | c    | 30     |
-------------------------
I want to reduce the number of rows by grouping type and amount into a single column of type Map, so that Brand is unique and MAP_type_AMOUNT holds a key/value pair for each type/amount combination.
I think spark.sql might have some functions to help with this, or do I have to get the RDD underlying the DataFrame and make my "own" conversion to a map type?
Expected:
---------------------------
| Brand | MAP_type_AMOUNT |
---------------------------
| B     | {a: 10, b: 20}  |
| C     | {c: 30}         |
---------------------------
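For reference, here is a minimal sketch of the RDD route mentioned above (a hypothetical illustration, assuming an existing SparkSession named spark; it simply merges plain Python dicts per Brand):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the example DataFrame from the question
df = spark.createDataFrame(
    [('B', 'a', 10), ('B', 'b', 20), ('C', 'c', 30)],
    ['Brand', 'type', 'amount'])

# One {type: amount} dict per row, merged per Brand, then back to a DataFrame
df_map = (df.rdd
          .map(lambda r: (r['Brand'], {r['type']: r['amount']}))
          .reduceByKey(lambda a, b: dict(a, **b))
          .map(lambda kv: Row(Brand=kv[0], MAP_type_AMOUNT=kv[1]))
          .toDF())
df_map.show(truncate=False)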
Slight improvement on Prem's answer (sorry, I can't comment yet): use func.create_map instead of func.struct. See the documentation for pyspark.sql.functions.create_map.
import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# Collect one single-entry map per row for each Brand
df_converted = df.groupBy("Brand").agg(
    func.collect_list(
        func.create_map(func.col("Type"), func.col("Amount"))
    ).alias("MAP_type_AMOUNT"))

print(df_converted.collect())
Output:
[Row(Brand=u'B', MAP_type_AMOUNT=[{u'a': 10}, {u'b': 20}]),
Row(Brand=u'C', MAP_type_AMOUNT=[{u'c': 30}])]
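If you need one actual MapType value per Brand (matching the expected {a: 10, b: 20} exactly) rather than a list of single-entry maps, a possible variation, assuming Spark 2.4 or later, is to collect (Type, Amount) structs and fold them with map_from_entries. This is a sketch, not part of the original answers:

import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# Collect (Type, Amount) structs per Brand, then turn the array into a single
# map column; map_from_entries requires Spark 2.4+.
df_single_map = df.groupBy("Brand").agg(
    func.map_from_entries(
        func.collect_list(func.struct(func.col("Type"), func.col("Amount")))
    ).alias("MAP_type_AMOUNT"))
df_single_map.show(truncate=False)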
You can have something like the below, but it is not exactly a Map:
import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20), ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# Collect a list of (Type, Amount) structs per Brand
df_converted = df.groupBy("Brand").agg(
    func.collect_list(func.struct(func.col("Type"), func.col("Amount"))).alias("MAP_type_AMOUNT"))
df_converted.show()
Output is:
+-----+----------------+
|Brand| MAP_type_AMOUNT|
+-----+----------------+
| B|[[a,10], [b,20]]|
| C| [[c,30]]|
+-----+----------------+
Hope this helps!
Source: https://stackoverflow.com/questions/45532183/pyspark-create-dataframe-grouping-columns-in-map-type-structure