Question
Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example:
import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
toy_data = spark.createDataFrame([
    Row(id=1, key='a', value="123"),
    Row(id=1, key='b', value="234"),
    Row(id=1, key='c', value="345"),
    Row(id=2, key='a', value="12"),
    Row(id=2, key='x', value="23"),
    Row(id=2, key='y', value="123")])
toy_data.show()
+---+---+-----+
| id|key|value|
+---+---+-----+
| 1| a| 123|
| 1| b| 234|
| 1| c| 345|
| 2| a| 12|
| 2| x| 23|
| 2| y| 123|
+---+---+-----+
and this is the expected output:
---+------------------------------------
id | key_value
---+------------------------------------
1 | {"a": "123", "b": "234", "c": "345"}
2 | {"a": "12", "x": "23", "y": "123"}
---+------------------------------------
I tried this, but it doesn't work:
toy_data.groupBy("id").agg(
F.create_map(col("key"),col("value")).alias("key_value")
)
This yields the following error:
AnalysisException: u"expression '`key`' is neither present in the group by, nor is it an aggregate function....
Answer 1:
The agg component has to contain an actual aggregation function. One way to approach this is to combine collect_list ("Aggregate function: returns a list of objects with duplicates."), struct ("Creates a new struct column."), and map_from_entries ("Collection function: Returns a map created from the given array of entries.").
This is how you'd do that:
toy_data.groupBy("id").agg(
F.map_from_entries(
F.collect_list(
F.struct("key", "value"))).alias("key_value")
).show(truncate=False)
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[a -> 12, x -> 23, y -> 123] |
+---+------------------------------+
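If you need the literal JSON strings shown in the expected output rather than a MapType column, F.to_json can serialize the map (to_json supports MapType on Spark 2.4+, the same version that introduced map_from_entries). A sketch:
toy_data.groupBy("id").agg(
    F.to_json(
        F.map_from_entries(
            F.collect_list(
                F.struct("key", "value")))).alias("key_value")
).show(truncate=False)
# key_value is now a string column, e.g. {"a":"123","b":"234","c":"345"}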
Answer 2:
For pyspark < 2.4.0, where pyspark.sql.functions.map_from_entries is not available, you can write your own UDF instead:
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

# Each collected element is a two-field struct, which Python receives as
# a (key, value) pair, so dict() turns the whole list into a map.
@F.udf(returnType=MapType(StringType(), StringType()))
def map_array(column):
    return dict(column)

(toy_data.groupBy("id")
 .agg(F.collect_list(F.struct("key", "value")).alias("key_value"))
 .withColumn('key_value', map_array('key_value'))
 .show(truncate=False))
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[x -> 23, a -> 12, y -> 123] |
+---+------------------------------+
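Whichever variant you use, the resulting MapType column can be queried like a dictionary, and collecting it gives plain Python dicts on the driver. A short usage sketch, assuming the UDF approach above (the names result and id_to_dict are just for illustration):
result = (toy_data.groupBy("id")
          .agg(F.collect_list(F.struct("key", "value")).alias("key_value"))
          .withColumn('key_value', map_array('key_value')))

# Look up a single key inside Spark SQL; missing keys come back as null.
result.select("id", result["key_value"].getItem("a").alias("a")).show()

# Or pull everything back to the driver as a {id: dict} mapping.
id_to_dict = {row["id"]: row["key_value"] for row in result.collect()}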
Source: https://stackoverflow.com/questions/55308482/pyspark-create-dictionary-within-groupby