How to store grouped data into JSON in PySpark


If your joined dataframe looks like this:

gender  age
M       5
F       50
M       10
M       10
F       10
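
For reference, a minimal sketch that builds this sample dataframe (the SparkSession setup is an assumption; the variable name joinedDF matches the snippet below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above
joinedDF = spark.createDataFrame(
    [("M", 5), ("F", 50), ("M", 10), ("M", 10), ("F", 10)],
    ["gender", "age"])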

You can then use the code below to get the desired output:

from pyspark.sql.functions import collect_list

# collect all ages per gender into a list, then write one JSON object per line
joinedDF.groupBy("gender") \
    .agg(collect_list("age").alias("ages")) \
    .write.json("jsonOutput.txt")

The output would look like this:

{"gender":"F","ages":[50,10]}
{"gender":"M","ages":[5,10,10]}

If you have multiple columns, such as name and salary, you can add more collect_list aggregations:

df.groupBy("gender") \
    .agg(collect_list("age").alias("ages"),
         collect_list("name").alias("names")) \
    .write.json("jsonOutput.txt")

Your output would look like:

{"gender":"F","ages":[50,10],"names":["ankit","abhay"]}
{"gender":"M","ages":[5,10,10],"names":["snchit","mohit","rohit"]}

You cannot use GroupedData directly; it has to be aggregated first. Aggregating with built-in functions like collect_list covers part of the problem, but it is simply not possible to achieve the desired output, with values used as the JSON keys, using DataFrameWriter.

You can try something like this instead:

from pyspark.sql import Row
from pyspark.sql.functions import struct
import json

keys = ["gender"]        # columns whose values become the JSON keys (chosen for the sample data)
values = struct("age")   # value columns, wrapped in a struct so each value is a Row

def make_json(kvs):
    k, vs = kvs
    # k is the key struct, vs is the iterable of grouped value dicts
    return json.dumps({k[0]: list(vs)})

(df.select(struct(*keys), values)
    .rdd
    .mapValues(Row.asDict)   # turn each value Row into a plain dict
    .groupByKey()            # group the dicts by their key struct
    .map(make_json))

and then call saveAsTextFile on the result to write it out.
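
With keys = ["gender"] and values = struct("age") as in the snippet above, each output line would look roughly like:

{"M": [{"age": 5}, {"age": 10}, {"age": 10}]}
{"F": [{"age": 50}, {"age": 10}]}

Note the difference from the first answer: here the gender value itself becomes the JSON key, which is exactly what DataFrameWriter cannot produce.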
