PySpark explode JSON string

Submitted by 拜拜、爱过 on 2021-01-29 08:04:04

Question


Input_dataframe

id  name     collection
111 aaaaa    {"1":{"city":"city_1","state":"state_1","country":"country_1"},
              "2":{"city":"city_2","state":"state_2","country":"country_2"},
              "3":{"city":"city_3","state":"state_3","country":"country_3"}
             }
222 bbbbb    {"1":{"city":"city_1","state":"state_1","country":"country_1"},
              "2":{"city":"city_2","state":"state_2","country":"country_2"},
              "3":{"city":"city_3","state":"state_3","country":"country_3"}
              }

Here, the column types are:

id ==> string
name ==> string
collection ==> string (string representation of JSON data)

I want something like this

output_dataframe

id  name   key  value
111 aaaaa  "1"  {"city":"city_1","state":"state_1","country":"country_1"}
111 aaaaa  "2"  {"city":"city_2","state":"state_2","country":"country_2"}
111 aaaaa  "3"  {"city":"city_3","state":"state_3","country":"country_3"}
222 bbbbb  "1"  {"city":"city_1","state":"state_1","country":"country_1"}
222 bbbbb  "2"  {"city":"city_2","state":"state_2","country":"country_2"}
222 bbbbb  "3"  {"city":"city_3","state":"state_3","country":"country_3"}

If my collection column were a map or array type, the explode function would do the job. But my collection column is a string type (JSON data).

How can I get output_dataframe?

Please let me know

NOTE: the collection attribute may have a nested and unpredictable schema, e.g.:

{
  "1":{"city":"city_1","state":"state_1","country":"country_1"},
  "2":{"city":"city_2","state":"state_2","country":"country_2","a":{"aa":"111"}},
  "3":{"city":"city_3","state":"state_3"}
}

Answer 1:


The from_json function will do the job: it parses the string into a map or struct column, and you can then use explode on the result.




Answer 2:


Provide the JSON schema explicitly and convert the string into a struct column with from_json; then zip the keys and values together and explode.

import pyspark.sql.functions as f
from pyspark.sql.types import *

schema = StructType([
    StructField('1', StructType([
        StructField('city', StringType(), True),
        StructField('state', StringType(), True),
        StructField('country', StringType(), True),
    ]), True),
    StructField('2', StructType([
        StructField('city', StringType(), True),
        StructField('state', StringType(), True),
        StructField('country', StringType(), True),
    ]), True),
    StructField('3', StructType([
        StructField('city', StringType(), True),
        StructField('state', StringType(), True),
        StructField('country', StringType(), True),
    ]), True),
])



# parse the JSON string into a struct column using the schema above
df2 = df.withColumn('collection', f.from_json('collection', schema))
# the top-level struct field names are the keys ('1', '2', '3')
cols = df2.select('collection.*').columns

# zip an array of the keys with an array of the values, explode the zipped
# array into one row per pair, then split each pair into key/value columns
df2.withColumn('collection', f.arrays_zip(f.array(*map(lambda x: f.lit(x), cols)), f.array('collection.*'))) \
   .withColumn('collection', f.explode('collection')) \
   .withColumn('key', f.col('collection.0')) \
   .withColumn('value', f.col('collection.1')) \
   .drop('collection').show(10, False)


+---+-----+---+----------------------------+
|id |name |key|value                       |
+---+-----+---+----------------------------+
|111|aaaaa|1  |[city_1, state_1, country_1]|
|111|aaaaa|2  |[city_2, state_2, country_2]|
|111|aaaaa|3  |[city_3, state_3, country_3]|
|222|bbbbb|1  |[city_1, state_1, country_1]|
|222|bbbbb|2  |[city_2, state_2, country_2]|
|222|bbbbb|3  |[city_3, state_3, country_3]|
+---+-----+---+----------------------------+



Answer 3:


Here is a hacky solution (not ideal, as it drops down to the underlying RDD), but I have tested it on a scenario where the schema is inconsistent and it seems to be robust:

import json
from pyspark.sql import Row

rdd1 = df.rdd

# parse the JSON string (json.loads is safer than eval, and handles
# true/false/null correctly), then rebuild each Row from the dict
(rdd1.map(lambda x: [(key, val) if key != 'collection' else (key, json.loads(val))
                     for key, val in x.asDict().items()])
     .map(lambda x: Row(**dict(x)))
     .toDF().show())


Source: https://stackoverflow.com/questions/63518774/pyspark-explode-json-string
