Question
Input_dataframe
id name collection
111 aaaaa {"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2"},
"3":{"city":"city_3","state":"state_3","country":"country_3"}
}
222 bbbbb {"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2"},
"3":{"city":"city_3","state":"state_3","country":"country_3"}
}
Here the column types are:
id ==> string
name ==> string
collection ==> string (a string representation of JSON data)
I want something like this:
output_dataframe
id name key value
111 aaaaa "1" {"city":"city_1","state":"state_1","country":"country_1"}
111 aaaaa "2" {"city":"city_2","state":"state_2","country":"country_2"}
111 aaaaa "3" {"city":"city_3","state":"state_3","country":"country_3"}
222 bbbbb "1" {"city":"city_1","state":"state_1","country":"country_1"}
222 bbbbb "2" {"city":"city_2","state":"state_2","country":"country_2"}
222 bbbbb "3" {"city":"city_3","state":"state_3","country":"country_3"}
If the collection attribute were of map or array type, the explode function would do the job. But collection is a string (JSON data), so how can I get output_dataframe?
NOTE: the collection attribute may have a nested and unpredictable schema, for example:
{
"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2","a":
{"aa":"111"}},
"3":{"city":"city_3","state":"state_3"}
}
Answer 1:
The from_json function will do the job: it converts the string into a complex type, and then you can use explode.
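A minimal sketch of that approach (my code, not part of the original answer), assuming df is the input DataFrame from the question. Parsing into MapType(StringType(), StringType()) also copes with the nested, unpredictable schema from the NOTE, because Spark keeps each nested object as its raw JSON string:
import pyspark.sql.functions as f
from pyspark.sql.types import MapType, StringType

# parse the JSON string into a map; with string-typed values, nested
# objects are kept as raw JSON strings, so no fixed schema is needed
parsed = df.withColumn('collection',
                       f.from_json('collection', MapType(StringType(), StringType())))
# exploding a map yields one row per entry, with key and value columns
parsed.select('id', 'name', f.explode('collection').alias('key', 'value')) \
      .show(truncate=False)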
Answer 2:
Define the JSON schema explicitly and use from_json to build a struct column from the JSON string; then zip the keys with the values and explode.
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema = StructType([
StructField('1', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
StructField('2', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
StructField('3', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
])
# parse the JSON string into a struct column using the schema above
df2 = df.withColumn('collection', f.from_json('collection', schema))
# the top-level keys of the struct: ['1', '2', '3']
cols = df2.select('collection.*').columns
# zip each key with its value, explode the zipped array, then split
# the resulting struct into separate key and value columns
df2.withColumn('collection', f.arrays_zip(f.array(*map(f.lit, cols)), f.array('collection.*'))) \
   .withColumn('collection', f.explode('collection')) \
   .withColumn('key', f.col('collection.0')) \
   .withColumn('value', f.col('collection.1')) \
   .drop('collection').show(10, False)
+---+-----+---+----------------------------+
|id |name |key|value |
+---+-----+---+----------------------------+
|111|aaaaa|1 |[city_1, state_1, country_1]|
|111|aaaaa|2 |[city_2, state_2, country_2]|
|111|aaaaa|3 |[city_3, state_3, country_3]|
|222|bbbbb|1 |[city_1, state_1, country_1]|
|222|bbbbb|2 |[city_2, state_2, country_2]|
|222|bbbbb|3 |[city_3, state_3, country_3]|
+---+-----+---+----------------------------+
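Note that this approach requires the top-level keys ('1', '2', '3') to be known in advance and baked into the schema; for the nested, unpredictable schema mentioned in the question's NOTE, the MapType sketch under Answer 1 is the more flexible option.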
Answer 3:
Here is a hacky solution (not ideal, as it uses the underlying RDD), but I have tested it on the scenario where the schema is inconsistent and it seems to be robust:
from pyspark.sql import Row

rdd1 = df.rdd
# eval() turns the JSON string into a Python dict; all other
# columns pass through unchanged
(rdd1.map(lambda x: [(key, val) if key != 'collection' else (key, eval(val))
                     for key, val in x.asDict().items()])
     .map(lambda x: Row(**dict(x)))
     .toDF().show())
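A small hardening of the same idea (my variant, not from the original answer): json.loads is safer than eval and also handles JSON literals such as true, false and null:
import json
from pyspark.sql import Row

# same approach, but parsing with json.loads instead of eval
(df.rdd
   .map(lambda row: Row(**{**row.asDict(), 'collection': json.loads(row['collection'])}))
   .toDF().show())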
Source: https://stackoverflow.com/questions/63518774/pyspark-explode-json-string