Question
Input_dataframe
id name collection
111 aaaaa {"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2"},
"3":{"city":"city_3","state":"state_3","country":"country_3"}
}
222 bbbbb {"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2"},
"3":{"city":"city_3","state":"state_3","country":"country_3"}
}
Here the column types are:
id ==> string
name ==> string
collection ==> string (a string representation of JSON data)
I want something like this:
output_dataframe
id name key value
111 aaaaa "1" {"city":"city_1","state":"state_1","country":"country_1"}
111 aaaaa "2" {"city":"city_2","state":"state_2","country":"country_2"}
111 aaaaa "3" {"city":"city_3","state":"state_3","country":"country_3"}
222 bbbbb "1" {"city":"city_1","state":"state_1","country":"country_1"}
222 bbbbb "2" {"city":"city_2","state":"state_2","country":"country_2"}
222 bbbbb "3" {"city":"city_3","state":"state_3","country":"country_3"}
If the collection attribute were of map or array type, the explode function would do the job. But collection is a string (JSON data), so how can I get output_dataframe?
NOTE: the collection attribute may have a nested and unpredictable schema, for example:
{
"1":{"city":"city_1","state":"state_1","country":"country_1"},
"2":{"city":"city_2","state":"state_2","country":"country_2","a":
{"aa":"111"}},
"3":{"city":"city_3","state":"state_3"}
}
Answer 1:
The from_json function will do the job: it converts the string into a complex type, and then you can use explode.
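A minimal sketch of that approach (my code, not part of the original answer), assuming df is the input DataFrame from the question. Parsing into MapType(StringType(), StringType()) also copes with the nested, unpredictable schema from the NOTE, because Spark keeps each nested object as its raw JSON string:
import pyspark.sql.functions as f
from pyspark.sql.types import MapType, StringType

# parse the JSON string into a map; with string-typed values, nested
# objects are kept as raw JSON strings, so no fixed schema is needed
parsed = df.withColumn('collection',
                       f.from_json('collection', MapType(StringType(), StringType())))
# exploding a map yields one row per entry, with key and value columns
parsed.select('id', 'name', f.explode('collection').alias('key', 'value')) \
      .show(truncate=False)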
Answer 2:
Define the JSON schema explicitly and use from_json to build a struct column from the JSON string; then zip the keys with the values and explode.
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema = StructType([
StructField('1', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
StructField('2', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
StructField('3', StructType([
StructField('city', StringType(), True),
StructField('state', StringType(), True),
StructField('country', StringType(), True),
]), True),
])
# parse the JSON string into a struct column using the schema above
df2 = df.withColumn('collection', f.from_json('collection', schema))
# the top-level keys of the struct: ['1', '2', '3']
cols = df2.select('collection.*').columns
# zip each key with its value, explode the zipped array, then split
# the resulting struct into separate key and value columns
df2.withColumn('collection', f.arrays_zip(f.array(*map(f.lit, cols)), f.array('collection.*'))) \
   .withColumn('collection', f.explode('collection')) \
   .withColumn('key', f.col('collection.0')) \
   .withColumn('value', f.col('collection.1')) \
   .drop('collection').show(10, False)
+---+-----+---+----------------------------+
|id |name |key|value |
+---+-----+---+----------------------------+
|111|aaaaa|1 |[city_1, state_1, country_1]|
|111|aaaaa|2 |[city_2, state_2, country_2]|
|111|aaaaa|3 |[city_3, state_3, country_3]|
|222|bbbbb|1 |[city_1, state_1, country_1]|
|222|bbbbb|2 |[city_2, state_2, country_2]|
|222|bbbbb|3 |[city_3, state_3, country_3]|
+---+-----+---+----------------------------+
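Note that this approach requires the top-level keys ('1', '2', '3') to be known in advance and baked into the schema; for the nested, unpredictable schema mentioned in the question's NOTE, the MapType sketch under Answer 1 is the more flexible option.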
Answer 3:
Here is a hacky solution (not ideal, as it uses the underlying RDD), but I have tested it on the scenario where the schema is inconsistent and it seems to be robust:
from pyspark.sql import Row

rdd1 = df.rdd
# eval() turns the JSON string into a Python dict; all other
# columns pass through unchanged
(rdd1.map(lambda x: [(key, val) if key != 'collection' else (key, eval(val))
                     for key, val in x.asDict().items()])
     .map(lambda x: Row(**dict(x)))
     .toDF().show())
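A small hardening of the same idea (my variant, not from the original answer): json.loads is safer than eval and also handles JSON literals such as true, false and null:
import json
from pyspark.sql import Row

# same approach, but parsing with json.loads instead of eval
(df.rdd
   .map(lambda row: Row(**{**row.asDict(), 'collection': json.loads(row['collection'])}))
   .toDF().show())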
Source: https://stackoverflow.com/questions/63518774/pyspark-explode-json-string