Question
I'm currently dealing with the following source data in a JSON file:
{
    "unique_key_1": {
        "some_value_1": 1,
        "some_value_2": 2
    },
    "unique_key_2": {
        "some_value_1": 2,
        "some_value_2": 3
    },
    "unique_key_3": {
        "some_value_1": 2,
        "some_value_2": 1
    }
    ...
}
Note that the source data is effectively one large dictionary with lots of unique keys; it is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark:
key | some_value_1 | some_value_2
-------------------------------------------
unique_key_1 | 1 | 2
unique_key_2 | 2 | 3
unique_key_3 | 2 | 1
If I were dealing with small files, I could simply parse this using code similar to:
[{**{"key": k}, **v} for (k, v) in source_dict.items()]
Then, I would create a Spark DataFrame on this list and continue on with the rest of the operations I need to do.
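For reference, a minimal sketch of that small-file approach (assuming the whole file fits in driver memory and using the standard json module) would be:

import json

# Small-file approach: load the whole object on the driver,
# flatten it into a list of row dictionaries, then build the DataFrame.
with open("source_dict.json") as fh:
    source_dict = json.load(fh)

rows = [{**{"key": k}, **v} for (k, v) in source_dict.items()]
df = spark.createDataFrame(rows)  # columns: key, some_value_1, some_value_2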
My problem is that I can't quite figure out how to parse a large JSON object like this into a DataFrame. When I use spark.read.json("source_dict.json"), I get a DataFrame with one row where each of the unique_key values is (predictably) read in as a column. Note that the real data files could have tens of thousands of these keys.
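(For illustration only: because the whole file is a single JSON object, reading it directly needs the multiLine option, and the inferred one-row schema would look roughly like this.)

df = spark.read.json("source_dict.json", multiLine=True)
df.printSchema()
# root
#  |-- unique_key_1: struct (nullable = true)
#  |    |-- some_value_1: long (nullable = true)
#  |    |-- some_value_2: long (nullable = true)
#  |-- unique_key_2: struct (nullable = true)
#  ... one column per unique key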
I'm fairly new to the Spark world, and I can't seem to find a way to accomplish this task. It seems like a pivot or something like that would help. Does anyone have any solutions or pointers to possible solutions? Thanks, I appreciate it!
Answer 1:
Using flatMap, you can write a function to perform the transformation:
from pyspark.sql import Row

def f(row):
    # The single input row has one struct column per unique key;
    # turn each key/struct pair into its own Row.
    l = []
    d = row.asDict()
    for k in d.keys():
        l.append(Row(k, d[k][0], d[k][1]))
    return Row(*l)

# df is the one-row DataFrame from spark.read.json("source_dict.json")
rdd = df.rdd.flatMap(f)
spark.createDataFrame(rdd).show()
+------------+---+---+
| _1| _2| _3|
+------------+---+---+
|unique_key_1| 1| 2|
|unique_key_2| 2| 3|
|unique_key_3| 2| 1|
+------------+---+---+
For additional info you can see this link
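As a side note (not part of the original answer), if you want the columns named as in the question rather than _1/_2/_3, one option is to supply the names when creating the DataFrame:

# Same RDD as above, just passing explicit column names.
spark.createDataFrame(rdd, ["key", "some_value_1", "some_value_2"]).show()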
Answer 2:
The easiest way to get the keys into a separate column is to restructure the JSON before reading the data into Spark (a restructuring sketch follows the example below). You would get the desired result out of the box if the JSON were structured like this:
[
    {
        "key": "unique_key_1",
        "some_value_1": 1,
        "some_value_2": 2
    },
    {
        "key": "unique_key_2",
        "some_value_1": 2,
        "some_value_2": 3
    },
    {
        "key": "unique_key_3",
        "some_value_1": 2,
        "some_value_2": 1
    }
]
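A minimal restructuring sketch, assuming the files are small enough to rewrite on the driver with plain Python (the restructure_json helper and the output path are hypothetical, not part of the original answer):

import json

def restructure_json(in_path, out_path):
    # Hypothetical helper: turn the top-level dict into a list of records
    # so that spark.read.json picks up "key" as an ordinary column.
    with open(in_path) as fh:
        source = json.load(fh)
    records = [{"key": k, **v} for k, v in source.items()]
    with open(out_path, "w") as fh:
        json.dump(records, fh)

restructure_json("source_dict.json", "restructured.json")
df = spark.read.json("restructured.json", multiLine=True)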
If you don't have control over the JSON, you could use the from_json column function together with explode. First read the JSON as single-row, single-column text, then use from_json to parse that text:
json_schema = MapType(StringType(), StringType())
df.withColumn("json", from_json(col('text'), json_schema)) # expand into key-value column
Then, explode the keys of the newly created object into separate rows:
.select(explode(col('json'))) # make a row for each key in the json
Finally, you can do the same for unpacking the values and selecting them into separate columns. Here's a small demo to put it all together:
from pyspark.sql.types import *
from pyspark.sql.functions import *
text_schema = StructType([StructField('text', StringType(), True)])
json_schema = MapType(StringType(), StringType())
data = """{
"unique_key_1": {
"some_value_1": 1,
"some_value_2": 2
},
"unique_key_2": {
"some_value_1": 2,
"some_value_2": 3
},
"unique_key_3": {
"some_value_1": 2,
"some_value_2": 1
}
}
"""
df = (spark.createDataFrame([(data,)], schema=text_schema) # read dataframe
.withColumn("json", from_json(col('text'), json_schema)) # expand into key-value column
.select(explode(col('json'))) # make a row for each key in the json
.withColumn("value", from_json(col('value'), json_schema)) # now interpret the value for each key as json also
.withColumn("some_value_1", col("value.some_value_1")) # unpack the object into separate rows
.withColumn("some_value_2", col("value.some_value_2"))
.drop('value')
)
display(df)
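Note that display is Databricks-specific; outside a Databricks notebook you could call df.show() instead, which (since the map schema reads the inner values as strings) should print something like this, although the row order is not guaranteed:

df.show()
# +------------+------------+------------+
# |         key|some_value_1|some_value_2|
# +------------+------------+------------+
# |unique_key_1|           1|           2|
# |unique_key_2|           2|           3|
# |unique_key_3|           2|           1|
# +------------+------------+------------+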
Source: https://stackoverflow.com/questions/58827951/parsing-json-object-with-large-number-of-unique-keys-not-a-list-of-objects-usi