Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

Submitted by 我是研究僧i on 2021-02-08 07:20:57

Question


I'm currently dealing with the following source data in a JSON file:

{
    "unique_key_1": {
        "some_value_1": 1,
        "some_value_2": 2
    },
    "unique_key_2": {
        "some_value_1": 2,
        "some_value_2": 3
    },
    "unique_key_3": {
        "some_value_1": 2,
        "some_value_2": 1
    },
    ...
}

Note that the source data is effectively one large dictionary with many unique keys; it is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark:

key          | some_value_1 | some_value_2
-------------------------------------------
unique_key_1 |            1 |            2
unique_key_2 |            2 |            3
unique_key_3 |            2 |            1

If I were dealing with small files, I could simply parse them with code similar to:

[{"key": k, **v} for k, v in source_dict.items()]

Then, I would create a Spark DataFrame on this list and continue on with the rest of the operations I need to do.
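For reference, a minimal sketch of that small-file approach (assuming the whole file fits in driver memory, spark is an existing SparkSession, and the path is hypothetical):

import json

with open("source_dict.json") as fh:  # hypothetical local path
    source_dict = json.load(fh)

rows = [{"key": k, **v} for k, v in source_dict.items()]  # one flat dict per unique key
small_df = spark.createDataFrame(rows)  # columns: key, some_value_1, some_value_2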

My problem is that I can't quite figure out how to parse a large JSON object like this into a DataFrame. When I use spark.read.json("source_dict.json"), I get a DataFrame with one row, where each of the unique keys is (predictably) read in as a column. Note that the real data files could have tens of thousands of these keys.

I'm fairly new to the Spark world, and I can't seem to find a way to accomplish this task. It seems like a pivot or something like that would help. Does anyone have any solutions or pointers to possible solutions? Thanks, I appreciate it!


Answer 1:


Using flatMap, you can write a function to perform the transformation:

from pyspark.sql import Row

def f(row):
    # The single input row has one struct-valued column per unique key;
    # emit one output Row per key: (key, some_value_1, some_value_2).
    d = row.asDict()
    l = []
    for k in d.keys():
        l.append(Row(k, d[k][0], d[k][1]))
    return l

rdd = df.rdd.flatMap(f)  # df is the one-row DataFrame from spark.read.json
spark.createDataFrame(rdd).show()


+------------+---+---+
|          _1| _2| _3|
+------------+---+---+
|unique_key_1|  1|  2|
|unique_key_2|  2|  3|
|unique_key_3|  2|  1|
+------------+---+---+
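To get meaningful column names instead of the default _1, _2, _3, you can pass names when building the DataFrame; a small sketch, assuming the same rdd as above:

spark.createDataFrame(rdd, ["key", "some_value_1", "some_value_2"]).show()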

For additional info, see this link.




Answer 2:


The easiest way to get the keys into a separate column is to restructure the JSON before reading the data into Spark. You would get the desired result out of the box if the JSON were structured like this:

[
    {"key":"unique_key_1",
        "some_value_1": 1,
        "some_value_2": 2
    },
    {"key":"unique_key_2",
        "some_value_1": 2,
        "some_value_2": 3
    },
    {"key":"unique_key_3",
        "some_value_1": 2,
        "some_value_2": 1
    }
]
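If you do control a preprocessing step, a small script could perform that restructuring; a sketch, assuming each file fits in memory (paths are hypothetical):

import json

with open("source_dict.json") as fh:  # hypothetical input path
    source_dict = json.load(fh)

with open("restructured.json", "w") as fh:  # hypothetical output path
    for k, v in source_dict.items():
        fh.write(json.dumps({"key": k, **v}) + "\n")  # JSON Lines, which spark.read.json expects by default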

If you don't have control over the JSON, you can use the from_json column function together with explode. First read the JSON as single-row, single-column text, and then parse it, as sketched below.
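A sketch of that first read (wholetext=True makes each input file a single row; spark.read.text names its column value by default, so rename it to match the snippets below):

df = (spark.read.text("source_dict.json", wholetext=True)
        .withColumnRenamed("value", "text"))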

Then use from_json to parse the text column:

json_schema = MapType(StringType(), StringType())
df.withColumn("json", from_json(col('text'), json_schema)) # expand into a key-value map column

Then, explode the keys of the newly created object into separate rows:

.select(explode(col('json'))) # make a row for each key in the json

Finally, you can do the same to unpack the values into separate columns. Here's a small demo putting it all together:

from pyspark.sql.types import *
from pyspark.sql.functions import *

text_schema = StructType([StructField('text', StringType(), True)])
json_schema = MapType(StringType(), StringType())

data = """{
    "unique_key_1": {
        "some_value_1": 1,
        "some_value_2": 2
    },
    "unique_key_2": {
        "some_value_1": 2,
        "some_value_2": 3
    },
    "unique_key_3": {
        "some_value_1": 2,
        "some_value_2": 1
    }
}
"""

df = (spark.createDataFrame([(data,)], schema=text_schema) # create a one-row DataFrame holding the raw text
  .withColumn("json", from_json(col('text'), json_schema)) # expand into a key-value map column
  .select(explode(col('json'))) # make a row for each key in the json
  .withColumn("value", from_json(col('value'), json_schema)) # now interpret the value for each key as json also
  .withColumn("some_value_1", col("value.some_value_1")) # unpack the object into separate columns
  .withColumn("some_value_2", col("value.some_value_2"))
  .drop('value')
     )

df.show() # display(df) also works on Databricks
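This should produce the structure asked for in the question:

+------------+------------+------------+
|         key|some_value_1|some_value_2|
+------------+------------+------------+
|unique_key_1|           1|           2|
|unique_key_2|           2|           3|
|unique_key_3|           2|           1|
+------------+------------+------------+

Note that some_value_1 and some_value_2 come out as strings here, since the inner schema is MapType(StringType(), StringType()); cast them if you need numeric columns.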


Source: https://stackoverflow.com/questions/58827951/parsing-json-object-with-large-number-of-unique-keys-not-a-list-of-objects-usi
