Reading pretty print json files in Apache Spark

Submitted by 前提是你 on 2021-02-07 13:50:22

Question


I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed: each file holds one massive dictionary, but it is spread across many lines rather than sitting on a single line. As per this thread, each record in a JSON file should occupy one line, which is a limitation of Apache Spark's default JSON reader. My files are not structured that way.

My JSON schema looks like this -

{
    "dataset": [
        {
            "key1": [
                {
                    "range": "range1",
                    "value": 0.0
                },
                {
                    "range": "range2",
                    "value": 0.23
                }
            ]
        }, {..}, {..}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions -

  1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read it?

  2. If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket. The bucket is partitioned by day.

  3. Is there any other tool better suited to query these files other than Apache Spark? I'm on AWS stack so can try out any other suggested tool with Zeppelin notebook.


Answer 1:


You could use sc.wholeTextFiles(). Here is a related post.
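A minimal sketch of that approach: wholeTextFiles() yields (path, content) pairs, so a plain-Python parser applied via flatMap can turn each pretty-printed file into individual records. The Spark calls are shown only as comments (they assume a live SparkContext sc and a hypothetical bucket path); the parsing step itself needs no Spark and runs standalone here, using the "dataset" key from the schema in the question:

```python
import json

def parse_whole_file(content):
    # Parse one pretty-printed file in its entirety and return its records.
    # Assumes the layout from the question: a top-level "dataset" list.
    doc = json.loads(content)
    return doc["dataset"]

# With Spark, this would be applied roughly as (sketch, not executed here):
#   rdd = sc.wholeTextFiles("s3://my-bucket/day=2016-09-08/")  # hypothetical path
#   records = rdd.flatMap(lambda pair: parse_whole_file(pair[1]))

# Standalone check against the schema from the question:
sample = """{
    "dataset": [
        {"key1": [{"range": "range1", "value": 0.0}]}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}"""
records = parse_whole_file(sample)
```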

Alternatively, you could reformat your JSON with a simple function and load the generated file instead.

import json

def reformat_json(input_path, output_path):
    """Rewrite a pretty-printed JSON array as one record per line (JSON Lines)."""
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)  # assumes the top-level value is a JSON array

    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
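A quick self-contained round trip of that conversion, with the logic inlined (assumes the input file holds a top-level JSON array; for the single-dictionary schema in the question you would extract doc["dataset"] first; file names here are temporary):

```python
import json
import os
import tempfile

# Write a pretty-printed JSON array to a temporary input file.
src = tempfile.NamedTemporaryFile('w', suffix='.json', delete=False)
json.dump([{"range": "range1", "value": 0.0},
           {"range": "range2", "value": 0.23}], src, indent=4)
src.close()

# Convert: load the whole file, then emit one record per line (JSON Lines).
dst_path = src.name + '.jsonl'
with open(src.name) as handle:
    jarr = json.load(handle)
with open(dst_path, 'w') as out:
    for entry in jarr:
        out.write(json.dumps(entry) + "\n")

# Read the result back; each line is now an independent JSON record.
with open(dst_path) as handle:
    lines = handle.read().splitlines()

os.unlink(src.name)
os.unlink(dst_path)
```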


Source: https://stackoverflow.com/questions/39453769/reading-pretty-print-json-files-in-apache-spark
