Stream Parse Huge JSON file into small files

Submitted by 狂风中的少年 on 2021-02-10 06:15:25

Question


I have around 96 gzip files of JSON, amounting to over 350 GB of JSON after unzipping, with the following structure:

{
  "structe": {},
  "beta": {},
  "flow": {
    "1023": {
      "0101": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      },
      "0102": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      }
    },
    "1024": {
      "0103": {
        "-LEjllNyHqdHYGntO6vu": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736192996
        }
      }
    }
  }
}

I can't load this into RAM. I want to stream the file and pull each path flow -> 1023 (call it id1) -> 0101 (call it id2) into a new id1_id2.json file. Any thoughts on how to do this quickly? The output I am looking for is a file named 1023_0101.json containing:

{
  "-LEjllNyHqdHYGntO6vu": {
    "status": "1",
    "t": 1528736191996
  },
  "-LEjllcXKaVOQu3BDpHF": {
    "status": "1",
    "t": 1528736192996
  }
}

Answer 1:


Here's a solution that uses jq's streaming parser to produce a stream consisting of $id1, $id2, and the corresponding value of interest; this stream can then be piped into another tool (e.g. awk if that's convenient) to produce the desired files.

In the following, we use atomize from the jq cookbook:

  def atomize(s):
    fromstream(foreach s as $in ( {previous:null, emit: null};
      if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null
      then {emit: [[.previous]], previous: $in|.[0][0]}
      else { previous: ($in|.[0][0]), emit: null}
      end;
      (.emit // empty), $in) ) ;

The main jq program (invoked with --stream -n -c) is then simply:

atomize(inputs)
| select(type == "object" and .flow)
| .flow
| keys_unsorted[] as $id1
| (.[$id1] | keys_unsorted[]) as $id2
| $id1, $id2, .[$id1][$id2]

So for each gzip file, $gz, the pipeline would look like this:

gunzip -c $gz | jq -nc --stream -f program.jq | awk ....

For an example of using awk to produce the desired result, see "jq, split a huge json of array and save into file named with a value".
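
If awk isn't at hand, a small Python script can play the same role. This is a minimal sketch of my own (the split.py name and the overwrite-on-collision behaviour are assumptions, not part of the linked answer); it consumes the three-values-per-group output of program.jq and writes the id1_id2.json files:

#!/usr/bin/env python3
# split.py -- reads the output of:  gunzip -c $gz | jq -nc --stream -f program.jq
# which emits, for every (id1, id2) pair, three compact JSON values on three
# consecutive lines: "id1", "id2", and the object of interest.
import json
import sys

lines = (line for line in sys.stdin if line.strip())
for first in lines:
    id1 = json.loads(first)        # e.g. "1023"
    id2 = json.loads(next(lines))  # e.g. "0101"
    value = json.loads(next(lines))
    # "w" silently overwrites on an id1_id2 collision; merge or append here
    # if collisions are possible in your data.
    with open(f"{id1}_{id2}.json", "w") as out:
        json.dump(value, out, indent=2)

It would slot into the pipeline in place of awk: gunzip -c $gz | jq -nc --stream -f program.jq | python3 split.py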

Caveat and Addendum

jq's streaming parser keeps memory usage low at the cost of speed, so the --stream option is usually only used as a last resort. From the description of the problem, it looks like you may be able to process some of the zipped files with jq's regular parser, so you might want to process those files the fast way and reserve the "atomize" approach for the files that are too big.
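
For the files that do fit in memory, the non-streaming route is far simpler. As a rough illustration only (a Python sketch rather than jq, with a made-up file name, and assuming the whole decompressed document fits in RAM):

import gzip
import json

# Load the entire document at once -- only viable for the smaller files.
with gzip.open("small_enough.json.gz", "rt") as f:
    doc = json.load(f)

for id1, inner in doc.get("flow", {}).items():
    for id2, value in inner.items():
        with open(f"{id1}_{id2}.json", "w") as out:
            json.dump(value, out, indent=2)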

Caution

The problem description does not make it clear what should be done if there is an id1_id2.json collision. If there is no possibility of such a collision, then of course there's no problem. Otherwise, it would be up to the program that creates those files to manage that contingency.




Answer 2:


You can use jq with the --stream option (see jq - I/O (Streaming)), which reads the input in a streaming fashion, allowing programs to start processing large JSON texts immediately rather than only after the parse completes (i.e. without first storing the entire file in RAM).

Assuming your input id strings are stored in shell variables:

id1=1023; id2=0101

Pipe the decompressed output of your huge gzip file into the following filter:

jq -n --arg v1 "$id1" --arg v2 "$id2" --stream 'fromstream(inputs) | objects | .flow[$v1][$v2]' > "$id1"_"$id2".json

Or, if the id names can't be pre-fetched and you need to supply them on the fly, use them directly:

jq -n --stream 'fromstream(inputs) | objects | .flow."1023"."0101"'



Answer 3:


What first comes to mind is treating the file as a stream and reading it incrementally. There are already libraries that treat JSON files as streams. For example, check out this snippet from the ijson library:

For JSON like:

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}

The processing would look like this:

import sys
import ijson
from urllib.request import urlopen   # urlopen was used below but never imported

# 'stream' is whatever the XML-ish output is written to; stdout will do here
stream = sys.stdout
continent = None   # avoid a NameError before the first continent key is seen

parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')
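
Applied to the structure in the question, a sketch along the same lines could stream one flow.<id1> subtree at a time and write the per-id2 files. Everything below is an assumption of mine rather than part of the answer: the input file name, the use of ijson.kvitems (available in recent ijson releases), and the premise that a single flow.<id1> subtree fits in memory even though the whole file does not.

import gzip
import json
import ijson

# Stream one flow.<id1> subtree at a time instead of loading the whole document.
with gzip.open("input.json.gz", "rb") as f:
    for id1, subtree in ijson.kvitems(f, "flow"):
        for id2, value in subtree.items():
            with open(f"{id1}_{id2}.json", "w") as out:
                # ijson yields Decimal for non-integer numbers (lat/lng here);
                # default=float converts them back for json.dump.
                json.dump(value, out, indent=2, default=float)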


Source: https://stackoverflow.com/questions/58408121/stream-parse-huge-json-file-into-small-files
