Question
I have a very large file (20GB+ compressed) called input.json
containing a stream of JSON objects as follows:
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typea"
}
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typea"
}
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typeb"
}
I want to split this file into separate files based on the type property: typea.json, typeb.json, etc., each containing its own stream of JSON objects that have only the matching type property.
I've managed to solve this problem for smaller files; however, with such a large file I run out of memory on my AWS instance. As I wish to keep memory usage down, I understand I need to use --stream, but I'm struggling to see how I can achieve this.
cat input.json | jq -c --stream 'select(.[0][0]=="type") | .[1]'
returns the value of each type property, but how do I then use those values to filter the objects?
Any help would be greatly appreciated!
Answer 1:
Assuming the JSON objects in the file are relatively small (none more than a few MB), you won't need the (rather complex) --stream command-line option, which is mainly needed when the input is (or includes) a single humongous JSON entity.
There are, however, several choices still to be made. The main ones are described at Split a JSON file into separate files: a multi-pass approach (N or N+1 calls to jq, where N is the number of output files), and an approach that involves just one call to jq followed by a call to a program such as awk to perform the actual partitioning into files. Each approach has its pros and cons, but if reading the input file N times is acceptable, then the first approach might be better.
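As a rough illustration of those two options, here is a sketch (untested against real data; it assumes the uncompressed stream is in input.json and that the type values are plain tokens such as typea that are safe to use in filenames):

# Multi-pass: one jq call to list the distinct types, then one filtering pass per type.
for t in $(jq -r '.type' input.json | sort -u); do
  jq -c --arg t "$t" 'select(.type == $t)' input.json > "$t.json"
done

# Single-pass: emit each type and its compact object on alternating lines,
# then let awk route each object into <type>.json.
jq -r '.type, tojson' input.json |
  awk 'NR % 2 == 1 { file = $0 ".json"; next } { print > file }'

Both variants rely on jq's ability to consume a stream of top-level JSON entities directly, without --stream; the single-pass variant may run into awk's open-file limit if there are very many distinct types, depending on the awk implementation.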
To estimate the total computational resources that will be required, it would probably be a good idea to measure the resources used by running jq empty input.json. (From your brief write-up, it sounds like the memory issue you've run into results primarily from the unzipping of the file.)
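For instance, peak memory can be measured with GNU time (this assumes the standalone /usr/bin/time from GNU is installed; the -v flag is not supported by the shell's built-in time):

/usr/bin/time -v jq empty input.json   # check "Maximum resident set size" in the report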
Answer 2:
Using jq to split the input into a NUL-delimited stream of (type, document) pairs, and using native bash (4.1 or later) to write those documents to the per-type files through a persistent set of file descriptors:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[1-3].*|4.0*) echo "ERROR: Bash 4.1 needed" >&2; exit 1;; esac
declare -A output_fds=( )
while IFS= read -r -d '' type && IFS= read -r -d '' content; do
  if [[ ${output_fds[$type]} ]]; then  # already have a file handle for this output file?
    curr_fd=${output_fds[$type]}       # reuse it, then.
  else
    exec {curr_fd}>"$type.json"        # open a new output file...
    output_fds[$type]=$curr_fd         # and store its file descriptor for use.
  fi
  printf '%s\n' "$content" >&"$curr_fd"
done < <(jq -j '(.type) + "\u0000" + (. | tojson) + "\u0000"')
This never reads more than a few records (admittedly, potentially multiple copies of each) into memory at a time, so it'll work with an arbitrarily large file so long as the records are of reasonable size.
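For completeness, a possible invocation (assuming the script above is saved as split.sh and made executable; the jq call at the end reads the JSON stream from the script's standard input):

./split.sh < input.json
# or, if the 20GB+ file is gzip-compressed (the filename here is only a guess),
# decompress on the fly so the uncompressed stream never hits the disk:
gzip -dc input.json.gz | ./split.sh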
Source: https://stackoverflow.com/questions/54725080/using-jq-how-can-i-split-a-json-stream-of-objects-into-separate-files-based-on