Question
I have a very large file (20GB+ compressed) called input.json
containing a stream of JSON objects as follows:
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typea"
}
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typea"
}
{
  "timestamp": "12345",
  "name": "Some name",
  "type": "typeb"
}
I want to split this file into separate files based on the type property: typea.json, typeb.json, etc., each containing its own stream of JSON objects that have only the matching type property.
I've managed to solve this problem for smaller files; however, with such a large file I run out of memory on my AWS instance. As I wish to keep memory usage down, I understand I need to use --stream, but I'm struggling to see how I can achieve this.
cat input.json | jq -c --stream 'select(.[0][0]=="type") | .[1]'
returns the value of each type property, but how do I then use those values to filter the objects?
Any help would be greatly appreciated!
Answer 1:
Assuming the JSON objects in the file are relatively small (none more than a few MB), you won't need the (rather complex) --stream command-line option, which is mainly needed when the input is (or includes) a single humongous JSON entity.
There are, however, several choices still to be made. The main ones are described at Split a JSON file into separate files: a multi-pass approach (N or N+1 calls to jq, where N is the number of output files), and an approach that involves just one call to jq followed by a call to a program such as awk to perform the actual partitioning into files. Each approach has its pros and cons, but if reading the input file N times is acceptable, then the first approach might be better.
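As a rough illustration of those two options, here is a sketch (untested against real data; it assumes the uncompressed stream is in input.json and that the type values are plain tokens such as typea that are safe to use in filenames):

# Multi-pass: one jq call to list the distinct types, then one filtering pass per type.
for t in $(jq -r '.type' input.json | sort -u); do
  jq -c --arg t "$t" 'select(.type == $t)' input.json > "$t.json"
done

# Single-pass: emit each type and its compact object on alternating lines,
# then let awk route each object into <type>.json.
jq -r '.type, tojson' input.json |
  awk 'NR % 2 == 1 { file = $0 ".json"; next } { print > file }'

Both variants rely on jq's ability to consume a stream of top-level JSON entities directly, without --stream; the single-pass variant may run into awk's open-file limit if there are very many distinct types, depending on the awk implementation.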
To estimate the total computational resources that will be required, it would probably be a good idea to measure the resources used by running jq empty input.json. (From your brief write-up, it sounds like the memory issue you've run into results primarily from the unzipping of the file.)
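For instance, peak memory can be measured with GNU time (this assumes the standalone /usr/bin/time from GNU is installed; the -v flag is not supported by the shell's built-in time):

/usr/bin/time -v jq empty input.json   # check "Maximum resident set size" in the report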
Answer 2:
Using jq to split the input into a NUL-delimited stream of (type, document) pairs, and using native bash (4.1 or later) to write those documents to the per-type files through a persistent set of file descriptors:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[1-3].*|4.0*) echo "ERROR: Bash 4.1 needed" >&2; exit 1;; esac
declare -A output_fds=( )
while IFS= read -r -d '' type && IFS= read -r -d '' content; do
  if [[ ${output_fds[$type]} ]]; then  # already have a file handle for this output file?
    curr_fd=${output_fds[$type]}       # reuse it, then.
  else
    exec {curr_fd}>"$type.json"        # open a new output file...
    output_fds[$type]=$curr_fd         # and store its file descriptor for use.
  fi
  printf '%s\n' "$content" >&"$curr_fd"
done < <(jq -j '(.type) + "\u0000" + (. | tojson) + "\u0000"')
This never reads more than a few records (admittedly, potentially multiple copies of each) into memory at a time, so it'll work with an arbitrarily large file so long as the records are of reasonable size.
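For completeness, a possible invocation (assuming the script above is saved as split.sh and made executable; the jq call at the end reads the JSON stream from the script's standard input):

./split.sh < input.json
# or, if the 20GB+ file is gzip-compressed (the filename here is only a guess),
# decompress on the fly so the uncompressed stream never hits the disk:
gzip -dc input.json.gz | ./split.sh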
Source: https://stackoverflow.com/questions/54725080/using-jq-how-can-i-split-a-json-stream-of-objects-into-separate-files-based-on