问题
I have JSON exported from Cassandra in this format.
[
{
"correlationId": "2232845a8556cd3219e46ab8",
"leg": 0,
"tag": "received",
"offset": 263128,
"len": 30,
"prev": {
"page": {
"file": 0,
"page": 0
},
"record": 0
},
"data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
},
{
"correlationId": "2232845a8556cd3219e46ab8",
"leg": 0,
"tag": "sent",
"offset": 262971,
"len": 157,
"prev": {
"page": {
"file": 10330,
"page": 6
},
"record": 1271
},
"data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
}]
I would like to split it to separate documents:
{ "correlationId": "2232845a8556cd3219e46ab8", "leg": 0, "tag": "received", "offset": 263128, "len": 30, "prev": { "page": { "file": 0, "page": 0 }, "record": 0 }, "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n" }
and
{ "correlationId": "2232845a8556cd3219e46ab8", "leg": 0, "tag": "sent", "offset": 262971, "len": 157, "prev": { "page": { "file": 10330, "page": 6 }, "record": 1271 }, "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n" }
I wanted to use jq but didn't find way how.
Can you please advise way, how to split it by the document separator?
Thanks, Reddy
回答1:
Using jq, one can split an array into its components using the filter:
.[]
The question then becomes what is to be done with each component. If you want to direct each component to a separate file, you could (for example) use jq with the -c option, and filter the result into awk, which can then allocate the components to different files. See e.g. Split JSON File Objects Into Multiple Files
Performance considerations
One might think that the overhead of calling jq+awk would be high compared to calling python, but both jq and awk are lightweight compared to python+json, as suggested by these timings (using Python 2.7.10):
time (jq -c .[] input.json | awk '{print > "doc00" NR ".json";}')
user 0m0.005s
sys 0m0.008s
time python split.py
user 0m0.016s
sys 0m0.046s
回答2:
In case you have an array of 2 objects:
jq '.[0]' input.json > doc1.json && jq '.[1]' input.json > doc2.json
Results:
$ head -n100 doc[12].json
==> doc1.json <==
{
"correlationId": "2232845a8556cd3219e46ab8",
"leg": 0,
"tag": "received",
"offset": 263128,
"len": 30,
"prev": {
"page": {
"file": 0,
"page": 0
},
"record": 0
},
"data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
}
==> doc2.json <==
{
"correlationId": "2232845a8556cd3219e46ab8",
"leg": 0,
"tag": "sent",
"offset": 262971,
"len": 157,
"prev": {
"page": {
"file": 10330,
"page": 6
},
"record": 1271
},
"data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
}
回答3:
You can do it more efficiently using Python (because you can read the entire input once, instead of once per document):
import json
docs = json.load(open('in.json'))
for ii, doc in enumerate(docs):
with open('doc{}.json'.format(ii), 'w') as out:
json.dump(doc, out, indent=2)
回答4:
To split a json with many records into chunks of a desired size I simply use:
jq -c '.[0:1000]' mybig.json
which works like python slicing.
See the docs here: https://stedolan.github.io/jq/manual/
Array/String Slice: .[10:15]
The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
回答5:
One way to do this is using jq's stream option and piping that to the split command
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' bigfile.json | split -l $num_of_elements_in_a_file - big_part
The number of lines per file varies according to the value that you put into num_of_elements_in_a_file,
You can check out this answer Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects? which refers to this page for a discussion on how to use the streaming parser https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser
来源:https://stackoverflow.com/questions/48790861/split-json-array-into-separate-files-objects