jq streaming - filter nested list and retain global structure

Submitted by 非 Y 不嫁゛ on 2020-06-25 21:14:38

Question


In a large json file, I want to remove some elements from a nested list, but keep the overall structure of the document.

My example input is this (but the real one is large enough to demand streaming).

{
  "keep_untouched": {
    "keep_this": [
      "this",
      "list"
    ]
  },
  "filter_this":
  [
    {"keep" : "true"},
    {
      "keep": "true",
      "extra": "keeper"
    } ,
    {
      "keep": "false",
      "extra": "non-keeper"
    }
  ]
}

The required output just has one element of the 'filter_this' block removed:

{
  "keep_untouched": {
    "keep_this": [
      "this",
      "list"
    ]
  },
  "filter_this":
  [
    {"keep" : "true"},
    {
      "keep": "true",
      "extra": "keeper"
    }
  ]
}

The standard way to handle such cases appears to be using 'truncate_stream' to reconstitute streamed objects, before filtering those in the usual jq way. Specifically, the command:

jq -nc --stream 'fromstream(1|truncate_stream(inputs))' 

gives access to a stream of objects:

{"keep_this":["this","list"]}
[{"keep":"true"},{"keep":"true","extra":"keeper"},{"keep":"false","extra":"non-keeper"}]

at which point it is easy to filter for the required objects. However, this strips the results from the context of their parent object, which is not what I want.

Looking at the streaming structure:

[["keep_untouched","keep_this",0],"this"]
[["keep_untouched","keep_this",1],"list"]
[["keep_untouched","keep_this",1]]
[["keep_untouched","keep_this"]]
[["filter_this",0,"keep"],"true"]
[["filter_this",0,"keep"]]
[["filter_this",1,"keep"],"true"]
[["filter_this",1,"extra"],"keeper"]
[["filter_this",1,"extra"]]
[["filter_this",2,"keep"],"false"]
[["filter_this",2,"extra"],"non-keeper"]
[["filter_this",2,"extra"]]
[["filter_this",2]]
[["filter_this"]]

it seems I need to select all the 'filter_this' rows, truncate those rows only (using 'truncate_stream'), rebuild these rows as objects (using 'fromstream'), filter them, and turn the objects back into the stream data format (using 'tostream') to join the stream of 'keep_untouched' rows, which are still in the streaming format. At that point it would be possible to rebuild the whole JSON document. If that is the right approach (it seems overly convoluted to me), how do I do that? Or is there a better way?


Answer 1:


If your input file consists of a single very large JSON entity that is too big for the regular jq parser to handle in your environment, then there is the distinct possibility that you won't have enough memory to reconstitute the JSON document.

With that caveat, the following may be worth a try. The key insight is that reconstruction can be accomplished using reduce.
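As a minimal sketch of that insight, assuming jq 1.5 or later (for `--stream` and destructuring in `reduce`), the snippet below round-trips a small document: it keeps only the [path, leaf] events and folds them back together with `setpath`. The sample document and the variable names `doc` and `rebuilt` are made up for illustration.

```shell
# Hypothetical sample document, small enough to check by eye.
doc='{"a":{"b":[1,2]},"c":"x"}'

# Keep only the [path, leaf] events (length 2), then fold them
# back into a single document with reduce and setpath.
rebuilt=$(printf '%s' "$doc" |
  jq -c --stream 'select(length==2)' |
  jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))')

echo "$rebuilt"
```

The closing events (length 1) can be dropped because `setpath` creates the intermediate objects and arrays on demand.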

The following uses a bunch of temporary files for the sake of clarity:

TMP=/tmp/$$

jq -c --stream 'select(length==2)' input.json > $TMP.streamed

jq -c 'select(.[0][0] != "filter_this")' $TMP.streamed > $TMP.1

jq -c 'select(.[0][0] == "filter_this")' $TMP.streamed |
  jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
          | .filter_this |= map(select(.keep=="true"))
          | tostream
          | select(length==2)' > $TMP.2

# Reconstruction
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' $TMP.1 $TMP.2

Output

{
  "keep_untouched": {
    "keep_this": [
      "this",
      "list"
    ]
  },
  "filter_this": [
    {
      "keep": "true"
    },
    {
      "keep": "true",
      "extra": "keeper"
    }
  ]
}
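
If the temporary files are not needed, the same steps can probably be collapsed into one invocation. This is only a sketch under the same memory caveat as above: the whole document is still rebuilt in memory before the list is filtered. The sample file written with `mktemp` stands in for the real input.json, and the variable names are illustrative.

```shell
# Hypothetical sample file standing in for the large input.json.
input=$(mktemp)
cat > "$input" <<'EOF'
{"keep_untouched":{"keep_this":["this","list"]},
 "filter_this":[{"keep":"true"},
                {"keep":"true","extra":"keeper"},
                {"keep":"false","extra":"non-keeper"}]}
EOF

# One invocation: fold the leaf events into a document, then filter the list.
result=$(jq -nc --stream '
  reduce (inputs | select(length==2)) as [$p,$x] (null; setpath($p;$x))
  | .filter_this |= map(select(.keep == "true"))
' "$input")

echo "$result"
rm -f "$input"
```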



Answer 2:


Many thanks to @peak. I found his approach really useful, but unrealistic in terms of performance. Stealing some of @peak's ideas, though, I came up with the following:

Extract the 'parent' object:

TMP=/tmp/$$   # same temp-file prefix as in @peak's answer

jq -c --stream 'select(length==2)' input.json |
  jq -c 'select(.[0][0] != "filter_this")'  | 
  jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' > $TMP.parent

Extract the 'keepers' - though this means reading the file twice (:-<):

jq -nc --stream '[fromstream(2|truncate_stream(inputs))
                  | select(type == "object" and .keep == "true")]             
                ' input.json > $TMP.keepers

Insert the filtered list into the parent object.

jq -nc -s 'inputs as $items
           | $items[0] as $parent
           | $parent
           | .filter_this |= $items[1]
          '  $TMP.parent $TMP.keepers > result.json



Answer 3:


Here is a simplified version of @PeteC's script. It requires one fewer invocation of jq.

In both cases, please note that the invocation of jq that uses "2|truncate_stream(_)" requires a more recent version of jq than 1.5.

TMP=/tmp/$$

INPUT=input.json

# Extract all but .filter_this
< $INPUT jq -c --stream 'select(length==2 and .[0][0] != "filter_this")' |
    jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
           ' > $TMP.parent

# Need jq > 1.5
# Extract the 'keepers'
< $INPUT jq -n -c --stream '
  [fromstream(2|truncate_stream(inputs))
   | select(type == "object" and .keep == "true")]
  ' > $TMP.keepers

# Insert the filtered list into the parent object:
jq -s '. as $in | .[0] | (.filter_this |= $in[1])
      ' $TMP.parent $TMP.keepers > result.json
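
To see why the 'keepers' step needs the `type == "object"` test, it may help to look at what the depth-2 truncation emits on its own. The sketch below assumes a jq version newer than 1.5 and uses a shortened, made-up sample; `sample` and `items` are illustrative names.

```shell
# Hypothetical, shortened version of the example input.
sample='{"keep_untouched":{"keep_this":["this","list"]},"filter_this":[{"keep":"true"},{"keep":"false","extra":"non-keeper"}]}'

# Emit every value found two levels below the root, one per line.
items=$(printf '%s' "$sample" |
  jq -nc --stream 'fromstream(2|truncate_stream(inputs))')

echo "$items"
```

The first line printed is the array from .keep_untouched.keep_this, which also sits two levels down; the type test discards it before `.keep` is inspected.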


Source: https://stackoverflow.com/questions/49531722/jq-streaming-filter-nested-list-and-retain-global-structure
