Process huge GeoJSON file with jq

Posted by 家住魔仙堡 on 2019-12-02 11:28:29

A one-pass jq-only approach may require more RAM than is available. For that case, a simple all-jq approach is shown below, together with a more economical approach that combines jq with awk.

The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.

In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

jq-only

jq -c  '.features[]' input.json |
    jq -c '.tippecanoe.minzoom = 13' |
    jq -c -s '{type: "FeatureCollection", features: .}'

jq and awk

jq -c '.features[]' input.json |
   jq -c '.tippecanoe.minzoom = 13' | awk '
     BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
     NR==1 { print; next }
           {print ","; print}
     END   {print "] }";}'
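For readers without awk at hand, the same reconstitution step can be sketched in Python. This is only an illustration of the idea (emit the wrapper, then the lines separated by commas, without ever holding more than one line in memory); the function name `wrap_features` is hypothetical and not part of the original answer.

```python
import io
import json


def wrap_features(lines, out):
    """Write a FeatureCollection to `out` from an iterable of JSON lines.

    Each line is expected to hold one compact JSON object, as produced
    by `jq -c`. Only one line is held in memory at a time.
    """
    out.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if not first:
            out.write(",\n")  # separator before every object except the first
        out.write(line)
        first = False
    out.write("\n] }\n")


# Small demonstration with two feature lines held in memory.
buffer = io.StringIO()
wrap_features(['{"id": 1}\n', '{"id": 2}\n'], buffer)
collection = json.loads(buffer.getvalue())
```

In a real pipeline the function would presumably be called as `wrap_features(sys.stdin, sys.stdout)` in place of the awk stage.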

Performance comparison

For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.

Timings (user + system CPU time, u+s):

jq-only:              15m 15s
jq-awk:                7m 40s
jq one-pass using map: 6m 53s

An alternative solution could be for example:

jq '.features |= map_values(.tippecanoe.minzoom = 13)'

To test this, I created a sample JSON as

d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}

and inspected the execution time as a function of N. Interestingly, while the map_values approach seems to have linear complexity in N, .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior: already for N=50000, the former method finishes in about 0.8 seconds, while the latter needs around 47 seconds.
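A measurement of this kind could be reproduced roughly as follows. This is a sketch under assumptions: the helper names `make_sample` and `time_filter` are hypothetical, jq is assumed to be installed and on the PATH, and wall-clock time is measured rather than CPU time.

```python
import json
import subprocess
import time


def make_sample(n):
    """Build the test document with n identical features."""
    return {
        "features": [
            {"type": "Feature", "properties": {"FEATCODE": 15014}}
            for _ in range(n)
        ]
    }


def time_filter(jq_filter, path):
    """Return wall-clock seconds for `jq <filter> <path>` (requires jq on PATH)."""
    start = time.perf_counter()
    subprocess.run(["jq", jq_filter, path], stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


# Tiny in-memory example; a real test would use a much larger n.
sample = make_sample(3)
```

To compare the two filters, one would write `make_sample(N)` to a file with `json.dump` for increasing N and call `time_filter` once with each filter on that file.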

Alternatively, one might just do it manually with, e.g., Python:

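import json
import sys

# Load the whole document into memory (needs enough RAM for the full file)
with open(sys.argv[1], 'r') as F:
    data = json.load(F)

# Set tippecanoe.minzoom on every feature, keeping any existing tippecanoe keys
for feature in data['features']:
    feature.setdefault("tippecanoe", {})["minzoom"] = 13

# Write the result directly to the output file
with open(sys.argv[2], 'w') as F:
    json.dump(data, F)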

In this case, map rather than map_values is far faster (*):

.features |= map(.tippecanoe.minzoom = 13)

However, this approach still requires enough RAM to hold the entire document in memory.

p.s. If you want to use jq to generate a large file for timing, consider:

def N: 1000000;

def data:
   {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

(*) Using map: about 20s for a 100MB input, with approximately linear scaling.

Here, based on the work of @nicowilliams at GitHub, is a solution that uses the streaming parser available with jq. The solution is very economical with memory, but is currently quite slow if the input is large.

The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.

Invocation:

jq -cnr --stream -f program.jq input.json

program.jq

# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end ;

# Input: the object to be added
# Output: text
def output:
  . as $object
  | ( "[",
      foreach fromstream( inject($object) ) as $o
        (0;
         if .==0 then 1 else 2 end;
         if .==1 then $o else ",", $o end),
      "]" ) ;

{}
| .tippecanoe.minzoom = 13
| output
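To see why the test `(.[0]|length == 1) and length == 1` marks the end of each feature, it can help to look at the events involved. The Python sketch below is a rough emulation of how jq's `tostream` and `truncate_stream` behave, written only for illustration; it is not jq's implementation, and the Python function names merely mirror the jq builtins and the `inject` function above.

```python
def tostream(value, path=()):
    """Yield jq-style stream events for a parsed JSON value."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        # Scalars and empty containers are leaves: [path, value]
        yield [list(path), value]
        return
    for key, child in items:
        yield from tostream(child, path + (key,))
    # Closing event: the path of the last child, with no value
    yield [list(path) + [items[-1][0]]]


def truncate_stream(depth, events):
    """Drop the first `depth` path elements; discard events that are too short."""
    for event in events:
        if len(event[0]) > depth:
            yield [event[0][depth:]] + event[1:]


def inject(extra, events):
    """Replace each feature's top-level closing event with the events of `extra`."""
    extra_events = list(tostream(extra))
    for event in truncate_stream(2, events):
        if len(event) == 1 and len(event[0]) == 1:
            # The last of the extra events closes the feature instead
            yield from extra_events
        else:
            yield event


doc = {"features": [{"type": "Feature"}]}
events = list(inject({"tippecanoe": {"minzoom": 13}}, tostream(doc)))
# events == [[["type"], "Feature"],
#            [["tippecanoe", "minzoom"], 13],
#            [["tippecanoe", "minzoom"]],
#            [["tippecanoe"]]]
```

After truncation by two levels, a one-element event whose path has length one is the closing event of a feature object, so swapping it for the events of the extra object appends the new key and still leaves a valid closing event for the feature.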

Generation of test data

def data(N):
 {"features":
  [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

Example output

With N=2:

[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]