Importing wrongly concatenated JSONs in Python

前提是你 submitted on 2019-12-01 01:24:47

Use the raw_decode method of json.JSONDecoder

>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)

raw_decode returns a 2-tuple: the first element is the decoded JSON document and the second is the index in the string where that document ended, which is where the next document (if any) begins.

To loop until the end or until an invalid JSON element is encountered:

>>> while True:
...   try:
...     j,n = d.raw_decode(x)
...   except ValueError:
...     break
...   print(j)
...   x=x[n:]
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}

When the loop breaks, inspecting x reveals whether the whole string was processed (x is empty) or a JSON syntax error was encountered (x still holds the unparsed remainder).
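The loop above can be wrapped in a small helper. This is only a sketch under a few assumptions: the name decode_concatenated is made up for illustration, and instead of re-slicing the string it passes a start index as the second argument of raw_decode, which CPython's implementation accepts. It also skips whitespace between documents, because raw_decode raises ValueError when the text at the start position is whitespace.

```python
import json

def decode_concatenated(text):
    """Return (documents, stop_index) for back-to-back JSON documents.

    stop_index == len(text) means everything was consumed; a smaller
    value is the position where decoding stopped.
    """
    decoder = json.JSONDecoder()
    docs = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            break
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except ValueError:
            break  # pos still points at the text that failed to parse
        docs.append(obj)
    return docs, pos
```

If the returned index is smaller than len(text), the slice text[stop_index:] is the part that could not be parsed.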

With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
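A rough sketch of that chunked approach follows. The function name iter_json_stream and the chunk_size default are hypothetical, and it assumes every top-level document is an object or an array, so that a document cut off at a chunk boundary always fails to parse and simply waits for more data.

```python
import io
import json

def iter_json_stream(fileobj, chunk_size=65536):
    """Yield JSON documents from a text file of back-to-back documents."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # skip whitespace between documents
            while pos < len(buffer) and buffer[pos] in " \t\r\n":
                pos += 1
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except ValueError:
                break  # incomplete or invalid document: wait for more data
            yield obj
        buffer = buffer[pos:]  # carry the leftover into the next chunk
        if not chunk:
            break  # end of file reached
    if buffer.strip():
        raise ValueError("trailing data is not valid JSON: %r" % buffer[:40])
```

Usage would look like `for doc in iter_json_stream(open("input.txt")): ...`. Each failed attempt re-parses the leftover from its start, which is fine for short elements but wasteful if a single document spans many chunks.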

You can use the jq command line utility to convert your input to valid JSON. Let's say you have the following input:


{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}

You can use jq -s (slurp mode), which reads multiple JSON documents from the input and combines them into a single output array:

jq -s . input.txt

Gives you:

[
  {
    "name": "Bob Dylan",
    "tags": "{Artist}{Singer}"
  },
  {
    "name": "Michael Jackson"
  }
]

I've just realized that there are Python bindings for libjq, meaning you don't need the command line: you can use jq directly in Python.

However, I've not tried it so far. Let me give it a try :) ...

Update: The above library is nice, but it does not support the slurp mode so far.

You need to make a parser ... I don't think regex can help you here.

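import json

def get_dicts(file_text):
    data = ""     # text of the object currently being collected
    curlies = []  # stack of "{" characters that are still open
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # remove last
            if not curlies:
                # back at the top level: one complete object collected
                yield json.loads(data)
                data = ""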

Note that this does not actually solve the problem that {name:"bob"} is not valid JSON, whereas {"name":"bob"} is.

This will also break if you have unbalanced curly braces inside of strings, i.e. {"name":"{{}}}"} would break it, because the braces inside the string are counted as well.
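One way around the braces-in-strings problem is to track whether the scanner is currently inside a quoted string. The following is only a sketch (split_objects is a made-up name) and it still relies on json.loads for the actual validation, so input like {name:"bob"} fails just as before.

```python
import json

def split_objects(text):
    """Yield top-level {...} objects, ignoring braces inside JSON strings."""
    depth = 0          # number of currently open braces outside strings
    in_string = False  # True while between an opening and a closing quote
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current top-level object began
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:i + 1])
```

With this version, the input {"name":"{{}}}"}{"name": "Michael Jackson"} is split into two objects, because the braces inside the string value are no longer counted.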

Really, judging by your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that is generating it. If that is not feasible, you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).