Importing wrongly concatenated JSONs in Python

I have a text document containing several thousand JSON strings in the form "{...}{...}{...}". The document as a whole is not valid JSON, but each {...} is.

I currently use the following regular expression to split them:

import re

with open('my_file.txt', 'r') as fp:
    raw_dataset = re.sub('}{', '}\n{', fp.read()).split('\n')

This inserts a line break wherever one curly bracket closes and another opens (}{ -> }\n{), so I can split the text into separate lines.

The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.

An example would be:

'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'

Is parsed into

'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'

instead of

'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'

What is the proper way to achieve this for further JSON parsing?

Use the raw_decode method of json.JSONDecoder:

>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)

raw_decode returns a 2-tuple: the first element is the decoded JSON value and the second is the index in the string of the first character after the end of that JSON document.

To loop until the end or until an invalid JSON element is encountered:

>>> while True:
...   try:
...     j,n = d.raw_decode(x)
...   except ValueError:
...     break
...   print(j)
...   x=x[n:]
... 
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}

When the loop breaks, inspecting x will reveal whether the whole string was processed or a JSON syntax error was encountered: x is empty in the first case and still holds the unparsed remainder in the second.
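The same loop can be wrapped in a generator. The sketch below is one possible variant, not the answer's exact code: the name iter_json_documents is made up for illustration, it passes a start index to raw_decode instead of re-slicing the string, and it skips whitespace between documents because raw_decode does not do that itself. Unlike the loop above, it lets the ValueError propagate so a syntax error is not silently swallowed.

```python
import json


def iter_json_documents(text):
    """Yield each JSON value found back to back in `text`.

    Hypothetical helper: raises json.JSONDecodeError (a ValueError
    subclass) if something that is not valid JSON is encountered.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # The second argument is the index at which to start decoding;
        # the returned index is where this document ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


if __name__ == "__main__":
    sample = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
    for doc in iter_json_documents(sample):
        print(doc)
```

Passing the index avoids copying the remainder of the string after every document, which matters when the input holds thousands of them.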

With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
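A minimal sketch of that chunked approach follows. The function name iter_json_stream and the chunk_size default are assumptions for illustration. It also assumes every top-level value is an object or array: a bare number split across two chunks would be decoded as two separate numbers.

```python
import io
import json


def iter_json_stream(fp, chunk_size=65536):
    """Yield JSON values from a text file object, reading it in chunks.

    Hypothetical helper; assumes top-level values are objects or arrays.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # Skip whitespace between documents (raw_decode does not).
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except ValueError:
                # Incomplete (or invalid) document: wait for more data.
                break
            yield obj
        # Keep only the unparsed tail for the next round.
        buffer = buffer[pos:]
        if not chunk:
            # End of file: anything left over could not be parsed.
            if buffer:
                raise ValueError(f"unparsed trailing data: {buffer[:40]!r}")
            return


if __name__ == "__main__":
    text = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
    # A tiny chunk size forces documents to straddle chunk boundaries.
    for doc in iter_json_stream(io.StringIO(text), chunk_size=5):
        print(doc)
```

With a real file, the call would look like iter_json_stream(open('my_file.txt')). Note that an incomplete document is re-parsed from its start on every new chunk, so the chunk size should be comfortably larger than a typical element.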

You can use the jq command line utility to transform your input into valid JSON. Let's say you have the following input:

input.txt:

{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}

You can use jq -s, which consumes multiple JSON documents from the input and combines them into a single output array:

jq -s . input.txt

Gives you:

[
  {
    "name": "Bob Dylan",
    "tags": "{Artist}{Singer}"
  },
  {
    "name": "Michael Jackson"
  }
]

I've just realized that there are Python bindings for libjq, meaning you don't need the command line: you can use jq directly in Python.

https://github.com/mwilliamson/jq.py

However, I've not tried it so far. Let me give it a try :) ...

Update: The above library is nice, but it does not support the slurp mode so far.

You need to write a parser; I don't think a regex can help you here.

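import json

def get_dicts(file_text):
    data = ""       # text of the object currently being collected
    curlies = []    # stack of currently open braces
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # close the most recently opened brace
            if not curlies:
                # the outermost brace just closed: parse what was collected
                yield json.loads(data)
                data = ""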

Note that this does not actually solve the problem that {name:"bob"} is not valid JSON, whereas {"name":"bob"} is.

This will also break if you have unbalanced braces inside strings; for example, {"name":"{{}}}"} would break it.
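The brace-counting idea can be made to ignore braces inside string literals by also tracking quotes and backslash escapes. The following is only a sketch under that assumption, with a made-up name (split_json_objects); it still requires each top-level item to be a JSON object and does no validation beyond what json.loads performs on each slice.

```python
import json


def split_json_objects(text):
    """Yield each top-level {...} object in `text`, ignoring braces in strings.

    Hypothetical helper for illustration only.
    """
    depth = 0          # number of currently open braces outside strings
    in_string = False  # True while scanning inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current top-level object began
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character is escaped; skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError(f"unexpected '}}' at index {i}")
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:i + 1])


if __name__ == "__main__":
    for obj in split_json_objects('{"name":"{{}}}"}{"name": "Michael Jackson"}'):
        print(obj)
```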

Really, based on your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that generates it. If that is not feasible, you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).

Source: https://stackoverflow.com/questions/36019907/importing-wrongly-concatenated-jsons-in-python

Tags: python, json, regex