Python: parsing JSON-like Javascript data structures (w/ consecutive commas)

南旧 2020-12-10 22:55

I would like to parse JSON-like strings. Their only difference from normal JSON is the presence of contiguous commas in arrays. When there are two such commas, it i

6 Answers
  • 2020-12-10 23:02

    I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs. It works well at simulating JavaScript's eval() but fails in one situation: trailing commas. A trailing comma should be optional (see the tests below), but I can't find a proper way to implement this.

    from pyparsing import *
    
    TRUE = Keyword("true").setParseAction(replaceWith(True))
    FALSE = Keyword("false").setParseAction(replaceWith(False))
    NULL = Keyword("null").setParseAction(replaceWith(None))
    
    jsonString = dblQuotedString.setParseAction(removeQuotes)
    jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                        Optional('.' + Word(nums)) +
                        Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))
    
    jsonObject = Forward()
    jsonValue = Forward()
    # Elision handling: a comma with no value before it stands for a missing
    # element, which is emitted as None.
    commaToNull = Literal(',').setParseAction(replaceWith(None))
    jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
    jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
    jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
    memberDef = Group(jsonString + Suppress(':') + jsonValue)
    jsonMembers = delimitedList(memberDef)
    jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))
    
    jsonComment = cppStyleComment
    jsonObject.ignore(jsonComment)
    
    def convertNumbers(s, l, toks):
        n = toks[0]
        try:
            return int(n)
        except ValueError:
            return float(n)
    
    jsonNumber.setParseAction(convertNumbers)
    
    def test():
        tests = (
            '[1,2]',       # ok
            '[,]',         # ok
            '[,,]',        # ok
            '[  , ,  , ]', # ok
            '[,1]',        # ok
            '[,,1]',       # ok
            '[1,,2]',      # ok
            '[1,]',        # failure, I got [1, None], I should have [1]
            '[1,,]',       # failure, I got [1, None, None], I should have [1, None]
        )
        for test in tests:
            results = jsonArray.parseString(test)
            print(results.asList())
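
    The two failing cases follow JavaScript's array-literal rule: a comma with no value before it produces a hole, while a comma directly after a value is only a separator, so a trailing one adds nothing. The following is a minimal hand-written sketch of that rule using only the standard library, not a pyparsing grammar. The name `parse_js_array` is hypothetical, and the sketch assumes that scalars and objects are otherwise strict JSON (holes inside arrays nested in objects are not handled).

```python
import json

_decoder = json.JSONDecoder()


def _skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i] in " \t\r\n":
        i += 1
    return i


def parse_js_array(s, i=0):
    """Parse a JS-style array literal starting at s[i]; return (list, end_index)."""
    i = _skip_ws(s, i)
    if i >= len(s) or s[i] != "[":
        raise ValueError("expected '[' at index %d" % i)
    i += 1
    items = []
    while True:
        i = _skip_ws(s, i)
        if i >= len(s):
            raise ValueError("unterminated array")
        c = s[i]
        if c == "]":
            return items, i + 1
        if c == ",":
            # A comma with no value before it is a hole, emitted as None.
            items.append(None)
            i += 1
            continue
        if c == "[":
            value, i = parse_js_array(s, i)
        else:
            # Strings, numbers, objects, true/false/null: strict JSON.
            value, i = _decoder.raw_decode(s, i)
        items.append(value)
        i = _skip_ws(s, i)
        if i < len(s) and s[i] == ",":
            i += 1  # separator after a value; a trailing one adds nothing
        elif i < len(s) and s[i] != "]":
            raise ValueError("expected ',' or ']' at index %d" % i)


for text in ("[1,]", "[1,,]", "[,1]", "[1,,2]"):
    print(text, parse_js_array(text)[0])
```

    Treating "comma after a value" and "comma with nothing before it" as two different tokens is what makes the trailing comma fall out naturally; the same distinction would have to be expressed in the pyparsing grammar.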
    
  • 2020-12-10 23:06

    It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.

    import re
    import json
    
    not_quite_json = '["foo",,,"bar",[1,,3,4]]'
    not_json = True
    while not_json:
        not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)
    

    Which leaves us with:

    '["foo", null, null, "bar",[1, null, 3,4]]'
    

    We can then do:

    json.loads(not_quite_json)
    

    Giving us:

    ['foo', None, None, 'bar', [1, None, 3, 4]]
    

    Note that a single replace pass is not enough, because each replacement inserts a comma that may itself need replacing. You therefore have to loop until no more replacements are made. Here a simple regex does the job.

  • 2020-12-10 23:12

    You can do the comma replacement of Lattyware's/przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace all commas that are preceded by just a comma":

    >>> import re
    >>> s = '["foo",,,"bar",[1,,3,4]]'
    
    >>> re.sub(r'(?<=,)\s*,', ' null,', s)
    '["foo", null, null,"bar",[1, null,3,4]]'
    

    Note that this only works for simple inputs where you can assume, for example, that no string literal contains consecutive commas. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.
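
    A possible middle ground, short of a full parser, is to let the regex also match double-quoted string literals and pass them through unchanged, so that commas inside strings are never touched. This is a sketch under assumptions: `fill_elisions` is a hypothetical name, strings are assumed to be double-quoted with backslash escapes, and a comma directly after `[` or before `]` is still not handled.

```python
import json
import re

# Either a complete double-quoted string literal, or a comma that directly
# follows another comma (optionally with whitespace in between).
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|(?<=,)\s*,')


def fill_elisions(text):
    """Insert null for each empty slot between commas, skipping string literals."""
    def repl(match):
        token = match.group(0)
        # String literals are returned untouched; comma matches become " null,".
        return token if token.startswith('"') else ' null,'
    return _TOKEN.sub(repl, text)


print(json.loads(fill_elisions('["a,,b",,1]')))
```

    Because the string alternative is tried first and consumes the whole literal, the lookbehind never gets a chance to fire inside a string.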

  • 2020-12-10 23:15

    Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.

    Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.

  • 2020-12-10 23:25

    Small & simple workaround to try out:

    1. Convert JSON-like data to strings.
    2. Replace ",," with ",null," (repeat until none remain, since runs of three or more commas need several passes).
    3. Convert it to whatever is your representation.
    4. Let JSONDecoder() do the heavy lifting.

    Steps 1 and 3 can be omitted if you are already dealing with strings.

    (And if converting to string is impractical, update your question with this info!)
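
    The steps above can be sketched as follows, assuming the data is already a string, has no whitespace between the commas, and contains no ",," inside string literals. The name `loads_with_holes` is hypothetical.

```python
import json


def loads_with_holes(text):
    """Fill empty array slots with null, then decode as ordinary JSON."""
    # Repeat until stable: one pass over ",,," leaves ",null,," behind.
    while ",," in text:
        text = text.replace(",,", ",null,")
    return json.JSONDecoder().decode(text)


print(loads_with_holes('["foo",,,"bar",[1,,3,4]]'))
```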

  • 2020-12-10 23:25

    For those looking for something quick and dirty to convert general JS objects to dicts: a page on a real site embedded an object I wanted to parse. It contained 'new' constructs for dates and was all on one line with no spaces in between, so two substitutions suffice:

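    from re import sub

    data = sub(r'new Date\(([^)]*)\)', r'\1', data)  # keep the whole argument, not just its last character
    data = sub(r'([,{])(\w*):', r'\1"\2":', data)    # quote bare object keys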
    

    Then json.loads() worked fine. Your mileage may vary:)
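
    As an illustration, here are the two substitutions applied to a made-up one-line object. The sample data and the name `js_object_to_dict` are hypothetical, and the first pattern is written with the capture group as `([^)]*)` so that the whole `Date` argument is kept. It assumes each `new Date(...)` takes a single numeric argument and that no string value contains text resembling a bare key.

```python
import json
import re


def js_object_to_dict(data):
    """Turn a compact JS object literal into a dict via two regex rewrites."""
    # new Date(1607641500000) -> 1607641500000
    data = re.sub(r'new Date\(([^)]*)\)', r'\1', data)
    # {id: -> {"id":   and   ,tags: -> ,"tags":
    data = re.sub(r'([,{])(\w*):', r'\1"\2":', data)
    return json.loads(data)


sample = '{id:1,created:new Date(1607641500000),tags:["a","b"]}'
print(js_object_to_dict(sample))
```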
