Regex to match nested JSON objects

Anonymous (unverified), submitted 2019-12-03 01:49:02

Question:

I'm implementing a parser of sorts and need to locate and deserialize JSON objects embedded in other semi-structured data. I used the regexp:

\\{\\s*title.*?\\} 

to locate the object

{title:'Title'} 

but it doesn't work with nested objects, because the expression stops at the first closing curly bracket it finds. For

{title:'Title',{data:'Data'}} 

it matches

{title:'Title',{data:'Data'} 

so the string is invalid for deserialization. I understand that greediness comes into play here, but I'm not familiar with regexps. Could you please help me extend the expression so that it consumes all available closing curly brackets?

Update:

To be clear, this is an attempt to extract JSON from semi-structured data such as HTML+JS with embedded JSON. I'm using the Gson Java library to actually parse the extracted JSON.

Answer 1:

As others have suggested, a full-blown JSON parser is probably the way to go. If you want to match the key-value pairs in the simple examples that you have above, you could use:

(?

For the input string

{title:'Title',  {data:'Data', {foo: 'Bar'}}} 

This matches:

 1. title:'Title'
 2. data:'Data'
 3. foo: 'Bar'


Answer 2:

Thanks to @Sanjay T. Sharma, who pointed me to "brace matching", which eventually gave me some understanding of greedy expressions, and thanks to the others for telling me up front what I shouldn't do. Fortunately, it turned out that it's OK to use the greedy variant of the expression

\\{\\s*title.*\\} 

because there is no non-JSON data between closing brackets.
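The difference between the two variants can be seen in a small sketch. It uses only java.util.regex; the class name GreedyVsLazy, the helper firstMatch, and the sample inputs are made up for illustration. The last call shows why the greedy form is only safe under the condition stated above: as soon as other text follows the object on the same line and contains a }, the match runs past the object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String lazy = "\\{\\s*title.*?\\}";   // stops at the first '}'
        String greedy = "\\{\\s*title.*\\}";  // runs to the last '}' on the line

        String nested = "var x = {title:'Title',{data:'Data'}};";
        System.out.println(firstMatch(lazy, nested));   // {title:'Title',{data:'Data'}
        System.out.println(firstMatch(greedy, nested)); // {title:'Title',{data:'Data'}}

        // Greedy over-matches when another '}' appears later on the same line.
        String twoObjects = "{title:'A'}; var b = {other:'B'};";
        System.out.println(firstMatch(greedy, twoObjects)); // {title:'A'}; var b = {other:'B'}
    }
}
```

Note that . does not match line terminators by default, so both patterns only look at a single line unless Pattern.DOTALL is used.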



Answer 3:

This is absolutely horrible and I can't believe I'm actually putting my name to this solution, but could you not locate the first { character that is in a JavaScript block and attempt to parse the remaining characters with a proper JSON parsing library? If it works, you've got a match. If it doesn't, keep reading until the next { character and start over.

There are a few issues there, but they can probably be worked around:

  • you need to be able to identify JavaScript blocks. Most languages have HTML-to-DOM libraries (I'm a big fan of Cyberneko for Java) that make it easy to focus on those blocks.
  • your JSON parsing library needs to stop consuming characters from the stream as soon as it spots an error, and it must not close the stream when it does.

An improvement would be, once you've found the first {, to look for its matching } (a simple counter that is incremented whenever you find a { and decremented whenever you find a } should do the trick). Attempt to parse the resulting string as JSON. Iterate until it works or you've run out of likely blocks.
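The counter idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class BraceScanner and its methods balancedFrom and candidates are hypothetical names, and the scan deliberately ignores braces that occur inside string literals or comments, which real input may contain. Each returned candidate would still have to be handed to the JSON parser (Gson in the question) to find out whether it actually deserializes.

```java
import java.util.ArrayList;
import java.util.List;

class BraceScanner {
    // Returns the substring from the '{' at index start up to its matching '}',
    // or null if start is not a '{' or the braces never balance.
    // Assumption: braces inside quoted strings are not treated specially.
    static String balancedFrom(String text, int start) {
        if (start < 0 || start >= text.length() || text.charAt(start) != '{') {
            return null;
        }
        int depth = 0;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return text.substring(start, i + 1);
                }
            }
        }
        return null; // ran out of input before the braces balanced
    }

    // Collects one balanced candidate per '{' in the text, in order of appearance.
    static List<String> candidates(String text) {
        List<String> out = new ArrayList<>();
        for (int i = text.indexOf('{'); i >= 0; i = text.indexOf('{', i + 1)) {
            String candidate = balancedFrom(text, i);
            if (candidate != null) {
                out.add(candidate);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String js = "var a = {title:'Title',{data:'Data'}}; f({x:1});";
        // Prints the outer object first, then the nested one, then {x:1}.
        for (String candidate : candidates(js)) {
            System.out.println(candidate);
        }
    }
}
```

Trying the candidates in order means the outermost object is attempted first; if the parser rejects it, the loop naturally falls through to the next { as described above.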

This is ugly, hackish and should never make it to production code. I get the impression that you only need it for a batch job, though, which is why I'm even suggesting it.


