Regular expression to parse an array of JSON objects?

Submitted by 孤街醉人 on 2019-11-27 02:45:22

Question


I'm trying to parse an array of JSON objects into an array of strings in C#. I can extract the array from the JSON object, but I can't split the array string into an array of individual objects.

What I have is this test string:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Right now I'm using the following regular expressions to split the items into individual objects. They are two separate regular expressions until I fix the problem with the second one:

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

The arrayFinder regex works the way I'd expect, but for reasons I don't understand, the arrayParser regex doesn't work at all. All I want it to do is split the individual items into their own strings so I get a list like this:

{id:0,name:"Lorem Ipsum"}
{id:1,name:"Lorem Ipsum"}
{id:2,name:"Lorem Ipsum"}

Whether this list is a string[] array or a Group or Match collection doesn't matter, but I'm stumped as to how to get the objects split. Using the arrayParser and the json string declared above, I tried the following code, which I assumed would work, but had no luck:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

string array = arrayFinder.Match(json).Groups["items"].Value;
// At this point the 'array' variable contains: 
// {id:0,name:"Lorem Ipsum"},{id:1,name:"Lorem Ipsum"},{id:2,name:"Lorem Ipsum"}

// I would have expected one of these 2 lines to return 
// the array of matches I'm looking for
CaptureCollection c = arrayParser.Match(array).Captures;
GroupCollection g = arrayParser.Match(array).Groups;

Can anybody see what it is I'm doing wrong? I'm totally stuck on this.


Answer 1:


Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions. JSON is essentially balanced parentheses plus a bunch of other stuff, with the parens replaced by braces and brackets. In the hierarchy of formal languages, JSON is a context-free language. Regular expressions can't parse context-free languages in general.

Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they're all ugly hacks, they're all unportable, and they're all ultimately the wrong tool for the job.
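As one concrete instance of such an extension, .NET's regex engine supports balancing group definitions. The sketch below adapts the commonly cited angle-bracket example for that feature to braces. The names BraceChecker and IsBalanced are made up for illustration, and the pattern only checks that braces nest correctly. It treats a brace inside a string value as a real brace, so it is not a JSON validator.

```csharp
using System.Text.RegularExpressions;

public static class BraceChecker
{
    // (?<Open>\{) pushes a capture for every '{'.
    // (?<Close-Open>\}) pops one "Open" capture for every '}' and fails
    // if there is nothing left to pop.
    // (?(Open)(?!)) forces a failure if any '{' is still unmatched at the end.
    private static readonly Regex Balanced = new Regex(
        @"^[^{}]*(((?<Open>\{)[^{}]*)+((?<Close-Open>\})[^{}]*)+)*(?(Open)(?!))$");

    // Returns true when every '{' in the text has a matching '}' in order.
    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }
}
```

The pattern is hard to read and does not carry over to most other regex engines, which is the portability complaint made above.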

In professional work, you would almost always use an existing JSON parser. If you want to roll your own for educational purposes then I'd suggest starting with a simple arithmetic grammar that supports + - * / ( ). (JSON has some escaping rules which, while not complex, will make your first attempt harder than it needs to be.) Basically, you'll need to:

  1. Decompose the language into an alphabet of symbols
  2. Write a context-free grammar in terms of those symbols that recognizes the language
  3. Convert the grammar into Chomsky normal form, or near enough to make step 5 easy
  4. Write a lexer that converts raw text into your input alphabet
  5. Write a recursive descent parser that takes your lexer's output, parses it, and produces some kind of output

This is a typical third-year CS assignment at just about any university.
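The steps above can be sketched for the suggested arithmetic grammar roughly as follows. This is a minimal illustration, not a reference implementation: all names (ArithmeticParser, Evaluate, Tokenize and so on) are hypothetical, it evaluates the expression instead of building a tree, and it goes straight from a small hand-written grammar to the lexer and parser (steps 1, 2, 4 and 5).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Grammar assumed for this sketch:
//   Expr   -> Term (('+' | '-') Term)*
//   Term   -> Factor (('*' | '/') Factor)*
//   Factor -> NUMBER | '(' Expr ')' | '-' Factor
public sealed class ArithmeticParser
{
    private readonly List<string> _tokens;
    private int _pos;

    private ArithmeticParser(List<string> tokens) { _tokens = tokens; }

    // Entry point: lex, parse, and make sure no tokens are left over.
    public static double Evaluate(string text)
    {
        var parser = new ArithmeticParser(Tokenize(text));
        double value = parser.ParseExpr();
        if (parser._pos != parser._tokens.Count)
            throw new FormatException("Unexpected token: " + parser._tokens[parser._pos]);
        return value;
    }

    // Lexer: turns raw text into number tokens and single-character operator tokens.
    private static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (char.IsDigit(c) || c == '.')
            {
                int start = i;
                while (i < text.Length && (char.IsDigit(text[i]) || text[i] == '.')) i++;
                tokens.Add(text.Substring(start, i - start));
            }
            else if ("+-*/()".IndexOf(c) >= 0) { tokens.Add(c.ToString()); i++; }
            else throw new FormatException("Unexpected character: " + c);
        }
        return tokens;
    }

    // Returns the current token without consuming it, or null at end of input.
    private string Peek() { return _pos < _tokens.Count ? _tokens[_pos] : null; }

    private double ParseExpr()
    {
        double value = ParseTerm();
        while (Peek() == "+" || Peek() == "-")
        {
            string op = _tokens[_pos++];
            double rhs = ParseTerm();
            value = op == "+" ? value + rhs : value - rhs;
        }
        return value;
    }

    private double ParseTerm()
    {
        double value = ParseFactor();
        while (Peek() == "*" || Peek() == "/")
        {
            string op = _tokens[_pos++];
            double rhs = ParseFactor();
            value = op == "*" ? value * rhs : value / rhs;
        }
        return value;
    }

    private double ParseFactor()
    {
        string token = Peek();
        if (token == null) throw new FormatException("Unexpected end of input");
        _pos++;
        if (token == "-") return -ParseFactor();
        if (token == "(")
        {
            // Recursion happens here: each '(' costs one level of call stack.
            double inner = ParseExpr();
            if (Peek() != ")") throw new FormatException("Expected ')'");
            _pos++;
            return inner;
        }
        return double.Parse(token, CultureInfo.InvariantCulture);
    }
}
```

Each grammar rule becomes one method, and nesting in the input becomes nesting on the call stack.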

The next step is to find out how complex a JSON string you need to trigger a stack overflow in your recursive parser. Then look at the other types of parsers that can be written, and you'll understand why anyone who has to parse a context-free language in the real world uses a tool like yacc or antlr instead of writing a parser by hand.

If that's more learning than you were looking for then you should feel free to go use an off-the-shelf JSON parser, satisfied that you learned something important and useful: the limits of regular expressions.




Answer 2:


"Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions"

bla bla bla ... check this out:

arrayParser = "(?<Key>[\w]+)":"?(?<Value>([\s\w\d\.\\\-/:_]+(,[,\s\w\d\.\\\-/:_]+)?)+)"?

This works for me.

If you want to match empty values, change the last '+' to '*'.




Answer 3:


Are you using .NET 3.5? If so, you can use the DataContractJsonSerializer to parse this out. There is no reason to do this yourself.

If you are not using .NET 3.5, you can use Jayrock.
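For the DataContractJsonSerializer route, a minimal sketch might look like the following. The Item, ItemList and ItemReader types are hypothetical and simply mirror the shape of the question's data. The required assembly reference depends on the framework version (System.ServiceModel.Web in .NET 3.5, System.Runtime.Serialization in later versions, as far as I know). Note that the serializer expects real JSON with quoted property names, so I would expect it to reject the question's test string, which uses unquoted keys.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Hypothetical contract types matching {"items":[{"id":0,"name":"..."}]}.
[DataContract]
public class Item
{
    [DataMember(Name = "id")] public int Id { get; set; }
    [DataMember(Name = "name")] public string Name { get; set; }
}

[DataContract]
public class ItemList
{
    [DataMember(Name = "items")] public Item[] Items { get; set; }
}

public static class ItemReader
{
    // Deserializes a JSON document into an ItemList.
    public static ItemList Parse(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(ItemList));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (ItemList)serializer.ReadObject(stream);
        }
    }
}
```

Calling ItemReader.Parse on a string such as {"items":[{"id":0,"name":"Lorem Ipsum"}]} gives typed objects directly, with no string splitting involved.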




Answer 4:


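using System.Collections.Generic;
using System.Text.RegularExpressions;

// Extracts flat "key":value pairs; nested objects and arrays are not handled.
public Dictionary<string, string> ParseJSON(string s)
{
    Regex r = new Regex("\"(?<Key>[\\w]*)\":\"?(?<Value>([\\s\\w\\d\\.\\\\\\-/:_\\+]+(,[,\\s\\w\\d\\.\\\\\\-/:_\\+]*)?)*)\"?");
    MatchCollection mc = r.Matches(s);

    Dictionary<string, string> json = new Dictionary<string, string>();

    foreach (Match k in mc)
    {
        // The indexer keeps the last value for a repeated key;
        // Add would throw an ArgumentException on a duplicate.
        json[k.Groups["Key"].Value] = k.Groups["Value"].Value;
    }
    return json;
}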

This function implements Lukasz's regular expression. I only added the + character to the value group (because I am using it to parse a Live Connect auth token).




Answer 5:


JSON cannot typically be parsed with regular expressions (certain extremely simplified variants of JSON can, but then they are not JSON but something else).
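For illustration, the flat test string in the question is one of those simplified variants. Here is a minimal sketch, assuming no nested objects and no braces inside string values (FlatArraySplitter and SplitObjects are hypothetical names). The question's arrayParser pattern \{[^\}]\} allows exactly one character between the braces, so it needs a * quantifier. Also, Match.Captures only describes the overall match, whereas Regex.Matches returns one Match per object.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class FlatArraySplitter
{
    // Splits the body of a FLAT array, e.g. {id:0,...},{id:1,...},
    // into one string per object.
    // Assumption: no nested objects and no '{' or '}' inside string values.
    public static string[] SplitObjects(string arrayBody)
    {
        return Regex.Matches(arrayBody, @"\{[^{}]*\}")
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

Called on the 'array' string from the question, SplitObjects should return the three {id:...} strings. A nested object or a } inside a name breaks it, which is why this only works for the simplified variant.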

You need an actual parser to properly parse JSON.

And anyway, why are you trying to parse JSON at all? There are numerous libraries out there which can do it for you, and much better than your code would. Why reinvent the wheel, when there's a wheel factory around the corner with the words FOSS over the door?



Source: https://stackoverflow.com/questions/408570/regular-expression-to-parse-an-array-of-json-objects
