Parsing an iCalendar file in C

问题

I am looking to parse iCalendar files using C. I have an existing structure setup and reading in all ready and want to parse line by line with components.

For example I would need to parse something like the following:

UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party

Here are some of the rules:

The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon then the property value will be directly to the right of the content to the end of the line
A further layer of complexity is added here as a comma separated list of values are allowed that would then be stored in an array. So the CATEGORIES one for example would have 3 elements in an array for the values
If after the property name a semi colon is there, then there are optional parameters that follow
The optional parameter format is ParamName=ParamValue. Again a comma separated list is supported here.
There can be more than one optional parameter as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotations are allowed in the values. If something is in quotes for the value it would need to be treated as part of the value instead of being part of the syntax. So a semicolon in a quotation would not mean that there is another parameter it would be part of the value.

I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.

How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)

回答1:

This answer is supposing that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser because they have already thought of and handled all the weird things that can come up.

My high level approach would be:

Read a line
Pass pointer to start of this line to a function parse_line:
- Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
- Save the text so far as the property name
- While the parsing pointer points to ;:
  - Call a function extract_name_value_pair passing address of your parsing pointer.
  - That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value and the fact that their might be ; or : in the value
- (At this point the parsing pointer is always on :)
- Pass the rest of the string to a function parse_csv which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.

The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.

Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.

You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.

When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:

Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length

Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.

Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:

Accept pointer to character, Return the number of characters consumed; calling function will add the two together to know what happened
Accept pointer to pointer to character, and update the pointer to character. Return value could then be used for an error code.

There's no one right answer, with experience you will get better at deciding which option leads to the cleanest code.

来源：https://stackoverflow.com/questions/35024861/parsing-an-icalendar-file-in-c

标签

regex

parsing

icalendar