I am currently writing a CSV parser. The CSV format is defined by RFC 4180, which specifies it in ABNF, so CSV is certainly a context-free grammar. However, I would like to know whether CSV is also a regular grammar, so that I could parse it with just a finite state machine. Furthermore, if it is a regular grammar and can be parsed by a finite state machine, does that mean it can also be parsed by a regular expression?
I don't have any formal theory available to verify this, but I'm pretty sure CSV files can reliably be parsed with regular expressions. It's probably best to use two regexes, though:
- One regex to match an entire CSV row (including linebreaks in quoted fields)
- Another regex (to be used on the match result of the first one) to match single fields
(unless you're using the .NET regex engine which provides access to individual captures of a repeating capturing group, or unless you know the number of columns in your CSV file beforehand and hard-code that into your regex).
A PCRE regex to match an entire CSV row could be:
/^(?:(?:[^",\r\n]*|"(?:""|[^"]*)*+")(?:,|$))*+(?=$)/m
You need to use the /m modifier here so that ^ and $ match at the start and end of each line, not just of the whole string. If you're processing the file line by line, the regex will fail on a line that's not a complete CSV row (i.e. where a quoted field hasn't been closed yet), so you would need to read the next line, append it to your test string and reapply the regex (you can remove the /m modifier in this scenario). Repeat until it matches.
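As a rough illustration of that read-and-retry loop, here is a minimal Python sketch. It is an adaptation, not the PCRE pattern itself: the possessive quantifiers are dropped (Python versions before 3.11 do not support them), the body of a quoted field is written as (?:[^"]|"")* to keep backtracking cheap, and fullmatch takes the place of the ^ and $ anchors. ROW_RE and read_rows are hypothetical names.

```python
import re

# One field: either a quoted field (with "" as an escaped quote) or a run of
# characters that are not quotes, commas or line breaks.
FIELD = r'(?:"(?:[^"]|"")*"|[^",\r\n]*)'

# A complete row is one field followed by any number of ",field" groups.
# Assumption: this is a simplified rewrite of the PCRE row regex above.
ROW_RE = re.compile(FIELD + r'(?:,' + FIELD + r')*')


def read_rows(lines):
    """Yield complete CSV rows, joining physical lines until the row matches."""
    buffered = []
    for line in lines:
        buffered.append(line.rstrip("\r\n"))
        candidate = "\n".join(buffered)
        if ROW_RE.fullmatch(candidate):
            yield candidate
            buffered = []
    if buffered:
        # Input ended inside a quoted field, or the row is malformed.
        raise ValueError("incomplete or malformed row at end of input")


rows = list(read_rows(['a,"b\n', 'c",d\n', 'e,f\n']))
```

Here the first two physical lines are joined into a single row because the quoted field is only closed on the second line. Re-matching the growing buffer is quadratic in the worst case, which is acceptable for a sketch but not for large multi-line fields.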
Once you have that row, you can use this regex to match each successive field:
/([^",\r\n]*|"(?:""|[^"]*)*+")(?:,|$)/
The match result here also contains the delimiter (, or newline), so the actual field's contents must be extracted from group 1. You will also need to process the surrounding and embedded quotes after the match.
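A minimal Python sketch of that second step, including the quote post-processing, might look like this. It assumes the row has already been isolated, uses \Z instead of $ for the end of the row, and again avoids possessive quantifiers; FIELD_RE and split_fields are hypothetical names.

```python
import re

# Group 1 is the raw field (still including surrounding quotes),
# group 2 is the delimiter: a comma, or the empty string at the end of the row.
FIELD_RE = re.compile(r'("(?:[^"]|"")*"|[^",\r\n]*)(,|\Z)')


def split_fields(row):
    """Split one complete CSV row into unquoted field values."""
    fields = []
    pos = 0
    while True:
        m = FIELD_RE.match(row, pos)
        if m is None:
            raise ValueError(f"malformed field at offset {pos}")
        raw = m.group(1)
        if raw.startswith('"'):
            # Strip the surrounding quotes and undouble embedded quotes.
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
        if m.group(2) == "":
            # The delimiter was the end of the row, so we are done.
            return fields
        pos = m.end()


fields = split_fields('a,"b ""q"" c",,d')
```

The explicit loop (rather than finditer) is a deliberate choice: the pattern can match the empty string at the end of the row, which would otherwise produce a spurious trailing empty field.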
Explanation of the row regex:
^                        # Start of line (/m modifier!)
(?:                      # Start of non-capturing group (to contain the entire line):
  (?:                    # Start of non-capturing group (to contain a single field):
    [^",\r\n]*           # Either match a run of characters except quotes, commas or newlines
  |                      # or
    "                    # Match a quoted field, starting with a quote, followed by
    (?:                  # either...
      ""                 # an escaped quote
    |                    # or
      [^"]*              # anything that's not a quote
    )*+                  # repeated as often as possible, no backtracking allowed
    "                    # Then match a closing quote
  )                      # End of group (=field)
  (?:,|$)                # Match a delimiter or the end of the line
)*+                      # repeated as often as possible, no backtracking allowed
(?=$)                    # Assert that we're now at the end of a line
There is no definite answer to this question because CSV is a very loose format. Among the CSV readers that I have observed, both regular and context-free grammars are in use. For example, some readers throw an exception if anything but a comma follows the closing quote of an enclosed value.
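Python's standard csv module happens to show both behaviours, depending on its strict option, so it makes a convenient illustration of how readers differ (this is just one reader, chosen as an example, not one of the readers referred to above):

```python
import csv
import io

# A quoted field whose closing quote is followed by a stray character.
data = '"a"b,c\n'

# Default (lenient) mode accepts the input and glues the stray text on.
lenient = list(csv.reader(io.StringIO(data)))

# Strict mode rejects the same input with csv.Error.
try:
    list(csv.reader(io.StringIO(data), strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)
```

In lenient mode the first field comes out as ab; in strict mode the reader raises instead.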
You should be able to parse CSV files with a simple finite-state machine. Or, to be more precise, with one of a large number of simple FSMs depending on the precise CSV format. (That doesn't mean it's a good idea. There are CSV parsing libraries which are much better at dealing with all the weird variants and unwritten rules of CSV files you might find in the wild.)
Here are some (untested) flex rules without good error-handling for the simplest CSV-variant:
- fields are separated with ,
- whitespace is not in any way special, except for unquoted newlines, which separate records
- fields which include a double quote ("), a comma (,) or a newline character must be quoted; any field may be quoted
- a " in a quoted field is represented as two " characters
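%%
    /* Indented code before the first rule is local to yylex(). */
    /* Unquoted fields use + so the rule always consumes input;
       empty unquoted fields are therefore skipped, not reported. */
    int record = 1;
    int field = 1;
[^",\n]+/[^"]   { printf("Record %d Field %d: |%s|\n", record, field, yytext); }
[,]             { ++field; }
[\n]            { ++record; field = 1; }
["]([^"]|["]["])*["]/[,\n] {
                  printf("Record %d Field %d: |%s|\n", record, field, yytext); }
.               { printf("Something bad happened in record %d field %d\n",
                         record, field); }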
That doesn't handle quoted strings properly (i.e., it doesn't strip the quotes or undouble doubled quotes).
The simplest way to handle quoted fields is with a start condition (which is still implemented as part of an FSM):
%x QUOTED
%%
    /* Indented code before the first rule is local to yylex(). */
    int record = 1;
    int field = 1;
[^",\n]+/[^"]      { printf("Record %d Field %d: |%s|\n", record, field, yytext); }
[,]                { ++field; }
[\n]               { ++record; field = 1; }
["]                { printf("Record %d Field %d: |", record, field); BEGIN(QUOTED); }
<QUOTED>[^"]+      { printf("%s", yytext); }
<QUOTED>["]["]     { putchar('"'); }
<QUOTED>["]/[,\n]  { putchar('|'); putchar('\n'); BEGIN(INITIAL); }
<*>.               { printf("Something bad happened in record %d field %d\n",
                            record, field); }
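For comparison, the same idea can be written as an explicit hand-coded state machine with four states. The following Python sketch assumes the simple variant described above (comma separator, \n as record terminator, doubled quotes inside quoted fields). The State names and parse_csv are made up for illustration, and the function collects fields instead of printing them.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()       # nothing read yet for the current field
    UNQUOTED = auto()          # inside an unquoted field
    QUOTED = auto()            # inside a quoted field
    QUOTE_IN_QUOTED = auto()   # just saw a quote inside a quoted field


def parse_csv(text):
    records, record, field = [], [], []
    state = State.FIELD_START

    def end_field():
        record.append("".join(field))
        field.clear()

    def end_record():
        end_field()
        records.append(list(record))
        record.clear()

    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)          # commas and newlines are literal here
        elif state is State.QUOTE_IN_QUOTED and ch == '"':
            field.append('"')             # doubled quote: one literal quote
            state = State.QUOTED
        elif ch == ',':
            end_field()
            state = State.FIELD_START
        elif ch == '\n':
            end_record()
            state = State.FIELD_START
        elif ch == '"' and state is State.FIELD_START:
            state = State.QUOTED
        elif state is State.QUOTE_IN_QUOTED or ch == '"':
            raise ValueError(f"unexpected {ch!r} in record {len(records) + 1}")
        else:
            field.append(ch)
            state = State.UNQUOTED

    if state is State.QUOTED:
        raise ValueError("unterminated quoted field")
    if record or field or state is not State.FIELD_START:
        end_record()                      # last record had no trailing newline
    return records


parsed = parse_csv('a,"b,""c""\nd",e\n,\n')
```

The amount of state is fixed (one of four values) no matter how long the input is, which is what makes this a finite-state recognizer for the per-record syntax.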
So the theory-based answer is No, the CSV file format is not a regular language (based on that RFC).
The main reason that it is not is based on this line from the specification:
Each line should contain the same number of fields throughout the file.
To formally prove that the file format is not a regular language, you would use the pumping lemma for regular languages.
Consider the string consisting of 2 lines, each containing p commas (where p is the pumping length from the pumping lemma), so that every line has p + 1 empty cells. For p = 3 it would be ",,,\n,,,\n". To satisfy the conditions |xy| <= p and |y| >= 1, "y" must consist of one or more commas in the first line of the file. If you then "pump" the y, you will have more cells on your first line than on your second, so the pumped string is no longer in the language. Therefore, it is not a regular language.
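To make the pumping step concrete, here is a small Python check for one fixed p. It only illustrates the argument and is not a proof (a proof has to hold for every p). The helper names are hypothetical, and all fields are assumed to be unquoted, so counting commas is enough.

```python
def field_counts(text):
    """Number of fields on each newline-terminated line (unquoted fields only)."""
    return [line.count(",") + 1 for line in text.split("\n")[:-1]]


def is_rectangular(text):
    """True if every line has the same number of fields."""
    return len(set(field_counts(text))) <= 1


def pumped_strings(s, p, i=2):
    """Yield x + y*i + z for every split s = xyz with |xy| <= p and |y| >= 1."""
    for xy_len in range(1, p + 1):
        for x_len in range(xy_len):
            x, y, z = s[:x_len], s[x_len:xy_len], s[xy_len:]
            yield x + y * i + z


p = 5
s = "," * p + "\n" + "," * p + "\n"   # two lines of p commas each

original_ok = is_rectangular(s)
# Every allowed choice of y lies inside the first line's commas, so pumping it
# always makes the first line longer than the second.
all_break = all(not is_rectangular(t) for t in pumped_strings(s, p))
```

Because the first p characters of s are all commas, no admissible split can touch the newline or the second line, which is why every pumped string ends up with uneven rows.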
However, as is often the case, the theoretical answer is likely not what you really need. For one, many regular expression syntaxes (and finite state machine syntaxes) in many programming languages actually support more than true regular languages.
Also, just because you can't verify if a string truly conforms to the CSV spec with a true regular expression does not mean that you can't still parse it with one. You may just accept slightly malformed CSV files (such as ones that have uneven row lengths).
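One way to read this in practice: let the regular part (the per-record syntax) be handled by the parser, and check the row-length condition as a separate counting step afterwards. A minimal Python sketch, using the standard csv module as a stand-in for whatever regex or FSM parser you write (parse_and_check is a hypothetical name):

```python
import csv
import io


def parse_and_check(text):
    """Parse leniently, then report whether all rows have the same width."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    widths = {len(row) for row in rows}
    return rows, len(widths) <= 1


rows, uniform = parse_and_check("a,b\nc\n")
```

Here the uneven file is still parsed into rows; the caller then decides whether a False result for the width check is an error or merely a warning.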
Source: https://stackoverflow.com/questions/24567071/is-csv-format-regular-grammar-or-context-free-grammar