How do I robustly parse malformed CSV?

盖世英雄少女心 2020-12-08 16:29

I'm processing data from government sources (FEC, state voter databases, etc.). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.

3 answers
  •  佛祖请我去吃肉
    2020-12-08 17:08

    First, here is a rather naive attempt: http://rubular.com/r/gvh3BJaNTc

    /"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)/m
    

    The assumptions here are:

    • A field may start with quotes. In which case, it should end with a quote that is either:
      • before a comma
      • before a new line (if it is last field on its line)
      • before the end of the file (if it is last field on the last line)
    • Or, its first character is not a quote, so it contains characters until the same condition as before is met.
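
To make those assumptions concrete, here is a minimal sketch of the pattern applied with Ruby's `String#scan` (Rubular tests Ruby regexes, so Ruby is assumed here). The names `FIELD` and `naive_fields` and the sample line are made up for illustration and are not part of the original question.

```ruby
# The pattern from above. In Ruby, /m lets "." match newlines,
# and "$" matches at the end of every line.
FIELD = /"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)/m

# Hypothetical helper: returns every field the pattern finds, in order.
# scan yields [quoted, unquoted] pairs; exactly one of them is non-nil.
def naive_fields(text)
  text.scan(FIELD).map { |quoted, unquoted| quoted || unquoted }
end

# Invented sample: a quoted comma and unescaped inner quotes.
sample = %(a,"b, with comma","say "hi" twice",d\n)
p naive_fields(sample)
# => ["a", "b, with comma", "say \"hi\" twice", "d"]
```

Note that the pattern never matches an empty field, so `a,,b` yields only two values, and the result is a flat list with no row boundaries. Treat it as a field tokenizer to experiment with, not a complete parser.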

    This almost does what you want, but fails on these fields:

    1 comma and
    linebreaks"
    

    As TC pointed out in the comments, your text is ambiguous. I'm sure you already know this, but for completeness:

    • "a" - is that a or "a"? How do you represent a value that you want to be wrapped in quotes?
    • "1","2" - might be parsed as 1,2, or as 1","2 - both are legal.
    • ,1 \n 2, - End of line, or newline in the value? You cannot tell, especially if this is supposed to be the last value of its line.
    • 1 \n 2 \n 3 - One value with newlines? Two values (1\n2,3 or 1,2\n3)? Three values?

    You may be able to get some clues by examining the first value on each row, which, as you have said, should tell you the number of columns and their types. That gives you some of the information you are missing to parse the file: for example, if you know there should be another field on this line, then all newlines belong to the current value. Even then, though, it looks like there are serious problems here...
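
The column-count idea can be sketched roughly as follows, again in Ruby. It assumes you already know the expected number of fields per record; `regroup` and `count_fields` are hypothetical helpers, and `count_fields` is a deliberately crude stand-in that counts commas and ignores quoting.

```ruby
# Crude field counter (assumption: no commas inside values).
# Swap in a smarter tokenizer for real data.
def count_fields(text)
  text.count(",") + 1
end

# Glue physical lines together until the record has enough fields.
def regroup(lines, expected)
  records = []
  buffer = nil
  lines.each do |line|
    buffer = buffer ? "#{buffer}\n#{line}" : line
    if count_fields(buffer) >= expected
      records << buffer
      buffer = nil
    end
  end
  records << buffer if buffer # incomplete trailing record, kept for inspection
  records
end

lines = ["1,first line", "second line,3", "4,5,6"]
p regroup(lines, 3)
# => ["1,first line\nsecond line,3", "4,5,6"]
```

This only resolves newlines that occur before the last field. A newline inside the last value of a row stays ambiguous, because the record already looks complete when the line ends.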
