How do I robustly parse malformed CSV?

盖世英雄少女心 2020-12-08 16:29

I'm processing data from government sources (FEC, state voter databases, etc.). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.

3 answers
  •  暗喜 (original poster)
    2020-12-08 17:15

    It is possible to subclass Ruby's File to process each line of the CSV file before it is passed to Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes (\") with standard doubled double-quotes (""):

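    require 'csv'

    # File subclass that repairs each line before the CSV parser reads it.
    class MyFile < File
      def gets(*args)
        line = super
        # Replace \" with "" so the parser does not raise a parse error.
        line.gsub!('\\"', '""') unless line.nil?
        line
      end
    end

    infile = MyFile.open(filename)  # filename: path to the malformed CSV
    incsv = CSV.new(infile)

    while (row = incsv.shift)
      # process each row here
    end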
    

    In principle you could do all sorts of additional per-line processing here, e.g. UTF-8 cleanups. The nice thing about this approach is that the file is handled line by line, so you don't need to load it all into memory or create an intermediate file.
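As a rough illustration of the UTF-8 cleanup idea, here is a minimal sketch of a variant that cleans each line and then parses it on its own with CSV.parse_line, so it does not rely on how CSV reads from the IO object internally. The helper names clean_line and each_clean_row are hypothetical (they are not part of the answer above or of the csv library), and the sketch assumes that no quoted field spans more than one physical line.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: normalise one raw line before it is parsed.
def clean_line(line)
  return nil if line.nil?
  # Work on a copy, treat the bytes as UTF-8, and replace invalid byte
  # sequences with "?" (String#scrub).
  cleaned = line.dup.force_encoding(Encoding::UTF_8).scrub('?')
  # Same fix as in the answer: turn \" into the standard "".
  cleaned.gsub('\\"', '""')
end

# Hypothetical helper: yield one parsed row per physical line of io.
# Assumption: no quoted field contains an embedded newline.
def each_clean_row(io)
  io.each_line do |raw|
    yield CSV.parse_line(clean_line(raw))
  end
end

# Small demo on in-memory data; a File opened for reading works the same way.
sample = StringIO.new('"Smith","said \"hi\""' + "\n")
each_clean_row(sample) { |row| p row }
```

Like the File subclass, this reads one line at a time, so memory use stays flat. The trade-off is that records containing line breaks inside quoted fields would be split incorrectly, so for such data the subclassing approach (or a pre-cleaning pass) may be the better fit.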
