How do I robustly parse malformed CSV?


盖世英雄少女心 2020-12-08 16:29

I'm processing data from government sources (FEC, state voter databases, etc.). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.

3 answers

暗喜 (original poster)

2020-12-08 17:15
You can subclass Ruby's File to preprocess each line of the CSV file before it is passed to Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes (\") with the standard doubled quotes ("") that the parser expects:
```
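require 'csv'

# File subclass that repairs each line before the CSV parser sees it.
class MyFile < File
  def gets(*args)
    line = super
    # Rewrite backslash-escaped quotes (\") as doubled quotes (""),
    # which would otherwise cause a parse error. line is nil at EOF.
    line&.gsub!('\\"', '""')
    line
  end
end

infile = MyFile.open(filename)  # filename: path to the CSV file
incsv = CSV.new(infile)

while (row = incsv.shift)
  # process each row here
end
infile.close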
```
In principle you can do all sorts of additional processing in the same gets override, e.g. UTF-8 cleanups. The nice thing about this approach is that the file is handled line by line, so you don't need to load it all into memory or create an intermediate file.
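As a rough sketch of the UTF-8 cleanup mentioned above, the same override can replace invalid byte sequences before the parser sees them. The names ScrubbingFile and parse_scrubbed are made up for illustration, and the sketch assumes the file is meant to be UTF-8 and that swapping bad bytes for U+FFFD is acceptable for your data; if the source is really Latin-1 or similar, transcoding would be the better fix.

```ruby
require 'csv'

# Hypothetical File subclass (illustrative name): scrubs invalid UTF-8
# byte sequences out of everything the CSV parser reads through gets.
class ScrubbingFile < File
  def gets(*args)
    line = super
    # For UTF-8 strings, scrub! replaces each invalid sequence with U+FFFD.
    # line is nil at end of file, hence the safe-navigation operator.
    line&.scrub!
    line
  end
end

# Hypothetical helper: parses the file at path and returns all rows.
# Assumes the data is intended to be UTF-8.
def parse_scrubbed(path)
  rows = []
  ScrubbingFile.open(path, 'r:UTF-8') do |infile|
    incsv = CSV.new(infile)
    while (row = incsv.shift)
      rows << row
    end
  end
  rows
end
```

Collecting the rows into an array is only for the sake of the example; in a real import you would process each row inside the loop to keep memory use flat.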