I\'m processing data from government sources (FEC, state voter databases, etc). It\'s inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.>
It is possible to subclass Ruby's File to process each line of the the CSV file before it is passed to the Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes \" with standard double-quotes ""
class MyFile < File
def gets(*args)
line = super
if line != nil
line.gsub!('\\"','""') # fix the \" that would otherwise cause a parse error
end
line
end
end
infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
# process each row here
end
You could in principle do all sorts of additional processing, e.g. UTF-8 cleanups. The nice thing about this approach is you handle the file on a line by line basis, so you don't need to load it all into memory or create an intermediate file.